Posted to user@hive.apache.org by Kjell Tore Fossbakk <kj...@gmail.com> on 2015/04/22 14:50:28 UTC

Parsing and moving data to ORC from HDFS

Hello user@hive.apache.org

I have about 100 TB of data, approximately 180 billion events, in my HDFS
cluster. It is my raw data stored as GZIP files. At the time of setup this
was due to "saving the data" until we figured out what to do with it.

After attending @t3rmin4t0r's ORC 2015 session @hadoopsummit in Brussels
last week I was amazed by the results presented. I did test Hive with ORC
some time around May - August last year but had some issues with e.g.
partitioning, bucketing and streaming data into ORC while also updating the
row indexes. In addition, @t3rmin4t0r also presented the use of bloom
filters.

I have decided I will move my raw-data into HIVE using ORC and zlib. How
would you guys recommend I would do that? We have a setup for our stream
processing which takes the same data and puts it into Kafka. Then one Storm
topology parses each event into a JSON format which we move back to another
Kafka topic. We then consume this parsed_topic to put the data into e.g.
Elasticsearch etc.

Due to the size of my data I only have 2-3 weeks of data in
Kafka. So it is not an option to just reset the offsets and use Storm on
the data inside Kafka to stream it to Hive/ORC. I think with regards to
speed, map-reduce would probably do this faster than pushing it through
Storm. However, later I will add a Storm topology to read the newly
created events from the parsed_topic and stream them into Hive/ORC.

My options:
1) write a map-reduce job which reads the GZIP files in HDFS, uses my
Java libs to parse each event line, and writes the events to Hive/ORC.
2) write a Storm topology to read the parsed_topic and stream the events to
Hive/ORC. Which also means I would need something that reads the
GZIP files from HDFS and puts them into Kafka so that all of my on-disk
events get processed.
3) use Spark instead of map-reduce. Only, I don't see any benefits in doing
so for this scenario.

Thoughts? Insight?

Thanks,
Kjell Tore

Re: Parsing and moving data to ORC from HDFS

Posted by Gopal Vijayaraghavan <go...@hortonworks.com>.
> In production we run HDP 2.2.4. Any thoughts on when crazy stuff like bloom
>filters might move to GA?

I'd say that it will be in the next release, considering it is already
checked into hive-trunk.

Bloom filters aren't too crazy today. They are written within the ORC file
right next to the row-index data, so there are no staleness issues, and
beyond that they're fairly well-understood structures.
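
For reference, a minimal sketch of how the bloom filters are switched on per
column via ORC table properties (the table and column names below are just
placeholders):

-- bloom filters are declared per column through table properties
CREATE TABLE events_orc (
  event_time TIMESTAMP,
  source_ip  STRING,
  payload    STRING
)
STORED AS ORC
TBLPROPERTIES (
  'orc.compress'='ZLIB',
  'orc.bloom.filter.columns'='source_ip',
  'orc.bloom.filter.fpp'='0.05'
);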

I'm working through "bad use" safety scenarios like someone searching for
"11" (as a string) in a data-set which contains doubles.

Hive FilterOperator casts this dynamically, but the ORC PPD has to do
those type promotions exactly as Hive would do in FilterOperator throughout
the bloom filter checks.
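
As a rough illustration of that scenario (the table and column names are made
up), the mismatch shows up in a filter like this once ORC PPD is enabled:

-- ORC row-group elimination (PPD) is driven by this flag
SET hive.optimize.index.filter=true;

-- amount is a DOUBLE column; '11' is a string literal, so the ORC-side
-- checks have to promote types exactly like the FilterOperator does
SELECT count(*) FROM some_table WHERE amount = '11';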

Calling something production-ready needs that sort of work, rather than
the feature's happy path of best performance.
 

> The data is single-line text events. Nothing fancy, no multiline or any
>binary. Each event is 200 - 800 bytes long.
> These events come in 5 formats (depending on which application
>produces them) and none are JSON. I wrote a small lib with 5 Java classes
> which implement parse(String raw) and return a JSONObject - used in
>my Storm bolts.

You could define that as a regular 1-column TEXTFILE table and use a non-present
character as the delimiter (like ^A), which means you should be able to do
something like

select x.a, x.b, x.c from (select parse_my_format(line) as x from
raw_text_table) t;

A UDF is massively easier to write than a SerDe.

I effectively do something similar with get_json_object() to extract 1
column out (FWIW, Tez SimpleHistoryLogging writes out a Hive table).
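
Putting that together, a rough sketch under those assumptions (parse_my_format,
parsed_orc and the HDFS path are placeholders, and the UDF is assumed to return
a struct with fields a, b, c):

-- one raw event line per row; ^A (\001) never occurs in the data, and the
-- gzipped files under the location are read natively by Hive
CREATE EXTERNAL TABLE raw_text_table (line STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'
STORED AS TEXTFILE
LOCATION '/data/raw/gzip';

-- parsed_orc is an ORC table created beforehand with columns a, b, c
INSERT INTO TABLE parsed_orc
SELECT x.a, x.b, x.c
FROM (SELECT parse_my_format(line) AS x FROM raw_text_table) t;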
 

> So I need to write my own format reader, a custom SerDe - specifically
>the Deserializer part? Then 5 schema-on-read external tables using my
>custom SerDe.
...
> That doesn't sound too bad! I expect bugs :)

Well, the UDF returning a Struct is an alternative to writing a SerDe.

> This is all just to catch up and clean out our historical garbage bin of
>data which piled up while we got Kafka - Storm - Elasticsearch running :-)

One problem at a time, I guess.

If any of this needs help, that's the sort of thing this list exists for.

Cheers,
Gopal


Re: Parsing and moving data to ORC from HDFS

Posted by Kjell Tore Fossbakk <kj...@gmail.com>.
Hey Gopal.
Thanks for your answers. I have some follow-ups:


On Wed, Apr 22, 2015 at 3:46 PM, Gopal Vijayaraghavan <go...@apache.org>
wrote:

>
> > I have about 100 TB of data, approximately 180 billion events, in my
> >HDFS cluster. It is my raw data stored as GZIP files. At the time of
> >setup this was due to "saving the data" until we figured out what to do
> >with it.
> >
> > After attending @t3rmin4t0r's ORC 2015 session @hadoopsummit in Brussels
> >last week I was amazed by the results presented.
>
> I run at the very cutting edge of the builds all the time :)
>
> The bloom filters are there in hive-1.2.0 which is currently sitting in
> svn/git today.
>
>
In production we run HDP 2.2.4. Any thoughts on when crazy stuff like bloom
filters might move to GA?


>
> > I have decided I will move my raw-data into HIVE using ORC and zlib. How
> >would you guys recommend I would do that?
>
> The best mechanism is always to write it via a Hive SQL ETL query.
>
> The real question is how the events are exactly organized. Is it a flat
> structure with something like a single line of JSON for each data item?
>

The data is single-line text events. Nothing fancy, no multiline or any
binary. Each event is 200 - 800 bytes long. These events come in 5 formats
(depending on which application produces them) and none are JSON. I wrote
a small lib with 5 Java classes which implement parse(String raw) and
return a JSONObject - used in my Storm bolts.


>
> That is much easier to process than other data formats - the gzipped
> data can be natively read by Hive without any trouble.
>
> The Hive-JSON-Serde is very useful for that, because it allows you to read
> random data out of the system - each "view" would be an external table
> enforcing a schema onto a fixed data set (including maps/arrays).
>
> You would create maybe 3-4 of these schema-on-read tables, then insert
> into your ORC structures from those tables.
>
> If you had binary data, it would be much easier to write a converter
> to JSON and then follow the same process, instead of attempting a
> direct ORC writer, if you want more than one view of the same data using
> external tables.
>

So I need to write my own format reader, a custom SerDe - specifically the
Deserializer part? Then 5 schema-on-read external tables using my custom
SerDe. And then again 5 new tables created as select statements from each
of the 5 external tables stored as ORC with e.g. zlib?

That doesn't sound too bad! I expect bugs :)
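
Roughly what I have in mind for one of the 5 formats - a sketch only, with
placeholder table names and a made-up SerDe class:

-- schema-on-read external table over the raw gzip files for one format
CREATE EXTERNAL TABLE type1_raw (
  event_time STRING,
  source     STRING,
  message    STRING
)
ROW FORMAT SERDE 'com.example.hive.serde.Type1SerDe'
STORED AS TEXTFILE
LOCATION '/data/raw/type1';

-- materialize it as ORC with zlib
CREATE TABLE type1_orc
STORED AS ORC
TBLPROPERTIES ('orc.compress'='ZLIB')
AS SELECT * FROM type1_raw;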


> > 2) write a storm-topology to read the parsed_topic and stream them to
> >Hive/ORC.
>
> You need to effectively do that to keep a live system running.
>
> We've had some hiccups with the ORC feeder bolt earlier with the <2s ETL
> speeds (see
> https://github.com/apache/storm/tree/master/external/storm-hive).
>
> That needs some metastore tweaking to work perfectly (tables to be marked
> transactional etc), but nothing beyond config params.
>

Ok. Thanks for the tip. We put all the data into Elasticsearch, which is
searchable for N days. Thus, it is not a big priority to have the data in
Hive as quickly as possible, but Storm does seem like a nice place to put my
code to do this.

When we're able to stream parsed JSON data from Kafka into Hive/ORC we'll
stop putting raw gzip data onto HDFS.

This is all just to catch up and clean out our historical garbage bin of data
which piled up while we got Kafka - Storm - Elasticsearch running :-)


> > 3) use spark instead of map-reduce. Only, I don't see any benefits in
> >doing so with this scenario.
>
> The ORC writers in Spark (even if you merge the PR SPARK-2883) are really
> slow because they are built against hive-13.x (which was my "before"
> comparison in all my slides).
>

Yes. I did play with Hive before 13.x, without Tez. I liked your
performance data better :-)


>
> I really wish they'd merge those changes into a release, so that I could
> make ORC+Spark fast.
>
>
> Cheers,
> Gopal
>
>
>
Thanks,
Kjell Tore

Re: Parsing and moving data to ORC from HDFS

Posted by Gopal Vijayaraghavan <go...@apache.org>.
> I have about 100 TB of data, approximately 180 billion events, in my
>HDFS cluster. It is my raw data stored as GZIP files. At the time of
>setup this was due to "saving the data" until we figured out what to do
>with it.
>
> After attending @t3rmin4t0r's ORC 2015 session @hadoopsummit in Brussels
>last week I was amazed by the results presented.

I run at the very cutting edge of the builds all the time :)

The bloom filters are there in hive-1.2.0 which is currently sitting in
svn/git today.


> I have decided I will move my raw-data into HIVE using ORC and zlib. How
>would you guys recommend I would do that?

The best mechanism is always to write it via a Hive SQL ETL query.

The real question is how the events are exactly organized. Is it a flat
structure with something like a single line of JSON for each data item?

That is much easier to process than other data formats - the gzipped
data can be natively read by Hive without any trouble.

The Hive-JSON-Serde is very useful for that, because it allows you to read
random data out of the system - each "view" would be an external table
enforcing a schema onto a fixed data set (including maps/arrays).

You would create maybe 3-4 of these schema-on-read tables, then insert
into your ORC structures from those tables.
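
Something along these lines, as a sketch (table names, columns and paths are
placeholders; the SerDe class assumes the openx Hive-JSON-Serde jar is
available):

-- schema-on-read table over the single-line JSON events
CREATE EXTERNAL TABLE events_json (
  ts      STRING,
  host    STRING,
  details MAP<STRING,STRING>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE
LOCATION '/data/parsed/json';

-- then the ETL query into a pre-created ORC table with matching columns
INSERT INTO TABLE events_json_orc
SELECT ts, host, details FROM events_json;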

If you had binary data, it would be much easier to write a converter
to JSON and then follow the same process, instead of attempting a
direct ORC writer, if you want more than one view of the same data using
external tables.

> 2) write a storm-topology to read the parsed_topic and stream them to
>Hive/ORC.

You need to effectively do that to keep a live system running.

We've had some hiccups with the ORC feeder bolt earlier with the <2s ETL
speeds (see 
https://github.com/apache/storm/tree/master/external/storm-hive).

That needs some metastore tweaking to work perfectly (tables to be marked
transactional etc), but nothing beyond config params.
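
For reference, a sketch of what the streaming target table tends to look like
(names are placeholders) - Hive streaming ingest needs a bucketed ORC table
with transactions enabled:

-- the storm-hive bolt streams into a bucketed, transactional ORC table;
-- hive.txn.manager must be set to DbTxnManager and the compactor enabled
-- in hive-site.xml for this to work
CREATE TABLE events_stream_orc (
  ts      STRING,
  host    STRING,
  payload STRING
)
PARTITIONED BY (dt STRING)
CLUSTERED BY (host) INTO 8 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');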

> 3) use spark instead of map-reduce. Only, I don't see any benefits in
>doing so with this scenario.

The ORC writers in Spark (even if you merge the PR SPARK-2883) are really
slow because they are built against hive-13.x (which was my "before"
comparison in all my slides).

I really wish they'd merge those changes into a release, so that I could
make ORC+Spark fast.


Cheers,
Gopal 



Re: Parsing and moving data to ORC from HDFS

Posted by Kjell Tore Fossbakk <kj...@gmail.com>.
It is worth mentioning that it is 100 TB raw size, approximately 19 TB with
gzip -9 (best/slowest compression).

On Wed, Apr 22, 2015 at 2:50 PM, Kjell Tore Fossbakk <kj...@gmail.com>
wrote:

> Hello user@hive.apache.org
>
> I have about 100 TB of data, approximately 180 billion events, in my HDFS
> cluster. It is my raw data stored as GZIP files. At the time of setup this
> was due to "saving the data" until we figured out what to do with it.
>
> After attending @t3rmin4t0r's ORC 2015 session @hadoopsummit in Brussels
> last week I was amazed by the results presented. I did test Hive with ORC
> some time around May - August last year but had some issues with e.g.
> partitioning, bucketing and streaming data into ORC while also updating the
> row indexes. In addition, @t3rmin4t0r also presented the use of bloom
> filters.
>
> I have decided I will move my raw-data into HIVE using ORC and zlib. How
> would you guys recommend I would do that? We have a setup for our stream
> processing which takes the same data and puts it into Kafka. Then one Storm
> topology parses each event into a JSON format which we move back to another
> Kafka topic. We then consume this parsed_topic to put the data into e.g.
> Elasticsearch etc.
>
> Due to the size of my data I only have 2-3 weeks of data in
> Kafka. So it is not an option to just reset the offsets and use Storm on
> the data inside Kafka to stream it to Hive/ORC. I think with regards to
> speed, map-reduce would probably do this faster than pushing it through
> Storm. However, later I will add a Storm topology to read the newly
> created events from the parsed_topic and stream them into Hive/ORC.
>
> My options:
> 1) write a map-reduce job which reads the GZIP files in HDFS, uses my
> Java libs to parse each event line, and writes the events to Hive/ORC.
> 2) write a Storm topology to read the parsed_topic and stream the events to
> Hive/ORC. Which also means I would need something that reads the
> GZIP files from HDFS and puts them into Kafka so that all of my on-disk
> events get processed.
> 3) use Spark instead of map-reduce. Only, I don't see any benefits in doing
> so for this scenario.
>
> Thoughts? Insight?
>
> Thanks,
> Kjell Tore
>