Posted to user@hadoop.apache.org by Wolfgang Wyremba <wo...@hotmail.com> on 2013/09/30 09:40:44 UTC

File formats in Hadoop: Sequence files vs AVRO vs RC vs ORC

Hello,

the file format topic is still confusing me and I would appreciate it if you
could share your thoughts and experiences with me.

From reading different books/articles/websites, I understand that
- Sequence files (used frequently, but not only, for binary data),
- AVRO,
- RC (developed to work best with Hive - columnar storage) and
- ORC (a successor of RC to give Hive another performance boost - Stinger
initiative)
are all container file formats to solve the "small files problem" and all
support compression and splitting.
Additionally, each file format was developed with specific features/benefits
in mind.

Imagine I have the following text source data:
- 1 TB of XML documents (a few million small files)
- 1 TB of JSON documents (a few hundred thousand medium-sized files)
- 1 TB of Apache log files (a few thousand bigger files)

How should I store this data in HDFS to process it using Java MapReduce,
Pig and Hive?
I want to use the best tool for my specific problem - with the "best"
performance of course - i.e. maybe one problem on the Apache log data is
best solved using Java MapReduce, another one using Hive or Pig.

Should I simply put the data into HDFS as it comes - i.e. as
plain text files?
Or should I convert all my data to a container file format like sequence
files, AVRO, RC or ORC?

Based on this example, I believe
- the XML documents will need to be converted to a container file format
to overcome the "small files problem",
- the JSON documents could/should not be affected by the "small files
problem",
- the Apache log files should definitely not be affected by the "small files
problem", so they could be stored as plain text files.

So, some source data needs to be converted to a container file format,
other data not necessarily.
But what is really advisable?

Is it advisable to store all data (XML, JSON, Apache logs) in one specific
container file format in the cluster - let's say you decide to use sequence
files?
Having only one file format in HDFS is of course a benefit in terms of
managing the files and writing Java MapReduce/Pig/Hive code against it.
Sequence files are certainly not a bad idea in this case, but Hive queries
would probably benefit more from, say, RC/ORC.

Therefore, is it better to use a mix of plain text files and/or one or more
container file formats simultaneously?

I know that there will be no crystal-clear answer here as it always
"depends", but what approach should be taken here, or what is usually used
in the community out there?

I welcome any feedback and experiences you can share.

Thanks


RE: File formats in Hadoop: Sequence files vs AVRO vs RC vs ORC

Posted by java8964 java8964 <ja...@hotmail.com>.
I am also thinking about this for my current project, so I will share some of my thoughts here, though some of them may not be correct.

1) In previous projects years ago we stored a lot of data as plain text, because at that time people thought big data clusters could hold everything and there was no need to worry about space - until we ran out of space in the cluster very quickly :-). So lesson number one: don't store the data as plain text files.

2) To compress text data you need a container for the file, like sequence files, Avro or protobuf. I haven't used RC/ORC before, so it will be interesting to learn more about them later.

3) I did a benchmark earlier on the data sets we are using, and there are also web pages with benchmark results that you can google. I believe their performance is close. The real questions are:
   a) language support
   b) flexibility of the serialization format
   c) how easily it can be used in tools like Pig/Hive
   d) how well it is supported in Hadoop

From my experience, sequence files are not well supported outside of Java, and a sequence file is just key/value storage. If your data has a nested structure, like your XML/JSON data, you still need a serialization format such as Google protobuf or Avro to handle it. Storing raw XML/JSON directly in HDFS is really not a good idea: any InputFormat that supports splitting them requires a strict format of the data, and compression won't work very nicely on that kind of data either.

We originally used Google protobuf a lot, since Twitter released elephant-bird as open source to support it in Hadoop, which was a big plus at the time. But recently we have also started to consider Avro seriously, as it is better supported directly in Hadoop. I also like that it offers both schema-less (generic) and schema-based (specific) records, which gives us some flexibility in designing MR jobs.
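
Since Avro comes up here, below is a minimal sketch of writing and reading an
Avro container file with generic records; the schema, field names and file
name are made up purely for illustration, not taken from any real project.

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroContainerDemo {
  public static void main(String[] args) throws Exception {
    // Hypothetical schema, just to have something concrete to write.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"LogEvent\",\"fields\":["
      + "{\"name\":\"host\",\"type\":\"string\"},"
      + "{\"name\":\"status\",\"type\":\"int\"}]}");

    File file = new File("events.avro");

    // Write a compressed Avro container file (the schema is embedded in it).
    try (DataFileWriter<GenericRecord> writer =
             new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
      writer.setCodec(CodecFactory.deflateCodec(6));
      writer.create(schema, file);
      GenericRecord rec = new GenericData.Record(schema);
      rec.put("host", "example.org");
      rec.put("status", 200);
      writer.append(rec);
    }

    // Read it back without any generated classes - generic records only.
    try (DataFileReader<GenericRecord> reader =
             new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
      for (GenericRecord rec : reader) {
        System.out.println(rec.get("host") + " " + rec.get("status"));
      }
    }
  }
}
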
Thanks
Yong

Re: File formats in Hadoop: Sequence files vs AVRO vs RC vs ORC

Posted by Rahul Bhattacharjee <ra...@gmail.com>.
Sequence files are language neutral, like Avro. Yes, but I am not sure about
the support in other languages' libraries for processing seq files.

Thanks,
Rahul



Re: File formats in Hadoop: Sequence files vs AVRO vs RC vs ORC

Posted by Peyman Mohajerian <mo...@gmail.com>.
It is not recommended to keep data at rest in the sequence file format,
because it is Java specific and you cannot easily share it with other,
non-Java systems; it is ideal for running map/reduce jobs, though. One
approach would be to bring all the data of the different formats into HDFS as
is and then convert it to a single format that works best for you, depending
on whether you will export this data again or not (in addition to many other
considerations). But as already mentioned, Hive can directly read any of these
formats.
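
As a minimal sketch of that "convert after landing" step, a map-only MapReduce
job can rewrite raw text input into block-compressed sequence files; the codec
and the paths passed on the command line are placeholders, not recommendations.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class TextToSeqFile {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "text-to-seqfile");
    job.setJarByClass(TextToSeqFile.class);

    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);

    // Identity map-only job: TextInputFormat's (offset, line) pairs are
    // written straight into the sequence file.
    job.setNumReduceTasks(0);
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);

    SequenceFileOutputFormat.setCompressOutput(job, true);
    SequenceFileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
    SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);

    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. raw text input
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. seqfile output dir
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
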


Re: File formats in Hadoop: Sequence files vs AVRO vs RC vs ORC

Posted by Raj K Singh <ra...@gmail.com>.
For XML file processing, Hadoop comes with a class for this purpose called
StreamXmlRecordReader. You can use it by setting your input format to
StreamInputFormat and setting the stream.recordreader.class property to
org.apache.hadoop.streaming.StreamXmlRecordReader.
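
A minimal driver sketch using the older mapred API (which is where
StreamInputFormat lives); the begin/end tags and the identity map-only setup
are assumptions for illustration only.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.streaming.StreamInputFormat;

public class XmlIngestDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(XmlIngestDriver.class);
    conf.setJobName("xml-ingest");

    // StreamInputFormat instantiates the record reader named here; each
    // <document>...</document> block then reaches the mapper as one record.
    conf.set("stream.recordreader.class",
        "org.apache.hadoop.streaming.StreamXmlRecordReader");
    conf.set("stream.recordreader.begin", "<document>");
    conf.set("stream.recordreader.end", "</document>");

    conf.setInputFormat(StreamInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    conf.setNumReduceTasks(0); // identity map-only job, just to show the parsing

    FileInputFormat.addInputPath(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}
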

For JSON files, ElephantBird - an open-source project that contains some
useful utilities for working with LZO compression - has an LzoJsonInputFormat
which can read JSON, but it requires the input files to be LZOP compressed.
You can use that code as a template for your own JSON InputFormat which does
not have the LZOP compression requirement.
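
If your JSON documents are line-delimited and not LZO compressed, a simpler
alternative sketch is to keep TextInputFormat and parse each line in the
mapper; Jackson is assumed to be on the classpath and the field name "id" is
made up.

import java.io.IOException;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class JsonLineMapper extends Mapper<LongWritable, Text, Text, Text> {
  private final ObjectMapper json = new ObjectMapper();

  @Override
  protected void map(LongWritable offset, Text line, Context ctx)
      throws IOException, InterruptedException {
    JsonNode doc = json.readTree(line.toString());
    // Key by some field of interest and pass the whole document through.
    ctx.write(new Text(doc.path("id").asText()), new Text(doc.toString()));
  }
}
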

If you are dealing with small files, then the sequence file format comes to
the rescue: it stores sequences of binary key-value pairs. Sequence files are
well suited as a format for MapReduce data since they are splittable and
support compression.
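
A minimal sketch of packing a directory of small files into one sequence file
(file name as key, raw bytes as value); the paths are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackSmallFiles {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path in = new Path("/staging/xml");          // directory of small files (placeholder)
    Path out = new Path("/data/xml-packed.seq"); // one big, splittable container (placeholder)

    try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(out),
        SequenceFile.Writer.keyClass(Text.class),
        SequenceFile.Writer.valueClass(BytesWritable.class),
        SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK))) {

      for (FileStatus status : fs.listStatus(in)) {
        byte[] buf = new byte[(int) status.getLen()];
        try (FSDataInputStream stream = fs.open(status.getPath())) {
          stream.readFully(buf);
        }
        // Key = original file name, value = raw file contents.
        writer.append(new Text(status.getPath().getName()), new BytesWritable(buf));
      }
    }
  }
}
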


::::::::::::::::::::::::::::::::::::::::
Raj K Singh
http://in.linkedin.com/in/rajkrrsingh
http://www.rajkrrsingh.blogspot.com
Mobile  Tel: +91 (0)9899821370


On Mon, Sep 30, 2013 at 1:10 PM, Wolfgang Wyremba <
wolfgang.wyremba@hotmail.com> wrote:

> Hello,
>
> the file format topic is still confusing me and I would appreciate if you
> could share your thoughts and experience with me.
>
> From reading different books/articles/websites I understand that
> - Sequence files (used frequently but not only for binary data),
> - AVRO,
> - RC (was developed to work best with Hive -columnar storage) and
> - ORC (a successor of RC to give Hive another performance boost - Stinger
> initiative)
> are all container file formats to solve the "small files problem" and all
> support compression and splitting.
> Additionally, each file format was developed with specific
> features/benefits
> in mind.
>
> Imagine I have the following text source data
> - 1 TB of XML documents (some millions of small files)
> - 1 TB of JSON documents (some hundred thousands of medium sized files)
> - 1 TB of Apache log files (some thousands of bigger files)
>
> How should I store this data in HDFS to process it using Java MapReduce and
> Pig and Hive?
> I want to use the best tool for my specific problem - with "best"
> performance of course - i.e. maybe one problem on the apache log data can
> be
> best solved using Java MapReduce, another one using Hive or Pig.
>
> Should I simply put the data into HDFS as the data comes from - i.e. as
> plain text files?
> Or should I convert all my data to a container file format like sequence
> files, AVRO, RC or ORC?
>
> Based on this example, I believe
> - the XML documents will be need to be converted to a container file format
> to overcome the "small files problem".
> - the JSON documents could/should not be affected by the "small files
> problem"
> - the Apache files should definitely not be affected by the "small files
> problem", so they could be stored as plain text files.
>
> So, some source data needs to be converted to a container file format,
> others not necessarily.
> But what is really advisable?
>
> Is it advisable to store all data (XML, JSON, Apache logs) in one specific
> container file format in the cluster- let's say you decide to use sequence
> files?
> Having only one file format in HDFS is of course a benefit in terms of
> managing the files and writing Java MapReduce/Pig/Hive code against it.
> Sequence files in this case is certainly not a bad idea, but Hive queries
> could probably better benefit from let's say RC/ORC.
>
> Therefore, is it better to use a mix of plain text files and/or one or more
> container file formats simultaneously?
>
> I know that there will be no crystal-clear answer here as it always
> "depends", but what approach should be taken here, or what is usually used
> in the community out there?
>
> I welcome any feedback and experiences you made.
>
> Thanks
>
>

RE: File formats in Hadoop: Sequence files vs AVRO vs RC vs ORC

Posted by java8964 java8964 <ja...@hotmail.com>.
I am also thinking about this for my current project, so here I share some of my thoughts, but maybe some of them are not correct.
1) In my previous projects years ago, we store a lot of data as plain text, as at that time, people thinks the Big data can store all the data, no need to worry about space issues, until we run out of space in the cluster very fast :-). So lesson number 1, don't store them as text file.2) To compress text file, we need a container for the file, like 'Seq file', 'Avro' or 'proto buf'. I didn't use RC/ORC before, so it will be interested to know more about them later.3) I did a benchmark before, for the data sets we are using, there is also a webpage about the benchmark result. You can google it. I believe the performance from them are close. The real questions are:    a) Language supports    b) Flexible of serialization format    c) How easy it can be used in tools like 'pig/hive' etc.    d) How good it supported in hadoop.
>From my experience, sequence file is not good supported outside of Java language, and it is just a key/value storage, if your data have nest structure data, like your XML/JSON data, you still need a serialization format like google protobuf or Avro to handle it. Store directly XML/JSON in HDFS is really not a good idea. As any InputFormat to support split for them them all requires strict format of the data, and compression won't work very nicely on these kind of data.
We originally used google protobuf a lot, as twitter releases the elephant-bird as open source to support it in hadoop. It is a big plus for it at that time. But recently, we also start to consider Avro seriously now, as it is better supported directly in hadoop. I also like its schema-less vs schema objects both options design. It gives us some flexibility in designing MR jobs.
Thanks
Yong

> From: wolfgang.wyremba@hotmail.com
> To: user@hadoop.apache.org
> Subject: File formats in Hadoop: Sequence files vs AVRO vs RC vs ORC 
> Date: Mon, 30 Sep 2013 09:40:44 +0200
> 
> Hello,
> 
> the file format topic is still confusing me and I would appreciate if you
> could share your thoughts and experience with me.
> 
> From reading different books/articles/websites I understand that
> - Sequence files (used frequently but not only for binary data),
> - AVRO,
> - RC (was developed to work best with Hive -columnar storage) and
> - ORC (a successor of RC to give Hive another performance boost - Stinger
> initiative)
> are all container file formats to solve the "small files problem" and all
> support compression and splitting.
> Additionally, each file format was developed with specific features/benefits
> in mind.
> 
> Imagine I have the following text source data
> - 1 TB of XML documents (some millions of small files)
> - 1 TB of JSON documents (some hundred thousands of medium sized files)
> - 1 TB of Apache log files (some thousands of bigger files)
> 
> How should I store this data in HDFS to process it using Java MapReduce and
> Pig and Hive? 
> I want to use the best tool for my specific problem - with "best"
> performance of course - i.e. maybe one problem on the apache log data can be
> best solved using Java MapReduce, another one using Hive or Pig.
> 
> Should I simply put the data into HDFS as the data comes from - i.e. as
> plain text files?
> Or should I convert all my data to a container file format like sequence
> files, AVRO, RC or ORC?
> 
> Based on this example, I believe 
> - the XML documents will be need to be converted to a container file format
> to overcome the "small files problem".
> - the JSON documents could/should not be affected by the "small files
> problem"
> - the Apache files should definitely not be affected by the "small files
> problem", so they could be stored as plain text files.
> 
> So, some source data needs to be converted to a container file format,
> others not necessarily.
> But what is really advisable?
> 
> Is it advisable to store all data (XML, JSON, Apache logs) in one specific
> container file format in the cluster- let's say you decide to use sequence
> files?
> Having only one file format in HDFS is of course a benefit in terms of
> managing the files and writing Java MapReduce/Pig/Hive code against it.
> Sequence files in this case is certainly not a bad idea, but Hive queries
> could probably better benefit from let's say RC/ORC.
> 
> Therefore, is it better to use a mix of plain text files and/or one or more
> container file formats simultaneously?
> 
> I know that there will be no crystal-clear answer here as it always
> "depends", but what approach should be taken here, or what is usually used
> in the community out there?
> 
> I welcome any feedback and experiences you made.
> 
> Thanks
> 
 		 	   		  

Re: File formats in Hadoop: Sequence files vs AVRO vs RC vs ORC

Posted by Raj K Singh <ra...@gmail.com>.
for xml files processing hadoop comes with a class for this purpose called
StreamXmlRecordReader,You can use it by setting your input format to
StreamInputFormat and setting the
stream.recordreader.class property to
org.apache.hadoop.streaming.StreamXmlRecordReader.

for Json files, an open-source project ElephantBird that contains some
useful utilities for working with LZO compression, has a
LzoJsonInputFormat, which can read JSON, but it requires that the input
file be LZOP compressed. We’ll use this code as a template for our own JSON
InputFormat, which doesn’t have the LZOP compression requirement.

if you are dealing with small files then sequence file format comes in
rescue, it stores sequences of binary key-value pairs. Sequence files are
well suited as a format for MapReduce data since they are
splittable,support compression.


::::::::::::::::::::::::::::::::::::::::
Raj K Singh
http://in.linkedin.com/in/rajkrrsingh
http://www.rajkrrsingh.blogspot.com
Mobile  Tel: +91 (0)9899821370


On Mon, Sep 30, 2013 at 1:10 PM, Wolfgang Wyremba <
wolfgang.wyremba@hotmail.com> wrote:

> Hello,
>
> the file format topic is still confusing me and I would appreciate if you
> could share your thoughts and experience with me.
>
> From reading different books/articles/websites I understand that
> - Sequence files (used frequently but not only for binary data),
> - AVRO,
> - RC (was developed to work best with Hive -columnar storage) and
> - ORC (a successor of RC to give Hive another performance boost - Stinger
> initiative)
> are all container file formats to solve the "small files problem" and all
> support compression and splitting.
> Additionally, each file format was developed with specific
> features/benefits
> in mind.
>
> Imagine I have the following text source data
> - 1 TB of XML documents (some millions of small files)
> - 1 TB of JSON documents (some hundred thousands of medium sized files)
> - 1 TB of Apache log files (some thousands of bigger files)
>
> How should I store this data in HDFS to process it using Java MapReduce and
> Pig and Hive?
> I want to use the best tool for my specific problem - with "best"
> performance of course - i.e. maybe one problem on the apache log data can
> be
> best solved using Java MapReduce, another one using Hive or Pig.
>
> Should I simply put the data into HDFS as the data comes from - i.e. as
> plain text files?
> Or should I convert all my data to a container file format like sequence
> files, AVRO, RC or ORC?
>
> Based on this example, I believe
> - the XML documents will be need to be converted to a container file format
> to overcome the "small files problem".
> - the JSON documents could/should not be affected by the "small files
> problem"
> - the Apache files should definitely not be affected by the "small files
> problem", so they could be stored as plain text files.
>
> So, some source data needs to be converted to a container file format,
> others not necessarily.
> But what is really advisable?
>
> Is it advisable to store all data (XML, JSON, Apache logs) in one specific
> container file format in the cluster- let's say you decide to use sequence
> files?
> Having only one file format in HDFS is of course a benefit in terms of
> managing the files and writing Java MapReduce/Pig/Hive code against it.
> Sequence files in this case is certainly not a bad idea, but Hive queries
> could probably better benefit from let's say RC/ORC.
>
> Therefore, is it better to use a mix of plain text files and/or one or more
> container file formats simultaneously?
>
> I know that there will be no crystal-clear answer here as it always
> "depends", but what approach should be taken here, or what is usually used
> in the community out there?
>
> I welcome any feedback and experiences you made.
>
> Thanks
>
>
