Posted to user@hive.apache.org by Thilina Gunarathne <cs...@gmail.com> on 2014/01/27 20:05:18 UTC

RCFile vs SequenceFile vs text files

Dear all,
We are trying to pick the right data storage format for a Hive table with
the following requirements, and would really appreciate any insights you can
provide to help our decision.

1. ~50 billion records per month, ~14 columns per record, and each record is
~100 bytes. The table is partitioned by date and gets populated
periodically from another Hive query.
2. The columns are dense, so I'm not sure whether we'll get any space
savings by using RCFiles.
3. Data needs to be compressed.
4. We will be doing a lot of aggregation queries over selected columns. There
will be ad-hoc queries over whole records as well.
5. We need the ability to run Java MapReduce programs on the underlying
data. We have existing programs which use custom InputFormats with
compressed text files as input, and we are willing to port them to use other
formats. (How easy is it to use Java MapReduce with RCFiles vs.
SequenceFiles? See the sketch after this list.)
6. Ability to use Hive indexing.
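
For reference, reading a Hive SequenceFile table from Java MapReduce is close
to a stock Hadoop job, since SequenceFileInputFormat handles decompression
transparently. Below is a minimal sketch, assuming the table is STORED AS
SEQUENCEFILE with the default delimited SerDe (each value is one row of
\001-separated fields and the key is an empty BytesWritable that Hive
ignores); the class name and column index are illustrative:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

    public class SeqFileColumnCount {

      // Counts occurrences of one column's values in a Hive SequenceFile table.
      public static class RowMapper
          extends Mapper<BytesWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text column = new Text();

        @Override
        protected void map(BytesWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          // Hive's default field delimiter is \001 (Ctrl-A).
          String[] fields = value.toString().split("\u0001", -1);
          if (fields.length > 2) {
            column.set(fields[2]);        // aggregate on the third column
            context.write(column, ONE);
          }
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "seqfile-column-count");
        job.setJarByClass(SeqFileColumnCount.class);
        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setMapperClass(RowMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));  // e.g. one date partition
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }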

thanks a ton in advance,
Thilina


-- 
https://www.cs.indiana.edu/~tgunarat/
http://www.linkedin.com/in/thilina
http://thilina.gunarathne.org

Re: RCFile vs SequenceFile vs text files

Posted by Thilina Gunarathne <cs...@gmail.com>.
Thanks Edward. I'm actually populating this table periodically from another
temporary table, so ORC sounds like a good fit. Unfortunately, we are
stuck with Hive 0.9.

I wonder how easy or hard it is to use data stored as RCFile or ORC from
Java MapReduce?
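
For what it's worth, RCFile is readable from Java MapReduce too; on Hive 0.9
the pieces live in hive-exec/hive-serde and use the old mapred API. A rough
read-side sketch (the column index is illustrative, and the whole thing is
an assumption-laden outline rather than tested code):

    import java.io.IOException;

    import org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable;
    import org.apache.hadoop.hive.serde2.columnar.BytesRefWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Driver side: jobConf.setInputFormat(
    //     org.apache.hadoop.hive.ql.io.RCFileInputFormat.class);
    // ColumnProjectionUtils (hive-serde) can restrict which columns are read
    // off disk -- the main advantage of RCFile for column-wise aggregation --
    // though its exact signature varies across Hive versions.
    public class RCFileColumnMapper extends MapReduceBase
        implements Mapper<LongWritable, BytesRefArrayWritable, Text, LongWritable> {

      private final Text out = new Text();

      public void map(LongWritable rowId, BytesRefArrayWritable row,
          OutputCollector<Text, LongWritable> collector, Reporter reporter)
          throws IOException {
        // Each value holds the raw bytes of every column of one row.
        BytesRefWritable col = row.get(3);   // fourth column, counting from 0
        out.set(col.getData(), col.getStart(), col.getLength());
        collector.collect(out, rowId);
      }
    }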

thanks,
Thilina



Re: RCFile vs SequenceFile vs text files

Posted by Edward Capriolo <ed...@gmail.com>.
The thing about ORC is that it is great for tables created from other
tables (like the other columnar formats), but if you are logging directly
to HDFS, a columnar format is not easy (or even possible) to write directly.
Normally people store data in a very direct row-oriented form first, and
their first MapReduce job buckets/partitions/columnar-izes it.
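
Within Hive that second step is just an INSERT ... SELECT from the
row-oriented staging table into a table STORED AS RCFILE. At the raw
MapReduce level, a map-only conversion might look roughly like the sketch
below (assuming \001-delimited input rows and a 14-column target; class
names and the column count are illustrative, and RCFileOutputFormat ships
in hive-exec on the old mapred API):

    import java.io.IOException;

    import org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable;
    import org.apache.hadoop.hive.serde2.columnar.BytesRefWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Driver side (sketch):
    //   RCFileOutputFormat.setColumnNumber(conf, 14);
    //   conf.setOutputFormat(
    //       org.apache.hadoop.hive.ql.io.RCFileOutputFormat.class);
    //   conf.setNumReduceTasks(0);          // map-only conversion
    public class RowsToRCFile extends MapReduceBase
        implements Mapper<LongWritable, Text, NullWritable, BytesRefArrayWritable> {

      private static final int NUM_COLUMNS = 14;

      public void map(LongWritable offset, Text line,
          OutputCollector<NullWritable, BytesRefArrayWritable> out,
          Reporter reporter) throws IOException {
        String[] fields = line.toString().split("\u0001", -1);
        BytesRefArrayWritable row = new BytesRefArrayWritable(NUM_COLUMNS);
        for (int i = 0; i < NUM_COLUMNS; i++) {
          // Pad missing trailing columns with empty byte arrays.
          byte[] bytes =
              i < fields.length ? fields[i].getBytes("UTF-8") : new byte[0];
          row.set(i, new BytesRefWritable(bytes, 0, bytes.length));
        }
        out.collect(NullWritable.get(), row);  // RCFile ignores the key
      }
    }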



Re: RCFile vs SequenceFile vs text files

Posted by Edward Capriolo <ed...@gmail.com>.
In general, use SequenceFiles with Gzip or Snappy compression.
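
On the Hive side that amounts to creating the table STORED AS SEQUENCEFILE
and setting hive.exec.compress.output=true with
mapred.output.compression.type=BLOCK before the INSERT. For the Java
MapReduce jobs in point 5, a minimal driver-side sketch (old mapred API;
GzipCodec can be swapped for SnappyCodec where the native Snappy libraries
are installed):

    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SequenceFileOutputFormat;

    public class CompressedSeqFileOutput {
      public static JobConf configure(JobConf conf) {
        conf.setOutputFormat(SequenceFileOutputFormat.class);
        FileOutputFormat.setCompressOutput(conf, true);
        FileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);
        // BLOCK compression compresses runs of records together, which pays
        // off for small (~100 byte) records and keeps the files splittable.
        SequenceFileOutputFormat.setOutputCompressionType(conf,
            SequenceFile.CompressionType.BLOCK);
        return conf;
      }
    }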



Re: RCFile vs SequenceFile vs text files

Posted by Thilina Gunarathne <cs...@gmail.com>.
Thanks Eric and Sharath for the pointers to ORC. Unfortunately, ORC is not
an option for us, as our cluster still runs Hive 0.9 and we won't be
migrating any time soon.

thanks,
Thilina



Re: RCFile vs SequenceFile vs text files

Posted by Sharath Punreddy <sr...@gmail.com>.
Quick insights:

http://hortonworks.com/blog/orcfile-in-hdp-2-better-compression-better-performance/

-- 
Thank you

Sharath Punreddy
1201 Golden Gate Dr,
Southlake, TX 76092
Phone: 626-470-7867

RE: RCFile vs SequenceFile vs text files

Posted by "Eric Hanson (BIG DATA)" <Er...@microsoft.com>.
It sounds like ORC would be best.

    -Eric
