Posted to user@hive.apache.org by Chalcy Raja <Ch...@careerbuilder.com> on 2012/06/26 15:05:49 UTC

hive - snappy and sequence file vs RC file

Hi Hive users,

We are going to use snappy for compression.  

What is the best file format, sequence file or RC file?  Both are splittable and therefore will work well for us.  RC file performance seems to be better than sequence file.  It looks like Sqoop may support the --as-sequencefile option sometime in the future, but RC file is not listed in sqoop import.

Any input on this is highly appreciated.

Thanks,
Chalcy

-----Original Message-----
From: Chalcy Raja [mailto:Chalcy.Raja@careerbuilder.com] 
Sent: Tuesday, June 19, 2012 8:23 AM
To: user@hive.apache.org; 'bejoy_ks@yahoo.com'
Subject: RE: sqoop, hive and lzo and cdh3u3 - not creating in index automatically

Describe formatted tablename is a great DDL statement.  

For one table that sqoop imported into Hive as a sequence file, I see the metadata starts with "SEQ-!".  

I created another table like the one that shows SEQ in the metadata, loaded data into it, and I do not see SEQ in the metadata.  I'll try the head command and see what is going on.

Thanks,
Chalcy

-----Original Message-----
From: Bejoy KS [mailto:bejoy_ks@yahoo.com]
Sent: Tuesday, June 19, 2012 2:59 AM
To: user@hive.apache.org
Subject: Re: sqoop, hive and lzo and cdh3u3 - not creating in index automatically

Hi Chalcy

When you create a table, you specify the format in which the data is stored in HDFS. This value can be determined at any later point using describe extended or describe formatted.

Try out
Describe formatted <tableName>;

To confirm that a file in HDFS is in SequenceFile format, you can check its header. The first few bytes of a sequence file contain metadata such as the compression codec used. Try the Linux head command on the sequence file to see those details.
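As a concrete illustration of that header check (the file below is fabricated; real part files would come from `hadoop fs -get` or `hadoop fs -cat`): a Hadoop SequenceFile starts with the three-byte magic "SEQ" followed by a one-byte version number, and the header then names the key/value classes and the codec.

```shell
# Fabricate a stand-in for a part file; only the first bytes
# mimic a real sequence file header.
printf 'SEQ\006' > /tmp/fake_part-00000

# The first three bytes identify a sequence file.
head -c 3 /tmp/fake_part-00000    # prints: SEQ
```

Against a real table you would run the same head check on a part file under the table's warehouse directory, e.g. `hadoop fs -cat <path> | head -c 100`.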


Regards
Bejoy KS

Sent from handheld, please excuse typos.

-----Original Message-----
From: Chalcy Raja <Ch...@careerbuilder.com>
Date: Tue, 19 Jun 2012 01:32:52
To: user@hive.apache.org<us...@hive.apache.org>; 'bejoy_ks@yahoo.com'<be...@yahoo.com>
Reply-To: user@hive.apache.org
Subject: RE: sqoop, hive and lzo and cdh3u3 - not creating in index  automatically

I did figure out how to compress uncompressed data in a Hive table.  I also created a table in sequence file format.  

Is there a way to know whether a Hive table (the HDFS file underneath) is in sequence file format?  Describe extended on the table does not give the file format.

Thanks,
Chalcy

-----Original Message-----
From: Chalcy Raja [mailto:Chalcy.Raja@careerbuilder.com]
Sent: Monday, June 18, 2012 3:28 PM
To: user@hive.apache.org; 'bejoy_ks@yahoo.com'
Subject: RE: sqoop, hive and lzo and cdh3u3 - not creating in index automatically

Snappy with sequence file works well for us.  We'll have to decide which one suits our needs.  

Is there a way to convert existing HDFS data in text format to sequence files?

Thanks for all your input,
Chalcy  
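A common way to do the conversion asked about above is to create a sequence-file table and repopulate it with an INSERT ... SELECT while output compression is enabled. A hedged HiveQL sketch; the table and column names are invented for illustration, and the property names are the Hadoop-0.20-era ones matching CDH3:

```sql
-- Hypothetical tables: logs_text (existing TEXTFILE) -> logs_seq (SEQUENCEFILE)
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type=BLOCK;

CREATE TABLE logs_seq (id INT, msg STRING)
STORED AS SEQUENCEFILE;

INSERT OVERWRITE TABLE logs_seq
SELECT id, msg FROM logs_text;
```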

-----Original Message-----
From: Chalcy Raja [mailto:Chalcy.Raja@careerbuilder.com]
Sent: Monday, June 18, 2012 1:47 PM
To: user@hive.apache.org; 'bejoy_ks@yahoo.com'
Subject: RE: sqoop, hive and lzo and cdh3u3 - not creating in index automatically

It is there.  I have io.compression.codecs in core-site.xml.  There is no error or warning in the sqoop-to-hive import that indicates anything.  

The only reason we want to go to lzo is because snappy is not splittable.  

Thanks,
Chalcy

-----Original Message-----
From: Bejoy KS [mailto:bejoy_ks@yahoo.com]
Sent: Monday, June 18, 2012 10:39 AM
To: user@hive.apache.org
Subject: Re: sqoop, hive and lzo and cdh3u3 - not creating in index automatically

Hi Chalcy

Since LZO indexing is not working: is the LZO codec class available in the 'io.compression.codecs' property in core-site.xml?

Snappy is not splittable on its own. But sequence files are splittable, so when the two are used together snappy gains the advantage of splittability. 

Regards
Bejoy KS

Sent from handheld, please excuse typos.

-----Original Message-----
From: Chalcy Raja <Ch...@careerbuilder.com>
Date: Mon, 18 Jun 2012 14:31:36
To: user@hive.apache.org<us...@hive.apache.org>; 'bejoy_ks@yahoo.com'<be...@yahoo.com>
Reply-To: user@hive.apache.org
Subject: RE: sqoop, hive and lzo and cdh3u3 - not creating in index  automatically

Hi Bejoy,

The weird thing is I did not get any errors.  The sqoop import will not go to the second phase, where it creates the lzo index.

We did deploy the native libraries, except the hadoop-lzo lib, which we copied over after building it on another machine.  We did the same thing on the test machine also.  

I'll try snappy with sequence file also.  Will snappy with sequence file be naturally splittable on the block (one mapper per block)?

Yes, it is cumbersome to build the lzo library, then create the file, and then create the index.

Thanks,
Chalcy

-----Original Message-----
From: Bejoy KS [mailto:bejoy_ks@yahoo.com]
Sent: Monday, June 18, 2012 10:04 AM
To: user@hive.apache.org
Subject: Re: sqoop, hive and lzo and cdh3u3 - not creating in index automatically

Hi Chalcy

Did you notice any warnings related to lzo codec on your mapreduce task logs or on sqoop logs? 

It could be because the LZO libs are not available on the TaskTracker nodes. These are native libs and are tied to the OS, so if you have done an OS upgrade then you need to rebuild and deploy these native libs as well (a simple copy of native libs built for an older OS may not work as desired).

Like Edward suggested, snappy + sequence is a great combination.

Regards
Bejoy KS

Sent from handheld, please excuse typos.

-----Original Message-----
From: Edward Capriolo <ed...@gmail.com>
Date: Mon, 18 Jun 2012 09:32:01
To: <us...@hive.apache.org>
Reply-To: user@hive.apache.org
Subject: Re: sqoop, hive and lzo and cdh3u3 - not creating in index automatically

Have you considered switching to sequence files using snappy compression (or lzo)?  IIRC, the process of generating LZO files and then generating an index on top of them is cumbersome, whereas sequence files are directly splittable.

On Mon, Jun 18, 2012 at 9:16 AM, Chalcy Raja <Ch...@careerbuilder.com> wrote:
> I am posting it here first and then may be on sqoop user group as well.
>
>
>
> I am trying to use lzo compression.
>
>
>
> Tested on a standalone by installing cdh3u3 and did sqoop to hive 
> import with lzo compression and everything works great. The data is 
> sqooped into hdfs and lzo index file got created and data is in hive table.
>
>
>
> Did all the lzo necessary steps on the main cluster where the server 
> already has cdh3u3 upgraded previously from cdh3u0 to cdh3u1 to cdh3u2 to cdh3u3.
> Did the same sqoop to hive with lzo compression.  Sqoop to hive works 
> but lzo index is not getting created.
>
>
>
> Need expert opinion. What could be the reason for this behavior. 
> Compared all the versions of hive, sqoop etc., and checked all the configuration.
> Looks like we are missing something.
>
>
>
> Thanks,
>
> Chalcy
>
>
>
>




RE: hive - snappy and sequence file vs RC file

Posted by Chalcy Raja <Ch...@careerbuilder.com>.
Snappy vs LZO - 
To implement LZO there are several steps, starting with building the hadoop-lzo library, and it took us a while to get it built. Indexing had to be done as a separate step, and LZO indexing alters the way the files are stored, so it does not use Hadoop's built-in mapper. Snappy, on the other hand, comes packaged with Cloudera, and since we are using the Cloudera distribution, that makes sense for us. LZO compresses better than Snappy, but that was okay for us since performance is better with Snappy sequence files than with LZO.

RC file vs sequence file - we would have gone with RC file for all the reasons given below, but, as Bejoy said, sequence file is widely used. It looks like Sqoop may support sequence file with Hive import, and since we use Sqoop a lot, sequence file is the better choice.

We also tested going back and forth from one compression to another and from one file format to another; since that is possible, we can switch the compression or file format later if we need to.

Thanks,
Chalcy



Re: hive - snappy and sequence file vs RC file

Posted by yongqiang he <he...@gmail.com>.
Can you share the reason for choosing snappy as your compression codec?
Like @omalley mentioned, RCFile will compress the data more densely,
and it will avoid reading data not required by your hive query. And I
think Facebook uses it to store tens of PB (if not hundreds of PB) of data.

Thanks
Yongqiang

Re: hive - snappy and sequence file vs RC file

Posted by Owen O'Malley <om...@apache.org>.
SequenceFile compared to RCFile:
  * More widely deployed.
  * Available from MapReduce and Pig
  * Doesn't compress as small (in RCFile all of each column's values are put
together)
  * Uncompresses and deserializes all of the columns, even if you are only
reading a few

In either case, for long term storage, you should seriously consider the
default codec since that will provide much tighter compression (at the cost
of cpu to compress it).

-- Owen
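Owen's trade-off can be felt with any codec that exposes effort levels. As a rough stand-in (gzip here, since Snappy has no standard CLI on most boxes, and the sample data is fabricated), a low level behaves like a fast codec and a high level like the tighter default codec:

```shell
# Generate a repetitive CSV-ish sample (contents are made up).
yes 'user_id,click_ts,url 12345 1340700000 /jobs/search?q=hive' \
  | head -n 20000 > /tmp/sample.txt

# Fast-but-larger vs slower-but-tighter; sizes vary with the data.
gzip -1 -c /tmp/sample.txt | wc -c    # fast end of the trade-off
gzip -9 -c /tmp/sample.txt | wc -c    # tight end, more CPU
```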

RE: hive - snappy and sequence file vs RC file

Posted by Chalcy Raja <Ch...@careerbuilder.com>.
Thanks! Bejoy. I'll let you know which way we are going.

Thanks,
Chalcy


Re: hive - snappy and sequence file vs RC file

Posted by Bejoy Ks <be...@yahoo.com>.
Hi Chalcy

AFAIK, the RC File format is good when your queries deal with some specific columns rather than the whole data in a row. For general purposes, Sequence File is a better choice. It is also widely adopted, so more tools will have support for Sequence Files.

Regards
Bejoy KS 
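Bejoy's point about column-specific queries can be sketched with plain files (all names and data below are invented): store the same rows once as a row-oriented CSV and once as one file per column, then compare how many bytes a query that touches a single column must read.

```shell
# Row-oriented: a scan for any one column still reads every byte.
printf '1,alice,engineer\n2,bob,analyst\n3,carol,manager\n' > /tmp/rows.csv

# Column-oriented: each column lives in its own file (RCFile groups
# column values together inside row groups; this is a crude analogue).
cut -d, -f1 /tmp/rows.csv > /tmp/col_id
cut -d, -f2 /tmp/rows.csv > /tmp/col_name

wc -c < /tmp/rows.csv    # bytes a row-store scan of one column touches
wc -c < /tmp/col_id      # bytes a column-store read of `id` touches
```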


