Posted to user@hive.apache.org by Zheng Shao <zs...@gmail.com> on 2009/04/07 22:50:09 UTC

Re: Keeping Data compressed

Hi all,

We happened to see a similar problem internally and found the cause:

If the hadoop distribution is not compiled with "-Dcompile.native=true",
then opening a compressed SequenceFile will fail.
As a result, MoveTask will not recognize a SequenceFile as a SequenceFile and
will bail out.

I just opened HIVE-393 https://issues.apache.org/jira/browse/HIVE-393 to
track this.
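
For anyone hitting this before the fix: the native libraries have to be
compiled into the hadoop distribution, and the task logs tell you whether
they were picked up. A rough sketch - the ant target here is only a
placeholder, use whatever target you normally build with:

  # rebuild hadoop with the native compression libraries
  ant -Dcompile.native=true <your usual target>

  # at runtime the logs should contain a line like
  #   INFO util.NativeCodeLoader: Loaded the native-hadoop library
  # instead of the "Unable to load native-hadoop library for your
  # platform..." warning.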

Zheng


On Mon, Mar 30, 2009 at 6:45 AM, Johan Oskarsson <jo...@oskarsson.nu> wrote:

> No I haven't had the chance, I'll try to give that a go this week.
>
> /Johan
>
> Stephen Corona wrote:
> > Hey Johan, I was never actually able to get that working with Hive.
> > Have you tested it w/ Hive yet?
> >
> > Sent from my iPhone
> >
> > On Mar 30, 2009, at 9:33 AM, "Johan Oskarsson" <jo...@oskarsson.nu>
> > wrote:
> >
> >> It is actually possible to split Lzo files, that's how we store and
> >> process a lot of log files at Last.fm. For more details see
> >>
> http://blog.oskarsson.nu/2009/03/hadoop-feat-lzo-save-disk-space-and.html
> >>
> >> Unfortunately the code for lzo compression was removed from future
> >> versions of Hadoop due to licensing issues. There is work being done
> >> towards having this code published outside of Hadoop to allow people to
> >> use it anyway.
> >>
> >> /Johan
> >>
> >> Joydeep Sen Sarma wrote:
> >>> As we mentioned earlier - compressed textfiles (with gz or lzo
> >>> compression) cannot be split during the map phase. Depending on the
> >>> size of the files on hdfs u end up with - this may or may not be a
> >>> good idea.
> >>>
> >>> It's not just that there is limited map parallelism - but worse,
> >>> with large map inputs the mappers must sort large amounts of
> >>> data and start spilling to disk. Whenever this happens - jobs can
> >>> become very slow.
> >>>
> >>> -----Original Message-----
> >>> From: Bob Schulze [mailto:b.schulze@ecircle.com]
> >>> Sent: Thursday, March 19, 2009 10:20 AM
> >>> To: hive-user@hadoop.apache.org
> >>> Subject: Re: Keeping Data compressed
> >>>
> >>> Joydeep Sen Sarma schrieb:
> >>>> Can't reproduce this. can u run explain on the insert query and
> >>>> post the results?
> >>>>
> >>> I'll do this, but meanwhile I figured out that I don't need sequence
> >>> files to get compression. I just stay with textfiles:
> >>>
> >>> 1. hadoop fs -copyFromLocal f1 <hdfs path>
> >>> 2. create table t1 (...) stored as textfile
> >>> 3. load data inpath 'f1' into table t1
> >>>   -> t1 is a textfile, uncompressed
> >>> 4. create table t2 (...) stored as textfile
> >>> 5. from t1 insert overwrite table t2 select *
> >>>   -> t2 is compressed (according to my hadoop/hive settings)
> >>>
> >>> So my original desire is fulfilled, thank you all for your help.
> >>>
> >>> Still, it raises more questions for me:
> >>> a) Would I benefit somehow from sequencefiles? My Hive queries run
> >>> faster than expected with compression...
> >>> b) More interesting: The namenode web page shows many files in the
> >>> hive warehouse directory of t2 - the MR output from the insert into
> >>> t2, I assume. But the size is now the compressed size, e.g. 10 MB.
> >>> The block size is still 64 MB -> isn't this a waste of space?
> >>>
> >>> Bob
> >>>
> >>>> -----Original Message-----
> >>>> From: Bob Schulze [mailto:b.schulze@ecircle.com]
> >>>> Sent: Thursday, March 19, 2009 3:05 AM
> >>>> To: hive-user@hadoop.apache.org
> >>>> Subject: Re: Keeping Data compressed
> >>>>
> >>>> Repeated it again, it fails in the last step
> >>>>
> >>>>    INSERT OVERWRITE TABLE seqtable SELECT * FROM texttable;
> >>>>
> >>>> with the same message:
> >>>>
> >>>> ...
> >>>> Loading data to table t2
> >>>> Failed with exception Cannot load text files into a table stored as
> >>>> SequenceFile.
> >>>> FAILED: Execution Error, return code 1 from
> >>>> org.apache.hadoop.hive.ql.exec.MoveTask
> >>>> ...
> >>>>
> >>>> There is indeed such a check in MoveTask.java. MoveTask always seems
> >>>> to be chosen, no matter what I try in the select statement.
> >>>>
> >>>> Bob
> >>>>
> >>>> Zheng Shao schrieb:
> >>>>> Hi Bob,
> >>>>>
> >>>>> The reason you see "Failed with exception Cannot load text files
> >>>>> into a table stored as SequenceFile" is that you are trying to load
> >>>>> text files into a table declared with "stored as sequencefile".
> >>>>>
> >>>>> Let me put all the commands that you need together:
> >>>>>
> >>>>> CREATE TABLE texttable (...) STORED AS TEXTFILE;
> >>>>> LOAD DATA ... OVERWRITE INTO TABLE texttable;
> >>>>> CREATE TABLE seqtable (...) STORED AS SEQUENCEFILE;
> >>>>> set hive.exec.compress.output=true;
> >>>>> set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
> >>>>> set mapred.output.compression.type=BLOCK;
> >>>>> INSERT OVERWRITE TABLE seqtable SELECT * FROM texttable;
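> >>>>>
> >>>>> As a quick sanity check afterwards (just a sketch - the path assumes
> >>>>> the default warehouse location, yours may differ):
> >>>>>
> >>>>> DESCRIBE EXTENDED seqtable;
> >>>>>   -- should report the SequenceFile input/output formats
> >>>>> dfs -ls /user/hive/warehouse/seqtable;
> >>>>>   -- the files should be much smaller than the text originals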
> >>>>>
> >>>>> Let me know if this works or not. If not, please let me know which
> >>>>> step goes wrong and the error message.
> >>>>>
> >>>>> Zheng
> >>>>>
> >>>>> On Thu, Mar 19, 2009 at 1:34 AM, Bob Schulze <b.schulze@ecircle.com
> >>>>> <ma...@ecircle.com>> wrote:
> >>>>>
> >>>>>    Thx Joydeep,
> >>>>>
> >>>>>           I actually tried that way; in all combinations
> >>>>>    (file->seq table, file->txt table->seq table) I end up with a
> >>>>>
> >>>>>    "Failed with exception Cannot load text files into a table
> >>>>> stored as
> >>>>>    SequenceFile.
> >>>>>    FAILED: Execution Error, return code 1 from
> >>>>>    org.apache.hadoop.hive.ql.exec.MoveTask"
> >>>>>
> >>>>>    The path you propose _is_ working if compression is disabled
> >>>>>    (I see then that a sequence file is created in hdfs). Does the
> >>>>>    compression setting for hadoop (mapred.compress.map.output=true)
> >>>>>    possibly conflict with the hive setting
> >>>>>    (hive.exec.compress.output=true)?
> >>>>>
> >>>>>    Besides that, I wonder how Hive deals with the key/value records
> >>>>>    in a sequence file.
> >>>>>
> >>>>>    Bob
> >>>>>
> >>>>>    Joydeep Sen Sarma schrieb:
> >>>>>> Hey - not sure if anyone responded.
> >>>>>>
> >>>>>> Sequencefiles are the way to go if u want parallelism on the files
> >>>>>    as well (since gz compressed files cannot be split).
> >>>>>> One simple way to do this is to start with text files, build a
> >>>>>    (potentially external) table on them - and load them into another
> >>>>>    table that is declared to be stored as a sequencefile. the load can
> >>>>>    simply be an 'insert overwrite table XXX select * from YYY' on the
> >>>>>    first table (YYY). The first table is just a tmp table used to do
> >>>>>    the loading.
> >>>>>> Whether the data is compressed or not as a result is controlled by
> >>>>>    the hive option 'hive.exec.compress.output'. if this is set to true
> >>>>>    - the codec used is whatever is dictated by hadoop options that
> >>>>>    control the codec. The relevant options are:
> >>>>>> mapred.output.compression.codec
> >>>>>> mapred.output.compression.type
> >>>>>>
> >>>>>> u want to set them to org.apache.hadoop.io.compress.GzipCodec and
> >>>>>    BLOCK respectively.
> >>>>>> Hope this helps,
> >>>>>>
> >>>>>> Joydeep
> >>>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Bob Schulze [mailto:b.schulze@ecircle.com
> >>>>>    <ma...@ecircle.com>]
> >>>>>> Sent: Wednesday, March 18, 2009 8:07 AM
> >>>>>> To: hive-user@hadoop.apache.org <mailto:hive-user@hadoop.apache.org
> >>>>>> Subject: Keeping Data compressed
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>>      I want to keep data in hadoop compressed, ready for
> >>>>>    hive-selects to
> >>>>>> access.
> >>>>>>
> >>>>>> Is using sequencefiles with compression the way to go?
> >>>>>>
> >>>>>> How can I get my data into hive tables "as sequencefile", with
> >>>>>> underlying compression?
> >>>>>>
> >>>>>> Thx for any ideas,
> >>>>>>
> >>>>>>      Bob
> >>>>>>
> >>>>>
> >>>>>    --
> >>>>>
> >>>>>           Bob Schulze
> >>>>>           Head Software Development
> >>>>>           eCircle AG, Munich, Germany
> >>>>>           +49-89-12009-703
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> Yours,
> >>>>> Zheng
> >>>
>
>


-- 
Yours,
Zheng

Re: Keeping Data compressed

Posted by Zheng Shao <zs...@gmail.com>.
HIVE-393 is resolved. Please try it again. This problem should not happen
any more.
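
For reference, the full sequence to retry is the same one from earlier in
the thread (the column list and the LOAD path are placeholders to fill in):

CREATE TABLE texttable (...) STORED AS TEXTFILE;
LOAD DATA INPATH '...' OVERWRITE INTO TABLE texttable;
CREATE TABLE seqtable (...) STORED AS SEQUENCEFILE;
set hive.exec.compress.output=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
set mapred.output.compression.type=BLOCK;
INSERT OVERWRITE TABLE seqtable SELECT * FROM texttable;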

Zheng


-- 
Yours,
Zheng