Posted to user@hive.apache.org by 김영우 <wa...@gmail.com> on 2010/04/22 07:15:18 UTC

HADOOP-4012 and bzip2 input splitting

Hi,

HADOOP-4012 (https://issues.apache.org/jira/browse/HADOOP-4012) has been
committed, and CDH3 supports bzip2 splitting.
I'm wondering whether Hive supports input splitting for bzip2-compressed text
files (*.bz2). If not, should I implement a custom SerDe for bzip2-compressed
files?

Thanks,
Youngwoo

Re: HADOOP-4012 and bzip2 input splitting

Posted by 김영우 <wa...@gmail.com>.
Zheng,

It's 'org.apache.hadoop.hive.ql.io.HiveInputFormat'. I don't know for certain
whether MAPREDUCE-830 is in CDH3; I could not find any clues.

Thanks for your help.

- Youngwoo


Re: HADOOP-4012 and bzip2 input splitting

Posted by Zheng Shao <zs...@gmail.com>.
Can you take a look at the "job.xml" link in your map-reduce job
created by Hive and let me know the mapred.input.format.class?
Is it HiveInputFormat or CombineHiveInputFormat?

It should work if you set it to org.apache.hadoop.hive.ql.io.HiveInputFormat.

Also, can you verify whether
https://issues.apache.org/jira/browse/MAPREDUCE-830 is in your Hadoop
distribution?
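
One way to check the property without reading the job page by eye is to save job.xml and look the value up programmatically. The following is a minimal Python sketch, assuming job.xml uses the usual Hadoop configuration layout (a <configuration> root holding <property> elements with <name> and <value> children). The helper name read_job_property and the sample document are hypothetical and only for illustration.

```python
import xml.etree.ElementTree as ET


def read_job_property(job_xml_text, name):
    """Return the value of a named property from Hadoop-style config XML, or None."""
    root = ET.fromstring(job_xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None


# Hypothetical, trimmed-down job.xml used only to exercise the helper.
SAMPLE_JOB_XML = """
<configuration>
  <property>
    <name>mapred.input.format.class</name>
    <value>org.apache.hadoop.hive.ql.io.HiveInputFormat</value>
  </property>
</configuration>
"""

if __name__ == "__main__":
    print(read_job_property(SAMPLE_JOB_XML, "mapred.input.format.class"))
```

The same helper can be pointed at any other property name in the file; it returns None when the property is absent.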

Zheng




-- 
Yours,
Zheng
http://www.linkedin.com/in/zshao

Re: HADOOP-4012 and bzip2 input splitting

Posted by 김영우 <wa...@gmail.com>.
Zheng,

Thanks for your quick reply, but there is only 1 mapper for my job with a
300 MB bz2 file.

I added the following to my core-site.xml:

<property>
<name>io.compression.codecs</name>
<value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
</property>
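
To double-check that a codec list like this one covers .bz2 files, the value can be split on commas and matched against file suffixes. This is a minimal Python sketch of that lookup, not Hadoop's own code. The suffix table is an assumption based on the usual default extensions of these three codecs (.deflate, .gz, .bz2), and the function names are hypothetical.

```python
# Value copied from the io.compression.codecs property in this message.
CODECS_VALUE = (
    "org.apache.hadoop.io.compress.DefaultCodec,"
    "org.apache.hadoop.io.compress.GzipCodec,"
    "org.apache.hadoop.io.compress.BZip2Codec"
)

# Assumed default file suffix for each codec class (illustrative table).
DEFAULT_SUFFIX = {
    "org.apache.hadoop.io.compress.DefaultCodec": ".deflate",
    "org.apache.hadoop.io.compress.GzipCodec": ".gz",
    "org.apache.hadoop.io.compress.BZip2Codec": ".bz2",
}


def parse_codecs(value):
    """Split a comma-separated codec list, ignoring blanks and whitespace."""
    return [c.strip() for c in value.split(",") if c.strip()]


def codec_for_file(filename, codecs):
    """Return the configured codec whose suffix matches filename, or None."""
    for codec in codecs:
        suffix = DEFAULT_SUFFIX.get(codec)
        if suffix and filename.endswith(suffix):
            return codec
    return None
```

A file whose suffix matches none of the configured codecs comes back as None, which corresponds to reading it as uncompressed text.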

My table definition:

create table test_bzip2
(
col1 string,
.
.

col20 string
)
row format delimited
fields terminated by '\t'
stored as textfile;

I ran a simple grouping/count query; the following is the query's plan:
STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        test_bzip2
          TableScan
            alias: test_bzip2
            Select Operator
              expressions:
                    expr: siteid
                    type: string
              outputColumnNames: siteid
              Reduce Output Operator
                key expressions:
                      expr: siteid
                      type: string
                sort order: +
                Map-reduce partition columns:
                      expr: siteid
                      type: string
                tag: -1
                value expressions:
                      expr: 1
                      type: int
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: count(VALUE._col0)
          bucketGroup: false
          keys:
                expr: KEY._col0
                type: string
          mode: complete
          outputColumnNames: _col0, _col1
          Select Operator
            expressions:
                  expr: _col0
                  type: string
                  expr: _col1
                  type: bigint
            outputColumnNames: _col0, _col1
            File Output Operator
              compressed: false
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1


I just verified that bz2 splitting works in my cluster using a simple Pig
script; the Pig script produces 3 mappers for its M/R job.
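
The mapper counts reported here fit a simple model of how input splits are counted. The sketch below is a simplified Python approximation of FileInputFormat-style split counting, not the actual Hadoop implementation. The 128 MB split size is an assumption (this thread does not state the cluster's block size), and the function name is hypothetical.

```python
MB = 1024 * 1024
SPLIT_SLOP = 1.1  # a final chunk up to 10% larger than the split size is not cut again


def estimate_splits(file_size, split_size, splittable):
    """Estimate how many input splits (and so mappers) one file yields."""
    if file_size <= 0 or split_size <= 0:
        raise ValueError("file_size and split_size must be positive")
    if not splittable:
        return 1  # a non-splittable compressed file goes to a single mapper
    count = 0
    remaining = file_size
    # Cut full-size splits while the remainder is comfortably larger than one split.
    while remaining / split_size > SPLIT_SLOP:
        count += 1
        remaining -= split_size
    if remaining > 0:
        count += 1  # whatever is left becomes the last split
    return count
```

Under these assumptions a 300 MB file yields 3 splits when the input is treated as splittable and 1 when it is not, which matches the Pig and Hive observations in this message.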

What should I check further? Job config info?

- Youngwoo


Re: HADOOP-4012 and bzip2 input splitting

Posted by Zheng Shao <zs...@gmail.com>.
It should be supported automatically. You don't need to do anything
except add the bzip2 codec to io.compression.codecs in the Hadoop
configuration files (core-site.xml).

Zheng




-- 
Yours,
Zheng
http://www.linkedin.com/in/zshao