Posted to user@hive.apache.org by 김영우 <wa...@gmail.com> on 2010/04/22 07:15:18 UTC
HADOOP-4012 and bzip2 input splitting
Hi,
HADOOP-4012 (https://issues.apache.org/jira/browse/HADOOP-4012) has been
committed, and CDH3 supports bzip2 splitting.
I'm wondering whether Hive supports input splitting for bzip2-compressed text
files (*.bz2). If not, should I implement a custom SerDe for bzip2-compressed
files?
Thanks,
Youngwoo
Re: HADOOP-4012 and bzip2 input splitting
Posted by 김영우 <wa...@gmail.com>.
Zheng,
It's 'org.apache.hadoop.hive.ql.io.HiveInputFormat'. I don't know for certain
whether MAPREDUCE-830 is in CDH3, but I could not find any sign of it.
Thanks for your help.
- Youngwoo
Re: HADOOP-4012 and bzip2 input splitting
Posted by Zheng Shao <zs...@gmail.com>.
Can you take a look at the "job.xml" link in your map-reduce job
created by Hive and let me know the mapred.input.format.class?
Is it HiveInputFormat or CombineHiveInputFormat?
It should work if you set it to org.apache.hadoop.hive.ql.io.HiveInputFormat
Also, can you verify if
https://issues.apache.org/jira/browse/MAPREDUCE-830 is in your hadoop
distribution or not?
Zheng
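For anyone who wants to do the job.xml check above without clicking through the JobTracker page, here is a minimal sketch in Python. It assumes job.xml has been saved locally and uses the usual Hadoop configuration layout (<property> elements with <name> and <value> children). The sample XML and the helper name read_hadoop_conf are made up for illustration.

```python
import xml.etree.ElementTree as ET


def read_hadoop_conf(xml_text):
    """Return a dict of name -> value from Hadoop-style configuration XML."""
    root = ET.fromstring(xml_text)
    conf = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            conf[name.strip()] = (prop.findtext("value") or "").strip()
    return conf


# Hypothetical job.xml excerpt; a real one carries many more properties.
SAMPLE_JOB_XML = """<configuration>
  <property>
    <name>mapred.input.format.class</name>
    <value>org.apache.hadoop.hive.ql.io.HiveInputFormat</value>
  </property>
</configuration>"""

conf = read_hadoop_conf(SAMPLE_JOB_XML)
input_format = conf.get("mapred.input.format.class", "<not set>")
print("mapred.input.format.class =", input_format)
```

To use it on a real job, read the saved file's contents and pass them in place of SAMPLE_JOB_XML.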
--
Yours,
Zheng
http://www.linkedin.com/in/zshao
Re: HADOOP-4012 and bzip2 input splitting
Posted by 김영우 <wa...@gmail.com>.
Zheng,
Thanks for your quick reply, but there is only 1 mapper for my job with a
300 MB bz2 file.
I added the following to my core-site.xml:
<property>
<name>io.compression.codecs</name>
<value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
</property>
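As a sanity check on a property like the one above, the following Python sketch parses a core-site.xml fragment and reports whether BZip2Codec appears in io.compression.codecs. It only shows that the file lists the codec, not that the running cluster has picked it up. The helper name configured_codecs and the inlined sample are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

BZIP2 = "org.apache.hadoop.io.compress.BZip2Codec"


def configured_codecs(core_site_xml):
    """Return the codec class names listed under io.compression.codecs."""
    root = ET.fromstring(core_site_xml)
    for prop in root.iter("property"):
        if (prop.findtext("name") or "").strip() == "io.compression.codecs":
            value = prop.findtext("value") or ""
            # The value is a comma-separated list of class names.
            return [c.strip() for c in value.split(",") if c.strip()]
    return []


# Sample mirroring the property quoted in this message.
CORE_SITE = """<configuration>
  <property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
  </property>
</configuration>"""

codecs = configured_codecs(CORE_SITE)
print("bzip2 codec registered:", BZIP2 in codecs)
```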
My table definition:
create table test_bzip2
(
  col1 string,
  .
  .
  col20 string
)
row format delimited
fields terminated by '\t'
stored as textfile;
I ran a simple grouping/count query; the following is the query's plan:
STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        test_bzip2
          TableScan
            alias: test_bzip2
            Select Operator
              expressions:
                    expr: siteid
                    type: string
              outputColumnNames: siteid
              Reduce Output Operator
                key expressions:
                      expr: siteid
                      type: string
                sort order: +
                Map-reduce partition columns:
                      expr: siteid
                      type: string
                tag: -1
                value expressions:
                      expr: 1
                      type: int
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: count(VALUE._col0)
          bucketGroup: false
          keys:
                expr: KEY._col0
                type: string
          mode: complete
          outputColumnNames: _col0, _col1
          Select Operator
            expressions:
                  expr: _col0
                  type: string
                  expr: _col1
                  type: bigint
            outputColumnNames: _col0, _col1
            File Output Operator
              compressed: false
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1
I just verified that bz2 splitting works in my cluster using a simple Pig
script: the Pig script produces 3 mappers for the M/R job.
What should I check further? The job config info?
- Youngwoo
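As a rough cross-check of the mapper counts above, the sketch below estimates how many splits a splittable 300 MB file would yield at a few split sizes. The split sizes are assumptions, since the thread does not state the cluster's block size. The calculation is a plain ceiling division and ignores the small slack Hadoop allows on the last split.

```python
import math


def expected_splits(file_bytes, split_bytes):
    """Approximate number of input splits for one splittable file."""
    return max(1, math.ceil(file_bytes / split_bytes))


MB = 1024 * 1024
file_size = 300 * MB  # file size reported in this thread

# Candidate split sizes are assumed values, not taken from the thread.
for split_mb in (64, 128, 256):
    count = expected_splits(file_size, split_mb * MB)
    print(split_mb, "MB splits ->", count, "mappers")
```

Under these assumptions, the 3 mappers seen with Pig would be consistent with splits of about 128 MB, while a single mapper suggests the file is being treated as unsplittable on the Hive path.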
Re: HADOOP-4012 and bzip2 input splitting
Posted by Zheng Shao <zs...@gmail.com>.
It should be supported automatically. You don't need to do anything
except add the bzip2 codec to io.compression.codecs in the Hadoop
configuration files (core-site.xml).
Zheng