Posted to user@hive.apache.org by Chen Wang <ch...@gmail.com> on 2014/02/03 07:14:16 UTC
Hadoop streaming with dynamic partition insert generates many small files
Hi,
I am using a Java mapper and reducer to read from one table and write to another:
FROM (
FROM (
SELECT column1,...
FROM table1
WHERE ( partition > 6 and partition < 12 )
) A
MAP A.column1,A....
USING 'java -cp .:my.jar mymapper.mymapper'
AS key, value
CLUSTER BY key
) map_output
INSERT OVERWRITE TABLE target_table PARTITION(partition)
REDUCE
map_output.key,
map_output.value
USING 'java -cp .:myjar.jar myreducer.myreducer'
AS column1, column2;
It's all working fine, except that many (20-30) small files are generated under each partition. I am setting
SET hive.exec.reducers.bytes.per.reducer=1280000000;
hoping to get one big enough file under each partition, but it does not seem to have any effect. I still get 20-30 small files under each folder, each around 7 KB.
How can I force Hive to generate only one big file per partition? Does this have anything to do with the streaming? I recall that in the past, when I read directly from a table with a UDF and wrote to another table, only one big file was generated for the target partition. Not sure why that is.
Any help appreciated.
Thanks,
Chen
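For anyone hitting the same problem: the replies below show that raising hive.exec.reducers.bytes.per.reducer further did the trick here, but Hive also has merge settings aimed specifically at small output files. A hedged sketch with illustrative values (these are standard Hive/MapReduce-era properties, not settings quoted from this thread):

-- Fewer, larger reducers: raise the per-reducer input target (~12.8 GB here).
SET hive.exec.reducers.bytes.per.reducer=12800000000;
-- Or pin the reducer count outright (serializes the reduce phase):
SET mapred.reduce.tasks=1;
-- Or keep the job parallel and let Hive merge small output files
-- in an extra step after the job finishes:
SET hive.merge.mapredfiles=true;
SET hive.merge.size.per.task=256000000;
SET hive.merge.smallfiles.avgsize=128000000;

The merge route is usually preferable to a single pinned reducer, since it keeps the reduce phase parallel and only pays for one extra merge pass.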
RE: Hadoop streaming with dynamic partition insert generates many small files
Posted by "Bogala, Chandra Reddy" <Ch...@gs.com>.
Hi Wang,
This is my first time trying MAP & REDUCE inside a Hive query. Would it be possible to share your mymapper and myreducer code, so that I can understand how the columns (A.column1, A....) are converted to key, value? Also, can you point me to some documents where I can read more about this?
Thanks,
Chandra
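For background on the question above: Hive's MAP and REDUCE clauses (both are alternate syntax for TRANSFORM) stream each input row to the child process as one tab-separated, newline-terminated line on stdin, and parse the tab-separated lines the process prints to stdout back into the columns named in the AS clause. A minimal sketch of such a mapper, with a hypothetical class name and column layout (this is not the original poster's code):

// MyMapper.java - hypothetical Hive streaming mapper.
// Hive sends one row per line on stdin, columns separated by tabs;
// whatever this prints to stdout becomes the "AS key, value" columns.
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class MyMapper {
    public static void main(String[] args) throws Exception {
        BufferedReader in =
            new BufferedReader(new InputStreamReader(System.in, "UTF-8"));
        String line;
        while ((line = in.readLine()) != null) {
            String[] cols = line.split("\t", -1); // -1 keeps trailing empty fields
            String key = cols[0];                 // assume column1 is the key
            String value = cols.length > 1 ? cols[1] : "";
            System.out.println(key + "\t" + value);
        }
    }
}

The reducer side looks the same, except its input arrives grouped by the CLUSTER BY key. The Hive wiki's Transform page (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Transform) documents the protocol in detail.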
Re: Hadoop streaming with dynamic partition insert generates many small files
Posted by Chen Wang <ch...@gmail.com>.
It seems that hive.exec.reducers.bytes.per.reducer was still not big enough: I added another 0, and now I get only one file under each partition.
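That fits how Hive plans the reduce phase: the reducer count is estimated from the bytes read from the source table (not from the tiny output), and with a dynamic partition insert each reducer writes its own file into every partition it receives rows for, so the file count per partition tracks the reducer count. A rough, illustrative calculation (the actual input size is not given in the thread):

reducers ≈ ceil(input_bytes / hive.exec.reducers.bytes.per.reducer), capped by hive.exec.reducers.max

e.g. with ~25 GB read from table1:
25 GB / 1.28 GB per reducer ≈ 20 reducers → the 20-30 small files seen per partition
25 GB / 12.8 GB per reducer ≈ 2 reducers → one or two larger files per partition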