Posted to user@hive.apache.org by Chen Wang <ch...@gmail.com> on 2014/02/03 07:14:16 UTC

Hadoop streaming with insert dynamic partition generates many small files

Hi,
I am using a Java reducer to read from a table and then write to another one:

  FROM (
      FROM (
          SELECT column1, ...
          FROM table1
          WHERE ( partition > 6 AND partition < 12 )
      ) A
      MAP A.column1, A....
      USING 'java -cp .:my.jar mymapper.mymapper'
      AS key, value
      CLUSTER BY key
  ) map_output
  INSERT OVERWRITE TABLE target_table PARTITION(partition)
  REDUCE
      map_output.key,
      map_output.value
  USING 'java -cp .:myjar.jar myreducer.myreducer'
  AS column1, column2;

It's all working fine, except that many (20-30) small files are generated
under each partition. I am setting

  SET hive.exec.reducers.bytes.per.reducer=1280000000;

hoping to get one big enough file under each partition, but it does not seem
to have any effect. I still get 20-30 small files under each folder, and each
file is only around 7 KB.
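
For reference, Hive also has a post-insert merge step that can compact small
output files. A sketch follows; these are standard Hive settings for the
MapReduce engine, but the size values are illustrative, not tuned for this job:

  -- Run an extra merge job after the insert when the average output
  -- file is small, concatenating the small files in each partition.
  SET hive.merge.mapfiles=true;                 -- merge map-only job outputs
  SET hive.merge.mapredfiles=true;              -- merge map-reduce job outputs (off by default)
  SET hive.merge.smallfiles.avgsize=128000000;  -- trigger the merge when avg file < ~128 MB
  SET hive.merge.size.per.task=256000000;       -- aim for ~256 MB merged files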

How can I force Hive to generate only one big file per partition? Does this
have anything to do with the streaming? I recall that in the past, when I read
directly from a table with a UDF and wrote to another table, only one big file
was generated for the target partition. I am not sure why that is.
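
If the goal is exactly one writer (and so one file) per partition, two hedged
options are sketched below. The single-reducer SET is a standard knob; the
DISTRIBUTE BY variant assumes the mapper can also emit the target partition
value, which is not shown in the query above:

  -- Option 1: pin the job to a single reducer. One file per partition,
  -- but no reduce-side parallelism.
  SET mapred.reduce.tasks=1;

  -- Option 2: route rows by partition value instead of clustering by key,
  -- so each dynamic partition is written by exactly one reducer.
  -- (Fragment replacing "CLUSTER BY key" in the map stage above;
  -- "part" is a hypothetical third column the mapper would emit.)
  AS key, value, part
  DISTRIBUTE BY part
  SORT BY part, key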


Any help appreciated.

Thanks,

Chen

RE: Hadoop streaming with insert dynamic partition generates many small files

Posted by "Bogala, Chandra Reddy" <Ch...@gs.com>.
Hi Wang,

    This is my first time trying MAP & REDUCE inside a Hive query. Is it possible to share the mymapper and myreducer code, so that I can understand how the columns (A.column1, A.... to key, value) are converted? Also, can you point me to some documents where I can read more about this?
Thanks,
Chandra
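
For what it's worth: Hive's MAP ... USING and REDUCE ... USING are aliases for
TRANSFORM (the Hive wiki's "LanguageManual Transform" page covers the details).
Each input row is written to the child process's stdin as one tab-separated,
newline-terminated line, and each tab-separated line the process prints to
stdout becomes one output row bound to the AS columns. A minimal mapper
following that contract might look like the sketch below; the class name and
two-column layout are hypothetical stand-ins, not the actual mymapper:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    // Sketch of a Hive streaming mapper: read tab-separated rows from
    // stdin, emit "key<TAB>value" rows on stdout.
    public class MyMapper {
        public static void main(String[] args) throws Exception {
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(System.in, StandardCharsets.UTF_8));
            String line;
            while ((line = in.readLine()) != null) {
                // -1 keeps trailing empty fields instead of dropping them.
                String[] cols = line.split("\t", -1);
                String key = cols[0];                              // hypothetical: first column is the key
                String value = cols.length > 1 ? cols[1] : "\\N";  // \N is Hive's NULL marker
                System.out.println(key + "\t" + value);
            }
        }
    }

The reducer side follows the same line protocol; because of CLUSTER BY key,
all lines for a given key arrive contiguously at one reducer, so myreducer can
detect group boundaries by watching for the key to change between lines.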


Re: Hadoop streaming with insert dynamic partition generates many small files

Posted by Chen Wang <ch...@gmail.com>.
It seems that hive.exec.reducers.bytes.per.reducer was still not big enough: I
added another 0, and now I only get one file under each partition.
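
That lines up with how the MapReduce engine sizes the reduce phase. Roughly
(the byte counts below are illustrative, not measured from this job):

  -- numReducers ~= min(hive.exec.reducers.max,
  --                    ceil(totalMapInputBytes / hive.exec.reducers.bytes.per.reducer))
  -- e.g. 10 GB / 1.28 GB per reducer -> 8 reducers -> up to 8 files per partition
  --      10 GB / 12.8 GB per reducer -> 1 reducer  -> 1 file per partition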

