Posted to user@hive.apache.org by Loren Siebert <lo...@siebert.org> on 2011/08/15 18:10:36 UTC

Hive MAP/REDUCE/TRANSFORM output creates many small files

I’m running into an issue with Hive’s TRANSFORM where the output always gets split among 32 files. Somebody else also ran into a similar issue and we posted on the CDH group last week (http://bit.ly/nR4tyg), but I’m mentioning it here as it’s Hive-specific.

I'm doing something structurally identical to this sample query from the Hive manual (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Transform):

  FROM (
    FROM src
    SELECT TRANSFORM(src.key, src.value)
      ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.TypedBytesSerDe'
    USING '/bin/cat'
    AS (tkey, tvalue)
      ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.TypedBytesSerDe'
    RECORDREADER 'org.apache.hadoop.hive.ql.exec.TypedBytesRecordReader'
  ) tmap
  INSERT OVERWRITE TABLE dest1 SELECT tkey, tvalue

In my case, 32 reducers are launched, and dest1 always ends up with 32 files. If I set hive.exec.reducers.max=1, it does launch only 1 reducer (instead of 32), but I still get 32 teeny output files. Setting the various "hive.merge.*" options does not seem to have any effect.

Is there something else I should be doing to get the output to be in one large file instead of 32 small ones?

Re: Hive MAP/REDUCE/TRANSFORM output creates many small files

Posted by Dave Brondsema <db...@geek.net>.
I think merging the files afterwards is the right approach.  Setting
hive.merge.mapredfiles to true worked for me.  It will still generate many
(e.g. 32) files, and then it'll run a second job that merges them.  Also, in
my queries, the TRANSFORM and USING clauses come after INSERT OVERWRITE.
I don't know whether that makes a difference.  Something like this
(untested):

  FROM (
    FROM src
    SELECT key, value
  ) tmap
  INSERT OVERWRITE TABLE dest1
  SELECT TRANSFORM(key, value)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.TypedBytesSerDe'
    USING '/bin/cat'
    AS (tkey, tvalue)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.TypedBytesSerDe'
    RECORDREADER 'org.apache.hadoop.hive.ql.exec.TypedBytesRecordReader'
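For reference, the merge step I'm describing is controlled by a handful of Hive properties. A minimal sketch (the property names are stock Hive settings; the size values below are illustrative, not recommendations):

  -- Merge small output files of map-reduce jobs (the TRANSFORM case here)
  SET hive.merge.mapredfiles=true;
  -- Merge small output files of map-only jobs too
  SET hive.merge.mapfiles=true;
  -- Illustrative thresholds: target size of the merged files, and the
  -- average output-file size below which the extra merge job is triggered
  SET hive.merge.size.per.task=256000000;
  SET hive.merge.smallfiles.avgsize=16000000;

With these set, Hive adds a conditional second job after the query that concatenates the small files in dest1, which is why you still see the 32 files briefly before the merge runs.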


-- 
Dave Brondsema
Lead Software Engineer - sf.net
Geeknet
