Posted to common-user@hadoop.apache.org by Akira Kitada <ak...@gmail.com> on 2009/03/20 05:49:52 UTC

Splitting a big file into pieces with Hadoop Streaming?

Hi,

Can I split an input file into pieces based on the key (probably the
hash value of the key)?
Since Hadoop Streaming is essentially a shell pipeline,
this seems impossible, but I wanted to double-check to be sure.

Background: the output (an index file) is so large (more than 10 GB)
that it slows down the applications that use it unless it is split into pieces.

Thanks in advance.

Re: Splitting a big file into pieces with Hadoop Streaming?

Posted by Norbert Burger <no...@gmail.com>.
If you're trying to split the results of your MR job, one natural
option is to add another MR job that post-processes your data.  The
mapper for this second job would emit one of N distinct keys per
record (e.g. a hash of the record's key modulo N, giving you N
splits), with your original data as the value.  The reducer logic
would just strip the keys away.
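A minimal sketch of that second-pass Streaming job (the record format
is an assumption: tab-separated "key<TAB>value" lines, and the script
name and bucket count are illustrative).  It leans on the fact that
Streaming's default partitioner hashes the text before the first tab,
so prefixing each record with a bucket number and running with N
reduce tasks yields N output part files:

```python
#!/usr/bin/env python
# Hypothetical split_job.py -- run once as the mapper ("map" argument)
# and once as the reducer ("reduce" argument).
import sys
import zlib

NUM_SPLITS = 8  # how many output pieces you want (illustrative)

def mapper(line):
    """Prefix each record with a stable hash bucket of its key.
    Streaming partitions on the text before the first tab, so with
    8 reduce tasks each bucket lands in its own part file."""
    key, _, value = line.rstrip("\n").partition("\t")
    bucket = zlib.crc32(key.encode("utf-8")) % NUM_SPLITS
    return "%d\t%s\t%s" % (bucket, key, value)

def reducer(line):
    """Strip the bucket prefix, leaving the original record."""
    _, _, rest = line.rstrip("\n").partition("\t")
    return rest

if __name__ == "__main__" and len(sys.argv) > 1:
    fn = mapper if sys.argv[1] == "map" else reducer
    for line in sys.stdin:
        print(fn(line))
```

You would then launch it with something like `-mapper "split_job.py
map" -reducer "split_job.py reduce" -numReduceTasks 8` (the exact
invocation depends on your Streaming setup).  zlib.crc32 is used
instead of Python's built-in hash so the bucket is stable across runs.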

If this is too much work and your files are simple text files, you
could always fall back to head/tail (which split on record
boundaries), split (which splits by line or byte count), or a custom
awk script.
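A quick sketch of that fallback (file names and the tab-separated,
key-in-first-field layout are assumptions, not anything from the
original job):

```shell
# Tiny sample standing in for one output part file (name/format assumed):
printf 'apple\t1\nbanana\t2\ncherry\t3\n' > part-00000

# Fixed-size pieces on record (line) boundaries; a real 10 GB file
# would use something like -l 1000000:
split -l 1 part-00000 index-chunk-

# Key-based pieces: route each record to a file named after the first
# character of its key (a real job would hash the whole key instead):
awk -F'\t' '{ print > ("bucket-" substr($1, 1, 1)) }' part-00000
```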

Norbert

On Fri, Mar 20, 2009 at 10:10 AM, Nick Cen <ce...@gmail.com> wrote:
> I had a similar problem earlier, and I just used split and awk to split
> the file.
>
> 2009/3/20 Akira Kitada <ak...@gmail.com>
>
>> Hi,
>>
>> Can I split an input file into pieces based on the key (probably the
>> hash value of the key)?
>> Since Hadoop Streaming is essentially a shell pipeline,
>> this seems impossible, but I wanted to double-check to be sure.
>>
>> Background: the output (an index file) is so large (more than 10 GB)
>> that it slows down the applications that use it unless it is split into
>> pieces.
>>
>> Thanks in advance.
>>
>
>
>
> --
> http://daily.appspot.com/food/
>

Re: Splitting a big file into pieces with Hadoop Streaming?

Posted by Nick Cen <ce...@gmail.com>.
I had a similar problem earlier, and I just used split and awk to split
the file.

2009/3/20 Akira Kitada <ak...@gmail.com>

> Hi,
>
> Can I split an input file into pieces based on the key (probably the
> hash value of the key)?
> Since Hadoop Streaming is essentially a shell pipeline,
> this seems impossible, but I wanted to double-check to be sure.
>
> Background: the output (an index file) is so large (more than 10 GB)
> that it slows down the applications that use it unless it is split into
> pieces.
>
> Thanks in advance.
>



-- 
http://daily.appspot.com/food/