Posted to user@crunch.apache.org by David Ortiz <dp...@gmail.com> on 2015/01/26 22:15:40 UTC

Output Sizing

Hello,

     Is there any way to control output file sizes in the Crunch pipeline's
write method?  I am processing data that is written to S3 for a program
that cannot handle more than 10-20 MB per file, and I am at a loss for how
to do this without writing a Hive script to process the data.

Thanks,
     David Ortiz

Re: Output Sizing

Posted by Josh Wills <jw...@cloudera.com>.

Hrm-- maybe something like the AvroPathPerKeyTarget, and a DoFn that
divides the data up into enough keys so that the data associated with a
given key is always < 10MB?
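
A minimal sketch of the key-splitting idea in plain Java, with no Crunch
dependency. The class and method names (OutputBucketing, bucketCount,
bucketKey, groupByBucket) are hypothetical, and the total input size is
assumed to be known or estimated up front. In a real pipeline the
bucketKey logic would presumably sit inside the DoFn, producing a table
keyed by bucket name for AvroPathPerKeyTarget to write out. Hashing only
balances the buckets on average, so the per-key target should be set
below the hard limit.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class OutputBucketing {

    /** Number of keys needed so each key holds roughly targetBytesPerKey (ceiling division). */
    public static int bucketCount(long estimatedTotalBytes, long targetBytesPerKey) {
        if (targetBytesPerKey <= 0) {
            throw new IllegalArgumentException("targetBytesPerKey must be positive");
        }
        if (estimatedTotalBytes <= 0) {
            return 1; // always keep at least one bucket
        }
        long n = (estimatedTotalBytes + targetBytesPerKey - 1) / targetBytesPerKey;
        return (int) Math.min(n, Integer.MAX_VALUE);
    }

    /** Assigns a record to a bucket by hash; floorMod keeps the index non-negative. */
    public static String bucketKey(Object record, int buckets) {
        return "part-" + Math.floorMod(record.hashCode(), buckets);
    }

    /** Groups records by bucket key, standing in for the keyed table a DoFn would emit. */
    public static <T> Map<String, List<T>> groupByBucket(List<T> records, int buckets) {
        Map<String, List<T>> grouped = new LinkedHashMap<>();
        for (T record : records) {
            grouped.computeIfAbsent(bucketKey(record, buckets), k -> new ArrayList<>())
                   .add(record);
        }
        return grouped;
    }

    public static void main(String[] args) {
        long totalBytes = 100L * 1024 * 1024; // assumed estimate of the data set size
        long targetBytes = 8L * 1024 * 1024;  // headroom below a 10 MB hard limit
        int buckets = bucketCount(totalBytes, targetBytes);
        System.out.println("buckets = " + buckets);
        System.out.println("key for \"example-record\" = " + bucketKey("example-record", buckets));
    }
}
```

If the record sizes vary a lot, hashing alone will not keep every key
under the limit, and a running byte count per key would be a safer way
to decide when to start a new key.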

-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>