Posted to user@pig.apache.org by Brian Stempin <bs...@50onred.com> on 2014/06/17 22:48:54 UTC

Combining small S3 inputs

Hi,
I was comparing performance of a Hadoop job that I wrote in Java to one
that I wrote in Pig.  I have ~106,000 small (<1 MB) input files.  In my Java
job, I get one split per file, which is really inefficient.  In Pig, the same
input is read in just 49 splits, which is much faster.

How does Pig do this?  Is there a part of the source code that you can
point me to?  I seem to be banging my head on how to combine multiple S3
objects into a single split.

Thanks,
Brian

Re: Combining small S3 inputs

Posted by Cheolsoo Park <pi...@gmail.com>.
Pig implements its own input split. It's really a list of underlying input
splits. Take a look at PigSplit.java:
https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigSplit.java
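To make the idea concrete, here is a simplified, Hadoop-free sketch of what that amounts to conceptually: many small files are greedily packed into "combined" splits until a target byte size is reached, so the number of splits is bounded by total bytes rather than by file count. This is an illustration only (the file sizes and the 256 MB target are made-up numbers, and Pig's real packing logic in PigSplit and its split-combination code is more involved):

```java
import java.util.ArrayList;
import java.util.List;

public class CombineDemo {
    // Greedily pack file sizes into groups of at most maxSplitBytes each.
    // Each inner list stands in for one combined split's list of underlying files.
    static List<List<Long>> combine(List<Long> fileSizes, long maxSplitBytes) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : fileSizes) {
            if (!current.isEmpty() && currentBytes + size > maxSplitBytes) {
                splits.add(current);          // current split is full; start a new one
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            splits.add(current);              // flush the last partial split
        }
        return splits;
    }

    public static void main(String[] args) {
        // ~106,000 small files (sizes illustrative), packed toward 256 MB per split
        List<Long> files = new ArrayList<>();
        for (int i = 0; i < 106_000; i++) {
            files.add(800_000L);
        }
        List<List<Long>> splits = combine(files, 256L * 1024 * 1024);
        System.out.println(splits.size());    // far fewer splits than input files
    }
}
```

The point is that a combined split is just a container for a list of underlying splits; the record reader then iterates through the files inside one task.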


On Tue, Jun 17, 2014 at 2:02 PM, Brian Stempin <bs...@50onred.com> wrote:

> This was where I started.  I created a class that extends
> CombineFileInputFormat and uses a LineRecordReader.  I don't know if this
> is a bug, but somewhere under the covers, the protocol gets removed from
> the URI and it's assumed that the path is an HDFS path.  This causes an
> exception, of course.
>
> I took a look through the Pig source code to see if Pig uses a similar
> tactic to what I was trying, but my search came up dry.
>
> Brian
>

Re: Combining small S3 inputs

Posted by Brian Stempin <bs...@50onred.com>.
This was where I started.  I created a class that extends
CombineFileInputFormat and uses a LineRecordReader.  I don't know if this
is a bug, but somewhere under the covers, the protocol gets removed from
the URI and it's assumed that the path is an HDFS path.  This causes an
exception, of course.
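For anyone hitting the same wall, the symptom can be modeled with plain java.net.URI (bucket and key names here are made up, and this only illustrates the failure mode, not Hadoop's actual internal code path): if any layer keeps only the path component of the URI, the scheme and bucket are gone, and the bare path is then resolved against the default (HDFS) filesystem.

```java
import java.net.URI;

public class SchemeDemo {
    public static void main(String[] args) {
        URI s3 = URI.create("s3n://my-bucket/logs/part-00000");

        // Keeping only the path component drops the scheme ("s3n") and the
        // bucket (the URI authority), so a filesystem lookup on this string
        // would fall back to the default filesystem (typically HDFS).
        String bare = s3.getPath();
        System.out.println(bare);             // "/logs/part-00000"

        // Carrying the full URI preserves the filesystem scheme and bucket.
        System.out.println(s3);               // "s3n://my-bucket/logs/part-00000"
    }
}
```

So the thing to check is wherever the input format turns its stored paths back into filesystem lookups: the full scheme-qualified URI has to survive that round trip.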

I took a look through the Pig source code to see if Pig uses a similar
tactic to what I was trying, but my search came up dry.

Brian

Re: Combining small S3 inputs

Posted by John Meagher <jo...@gmail.com>.
Check out https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/lib/CombineFileInputFormat.html.
I don't know if there's an S3 version, but this should help.
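If staying in Pig is an option, the combining behavior is also tunable directly through Pig properties (assuming a Pig version that supports split combination; the 256 MB value below is just an example):

```
-- cap each combined split at 256 MB
SET pig.maxCombinedSplitSize 268435456;
-- split combination itself is controlled by pig.splitCombination (default true)
```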

On Tue, Jun 17, 2014 at 4:48 PM, Brian Stempin <bs...@50onred.com> wrote:
> Hi,
> I was comparing performance of a Hadoop job that I wrote in Java to one
> that I wrote in Pig.  I have ~106,000 small (<1 MB) input files.  In my Java
> job, I get one split per file, which is really inefficient.  In Pig, the same
> input is read in just 49 splits, which is much faster.
>
> How does Pig do this?  Is there a part of the source code that you can
> point me to?  I seem to be banging my head on how to combine multiple S3
> objects into a single split.
>
> Thanks,
> Brian