Posted to user@pig.apache.org by Brian Stempin <bs...@50onred.com> on 2014/06/17 22:48:54 UTC
Combining small S3 inputs
Hi,
I was comparing performance of a Hadoop job that I wrote in Java to one
that I wrote in Pig. I have ~106,000 small (<1 MB) input files. In my Java
job, I get one split per file, which is really inefficient. In Pig, this
gets done over 49 splits, which is much faster.
How does Pig do this? Is there a piece of the source code that I can be
referred to? I seem to be banging my head on how to combine multiple S3
objects into a single split.
Thanks,
Brian
Re: Combining small S3 inputs
Posted by Cheolsoo Park <pi...@gmail.com>.
Pig implements its own input split. It's really a list of underlying input
splits. Take a look at PigSplit.java:
https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigSplit.java
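The idea behind PigSplit can be sketched without any Hadoop code: many small per-file splits get greedily packed into combined splits until a size cap (Pig exposes this as the `pig.maxCombinedSplitSize` property) is reached. Here is a minimal Python sketch of that packing idea — the function and variable names are illustrative, not Pig's actual code:

```python
# Greedy packing of small per-file splits into combined splits,
# in the spirit of the list of wrapped splits that PigSplit carries.

def combine_splits(file_sizes, max_combined_size):
    """Pack (path, size) pairs into groups whose total size stays
    under max_combined_size. Each group models one combined split."""
    combined = []
    current, current_size = [], 0
    for path, size in sorted(file_sizes, key=lambda p: p[1], reverse=True):
        if current and current_size + size > max_combined_size:
            combined.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        combined.append(current)
    return combined

# ~106,000 files of under 1 MB collapse into a few dozen splits:
files = [("s3n://bucket/part-%05d" % i, 900_000) for i in range(106_000)]
splits = combine_splits(files, max_combined_size=2 * 1024**3)  # 2 GB cap
```

With a 2 GB cap on 900 KB files, each combined split holds a couple of thousand files, which is the same order of reduction Brian saw (106,000 files down to 49 splits).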
On Tue, Jun 17, 2014 at 2:02 PM, Brian Stempin <bs...@50onred.com> wrote:
> This was where I started. I created a class that extends
> CombineFileInputFormat and uses a LineRecordReader. I don't know if this
> is a bug, but somewhere under the covers, the protocol gets removed from
> the URI and it's assumed that the path is an HDFS path. This causes an
> exception, of course.
>
> I took a look through the Pig source code to see if Pig uses a similar
> tactic to what I was trying, but my search came up dry.
>
> Brian
>
Re: Combining small S3 inputs
Posted by Brian Stempin <bs...@50onred.com>.
This was where I started. I created a class that extends
CombineFileInputFormat and uses a LineRecordReader. I don't know if this
is a bug, but somewhere under the covers, the protocol gets removed from
the URI and it's assumed that the path is an HDFS path. This causes an
exception, of course.
I took a look through the Pig source code to see if Pig uses a similar
tactic to what I was trying, but my search came up dry.
Brian
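One way to see the failure mode Brian describes: if code rebuilds a path from a URI but keeps only the path component, the scheme and bucket are lost, and what remains looks like a path on the default filesystem (typically HDFS). A small Python illustration of the pitfall — this is not Hadoop code, and the URI is made up:

```python
from urllib.parse import urlparse

uri = "s3n://my-bucket/logs/2014/06/17/events.gz"
parsed = urlparse(uri)

# Keeping only the path component loses the filesystem entirely:
bare_path = parsed.path  # "/logs/2014/06/17/events.gz"
# A Hadoop job handed bare_path would resolve it against the default
# filesystem (usually HDFS), not S3 -- hence the exception.

# The fix is to carry the fully qualified URI through, scheme included:
qualified = "{0}://{1}{2}".format(parsed.scheme, parsed.netloc, parsed.path)
```

In Hadoop terms, this is roughly the difference between resolving a Path against the default `FileSystem.get(conf)` versus asking the path itself via `path.getFileSystem(conf)`, which picks the filesystem that matches the URI's scheme.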
Re: Combining small S3 inputs
Posted by John Meagher <jo...@gmail.com>.
Check out
https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/lib/CombineFileInputFormat.html
I don't know if there's an S3 version, but this should help.
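On the Pig side, the combining behavior that CombineFileInputFormat provides is already built in and controlled by two properties (combination is on by default; the size cap below is an illustrative value, not a recommendation):

```
SET pig.splitCombination true;
SET pig.maxCombinedSplitSize 134217728; -- ~128 MB of input per combined split
```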
On Tue, Jun 17, 2014 at 4:48 PM, Brian Stempin <bs...@50onred.com> wrote:
> Hi,
> I was comparing performance of a Hadoop job that I wrote in Java to one
> that I wrote in Pig. I have ~106,000 small (<1 MB) input files. In my Java
> job, I get one split per file, which is really inefficient. In Pig, this
> gets done over 49 splits, which is much faster.
>
> How does Pig do this? Is there a piece of the source code that I can be
> referred to? I seem to be banging my head on how to combine multiple S3
> objects into a single split.
>
> Thanks,
> Brian