Posted to mapreduce-user@hadoop.apache.org by Something Something <ma...@gmail.com> on 2013/02/11 19:24:53 UTC

Re: Loader for small files

Sorry.. Moving 'hbase' mailing list to BCC 'cause this is not related to
HBase.  Adding 'hadoop' user group.

On Mon, Feb 11, 2013 at 10:22 AM, Something Something <
mailinglists19@gmail.com> wrote:

> Hello,
>
> We are running into performance issues with Pig/Hadoop because our input
> files are small.  Everything goes to only 1 Mapper.  To get around this, we
> are trying to use our own Loader like this:
>
> 1)  Extend PigStorage:
>
> public class SmallFileStorage extends PigStorage {
>
>     public SmallFileStorage(String delimiter) {
>         super(delimiter);
>     }
>
>     @Override
>     public InputFormat getInputFormat() {
>         return new NLineInputFormat();
>     }
> }
>
>
>
> 2)  Add command line argument to the Pig command as follows:
>
> -Dmapreduce.input.lineinputformat.linespermap=500000
>
>
>
> 3)  Use SmallFileStorage in the Pig script as follows:
>
> USING com.xxx.yyy.SmallFileStorage ('\t')
>
>
> But this doesn't seem to work.  We still see that everything is going to
> one mapper.  Before we spend any more time on this, I am wondering if this
> is a good approach – OR – if there's a better approach?  Please let me
> know.  Thanks.
>
>
>
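
For reference, a compilable version of the loader sketched above might look like the following. This is a minimal sketch, not the original poster's exact class: it assumes Pig's new-API PigStorage/LoadFunc (org.apache.hadoop.mapreduce) and Hadoop's built-in NLineInputFormat, and the setLocation override plus the LINES_PER_MAP constant are illustrative additions that set the lines-per-map value directly on the job rather than relying only on the -D flag.

import java.io.IOException;

import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.pig.builtin.PigStorage;

public class SmallFileStorage extends PigStorage {

    // Hypothetical default; tune to the data. Not part of the original post.
    private static final int LINES_PER_MAP = 500000;

    public SmallFileStorage(String delimiter) {
        super(delimiter);
    }

    @Override
    public InputFormat getInputFormat() {
        // Each split covers N input lines instead of one file or block per mapper.
        return new NLineInputFormat();
    }

    @Override
    public void setLocation(String location, Job job) throws IOException {
        super.setLocation(location, job);
        // Set linespermap on the job itself, in case a -D flag passed to the
        // pig command does not reach the map-reduce job that Pig launches.
        NLineInputFormat.setNumLinesPerSplit(job, LINES_PER_MAP);
    }
}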

Re: Loader for small files

Posted by Something Something <ma...@gmail.com>.
No, Yong, I believe you misunderstood. David's explanation makes sense.  As
pointed out in my original email, everything is going to 1 Mapper.  It's
not creating multiple mappers.

BTW, the code given in my original email indeed works as expected.  It
does trigger multiple mappers, but it doesn't really improve the
performance.

We believe the problem is data skew.  We are looking into creating a
custom Partitioner to solve it.  Thanks.


On Tue, Feb 12, 2013 at 7:15 AM, java8964 java8964 <ja...@hotmail.com> wrote:

>   Hi, Davie:
>
> I am not sure I understand this suggestion. Why smaller block size will
> help this performance issue?
>
> From what the original question about, it looks like the performance
> problem is due to that there are a lot of small files, and each file will
> run in its own mapper.
>
> As hadoop needs to start a lot of mappers (I think creating a mapper also
> takes time and resource), but each mapper only take small amount of data
> (maybe hundreds K or several M of data, much less than the block size),
> most of the time is wasting on creating task instance for mapper, but each
> mapper finishes very quickly.
>
> This is the reason of performance problem, right? Do I understand the
> problem wrong?
>
> If so, reducing the block size won't help in this case, right? To fix it,
> we need to merge multi-files into one mapper, so let one mapper has enough
> data to process.
>
> Unless my understanding is total wrong, I don't know how reducing block
> size will help in this case.
>
> Thanks
>
> Yong
>
> > Subject: Re: Loader for small files
> > From: davidlabarbera@localresponse.com
> > Date: Mon, 11 Feb 2013 15:38:54 -0500
> > CC: user@hadoop.apache.org
> > To: user@pig.apache.org
>
> >
> > What process creates the data in HDFS? You should be able to set the
> block size there and avoid the copy.
> >
> > I would test the dfs.block.size on the copy and see if you get the
> mapper split you want before worrying about optimizing.
> >
> > David
> >
> > On Feb 11, 2013, at 2:10 PM, Something Something <
> mailinglists19@gmail.com> wrote:
> >
> > > David: Your suggestion would add an additional step of copying data
> from
> > > one place to another. Not bad, but not ideal. Is there no way to avoid
> > > copying of data?
> > >
> > > BTW, we have tried changing the following options to no avail :(
> > >
> > > set pig.splitCombination false;
> > >
> > > & a few other 'dfs' options given below:
> > >
> > > mapreduce.min.split.size
> > > mapreduce.max.split.size
> > >
> > > Thanks.
> > >
> > > On Mon, Feb 11, 2013 at 10:29 AM, David LaBarbera <
> > > davidlabarbera@localresponse.com> wrote:
> > >
> > >> You could store your data in smaller block sizes. Do something like
> > >> hadoop fs HADOOP_OPTS="-Ddfs.block.size=1048576
> > >> -Dfs.local.block.size=1048576" -cp /org-input /small-block-input
> > >> You might only need one of those parameters. You can verify the block
> size
> > >> with
> > >> hadoop fsck /small-block-input
> > >>
> > >> In your pig script, you'll probably need to set
> > >> pig.maxCombinedSplitSize
> > >> to something around the block size
> > >>
> > >> David
> > >>
> > >> On Feb 11, 2013, at 1:24 PM, Something Something <
> mailinglists19@gmail.com>
> > >> wrote:
> > >>
> > >>> Sorry.. Moving 'hbase' mailing list to BCC 'cause this is not
> related to
> > >>> HBase. Adding 'hadoop' user group.
> > >>>
> > >>> On Mon, Feb 11, 2013 at 10:22 AM, Something Something <
> > >>> mailinglists19@gmail.com> wrote:
> > >>>
> > >>>> Hello,
> > >>>>
> > >>>> We are running into performance issues with Pig/Hadoop because our
> input
> > >>>> files are small. Everything goes to only 1 Mapper. To get around
> > >> this, we
> > >>>> are trying to use our own Loader like this:
> > >>>>
> > >>>> 1) Extend PigStorage:
> > >>>>
> > >>>> public class SmallFileStorage extends PigStorage {
> > >>>>
> > >>>> public SmallFileStorage(String delimiter) {
> > >>>> super(delimiter);
> > >>>> }
> > >>>>
> > >>>> @Override
> > >>>> public InputFormat getInputFormat() {
> > >>>> return new NLineInputFormat();
> > >>>> }
> > >>>> }
> > >>>>
> > >>>>
> > >>>>
> > >>>> 2) Add command line argument to the Pig command as follows:
> > >>>>
> > >>>> -Dmapreduce.input.lineinputformat.linespermap=500000
> > >>>>
> > >>>>
> > >>>>
> > >>>> 3) Use SmallFileStorage in the Pig script as follows:
> > >>>>
> > >>>> USING com.xxx.yyy.SmallFileStorage ('\t')
> > >>>>
> > >>>>
> > >>>> But this doesn't seem to work. We still see that everything is
> going to
> > >>>> one mapper. Before we spend any more time on this, I am wondering if
> > >> this
> > >>>> is a good approach – OR – if there's a better approach? Please let
> me
> > >>>> know. Thanks.
> > >>>>
> > >>>>
> > >>>>
> > >>
> > >>
> >
>
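
As a rough illustration of the custom Partitioner mentioned in the reply above: a partitioner plugged into Pig (for example through a PARTITION BY clause) is expected to extend org.apache.hadoop.mapreduce.Partitioner<PigNullableWritable, Writable>. The class below is a hypothetical skeleton, not code from this thread; it only marks where hot-key handling would go, and whether spreading a hot key across reducers is valid depends on the operation being performed.

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.pig.impl.io.PigNullableWritable;

// Hypothetical skeleton of a skew-aware partitioner for use from Pig.
public class SkewAwarePartitioner extends Partitioner<PigNullableWritable, Writable> {

    @Override
    public int getPartition(PigNullableWritable key, Writable value, int numPartitions) {
        // Default behaviour: hash the key, like Hadoop's HashPartitioner.
        // Real skew handling would special-case known hot keys here, e.g. by
        // spreading their records over several reducers when the operation
        // can tolerate a key being split.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}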

RE: Loader for small files

Posted by java8964 java8964 <ja...@hotmail.com>.
Hi, David:
I am not sure I understand this suggestion. Why would a smaller block size help with this performance issue?
From the original question, it looks like the performance problem is that there are a lot of small files and each file runs in its own mapper.
Hadoop needs to start a lot of mappers (creating a mapper also takes time and resources), but each mapper only gets a small amount of data (maybe hundreds of KB or a few MB, much less than the block size), so most of the time is wasted on creating task instances while each mapper finishes very quickly.
That is the cause of the performance problem, right? Or do I understand the problem wrong?
If so, reducing the block size won't help in this case, right? To fix it, we need to merge multiple files into one mapper, so that each mapper has enough data to process.
Unless my understanding is totally wrong, I don't see how reducing the block size will help in this case.
Thanks
Yong

> Subject: Re: Loader for small files
> From: davidlabarbera@localresponse.com
> Date: Mon, 11 Feb 2013 15:38:54 -0500
> CC: user@hadoop.apache.org
> To: user@pig.apache.org
> 
> What process creates the data in HDFS? You should be able to set the block size there and avoid the copy.
> 
> I would test the dfs.block.size on the copy and see if you get the mapper split you want before worrying about optimizing.
> 
> David
> 
> On Feb 11, 2013, at 2:10 PM, Something Something <ma...@gmail.com> wrote:
> 
> > David:  Your suggestion would add an additional step of copying data from
> > one place to another.  Not bad, but not ideal.  Is there no way to avoid
> > copying of data?
> > 
> > BTW, we have tried changing the following options to no avail :(
> > 
> > set pig.splitCombination false;
> > 
> > & a few other 'dfs' options given below:
> > 
> > mapreduce.min.split.size
> > mapreduce.max.split.size
> > 
> > Thanks.
> > 
> > On Mon, Feb 11, 2013 at 10:29 AM, David LaBarbera <
> > davidlabarbera@localresponse.com> wrote:
> > 
> >> You could store your data in smaller block sizes. Do something like
> >> hadoop fs HADOOP_OPTS="-Ddfs.block.size=1048576
> >> -Dfs.local.block.size=1048576" -cp /org-input /small-block-input
> >> You might only need one of those parameters. You can verify the block size
> >> with
> >> hadoop fsck /small-block-input
> >> 
> >> In your pig script, you'll probably need to set
> >> pig.maxCombinedSplitSize
> >> to something around the block size
> >> 
> >> David
> >> 
> >> On Feb 11, 2013, at 1:24 PM, Something Something <ma...@gmail.com>
> >> wrote:
> >> 
> >>> Sorry.. Moving 'hbase' mailing list to BCC 'cause this is not related to
> >>> HBase.  Adding 'hadoop' user group.
> >>> 
> >>> On Mon, Feb 11, 2013 at 10:22 AM, Something Something <
> >>> mailinglists19@gmail.com> wrote:
> >>> 
> >>>> Hello,
> >>>> 
> >>>> We are running into performance issues with Pig/Hadoop because our input
> >>>> files are small.  Everything goes to only 1 Mapper.  To get around
> >> this, we
> >>>> are trying to use our own Loader like this:
> >>>> 
> >>>> 1)  Extend PigStorage:
> >>>> 
> >>>> public class SmallFileStorage extends PigStorage {
> >>>> 
> >>>>   public SmallFileStorage(String delimiter) {
> >>>>       super(delimiter);
> >>>>   }
> >>>> 
> >>>>   @Override
> >>>>   public InputFormat getInputFormat() {
> >>>>       return new NLineInputFormat();
> >>>>   }
> >>>> }
> >>>> 
> >>>> 
> >>>> 
> >>>> 2)  Add command line argument to the Pig command as follows:
> >>>> 
> >>>> -Dmapreduce.input.lineinputformat.linespermap=500000
> >>>> 
> >>>> 
> >>>> 
> >>>> 3)  Use SmallFileStorage in the Pig script as follows:
> >>>> 
> >>>> USING com.xxx.yyy.SmallFileStorage ('\t')
> >>>> 
> >>>> 
> >>>> But this doesn't seem to work.  We still see that everything is going to
> >>>> one mapper.  Before we spend any more time on this, I am wondering if
> >> this
> >>>> is a good approach – OR – if there's a better approach?  Please let me
> >>>> know.  Thanks.
> >>>> 
> >>>> 
> >>>> 
> >> 
> >> 
> 

Re: Loader for small files

Posted by David LaBarbera <da...@localresponse.com>.
What process creates the data in HDFS? You should be able to set the block size there and avoid the copy.

I would test the dfs.block.size on the copy and see if you get the mapper split you want before worrying about optimizing.

David

On Feb 11, 2013, at 2:10 PM, Something Something <ma...@gmail.com> wrote:

> David:  Your suggestion would add an additional step of copying data from
> one place to another.  Not bad, but not ideal.  Is there no way to avoid
> copying of data?
> 
> BTW, we have tried changing the following options to no avail :(
> 
> set pig.splitCombination false;
> 
> & a few other 'dfs' options given below:
> 
> mapreduce.min.split.size
> mapreduce.max.split.size
> 
> Thanks.
> 
> On Mon, Feb 11, 2013 at 10:29 AM, David LaBarbera <
> davidlabarbera@localresponse.com> wrote:
> 
>> You could store your data in smaller block sizes. Do something like
>> hadoop fs HADOOP_OPTS="-Ddfs.block.size=1048576
>> -Dfs.local.block.size=1048576" -cp /org-input /small-block-input
>> You might only need one of those parameters. You can verify the block size
>> with
>> hadoop fsck /small-block-input
>> 
>> In your pig script, you'll probably need to set
>> pig.maxCombinedSplitSize
>> to something around the block size
>> 
>> David
>> 
>> On Feb 11, 2013, at 1:24 PM, Something Something <ma...@gmail.com>
>> wrote:
>> 
>>> Sorry.. Moving 'hbase' mailing list to BCC 'cause this is not related to
>>> HBase.  Adding 'hadoop' user group.
>>> 
>>> On Mon, Feb 11, 2013 at 10:22 AM, Something Something <
>>> mailinglists19@gmail.com> wrote:
>>> 
>>>> Hello,
>>>> 
>>>> We are running into performance issues with Pig/Hadoop because our input
>>>> files are small.  Everything goes to only 1 Mapper.  To get around
>> this, we
>>>> are trying to use our own Loader like this:
>>>> 
>>>> 1)  Extend PigStorage:
>>>> 
>>>> public class SmallFileStorage extends PigStorage {
>>>> 
>>>>   public SmallFileStorage(String delimiter) {
>>>>       super(delimiter);
>>>>   }
>>>> 
>>>>   @Override
>>>>   public InputFormat getInputFormat() {
>>>>       return new NLineInputFormat();
>>>>   }
>>>> }
>>>> 
>>>> 
>>>> 
>>>> 2)  Add command line argument to the Pig command as follows:
>>>> 
>>>> -Dmapreduce.input.lineinputformat.linespermap=500000
>>>> 
>>>> 
>>>> 
>>>> 3)  Use SmallFileStorage in the Pig script as follows:
>>>> 
>>>> USING com.xxx.yyy.SmallFileStorage ('\t')
>>>> 
>>>> 
>>>> But this doesn't seem to work.  We still see that everything is going to
>>>> one mapper.  Before we spend any more time on this, I am wondering if
>> this
>>>> is a good approach – OR – if there's a better approach?  Please let me
>>>> know.  Thanks.
>>>> 
>>>> 
>>>> 
>> 
>> 
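
Related to the question above about what process creates the data: if the files are written to HDFS by your own Java code, a smaller block size can be requested per file at create time, avoiding the separate copy step. A minimal sketch, assuming the standard HDFS FileSystem API; the class name, path, record, and sizes are placeholders rather than values from this thread.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SmallBlockWriter {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        long blockSize = 1048576L;   // 1 MB, matching the example earlier in the thread
        int bufferSize = conf.getInt("io.file.buffer.size", 4096);
        short replication = fs.getDefaultReplication();

        // Request the block size explicitly when the file is created.
        Path out = new Path("/small-block-input/part-00000");   // placeholder path
        FSDataOutputStream stream = fs.create(out, true, bufferSize, replication, blockSize);
        try {
            stream.writeBytes("example record\n");
        } finally {
            stream.close();
        }
    }
}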


Re: Loader for small files

Posted by Something Something <ma...@gmail.com>.
David:  Your suggestion would add an additional step of copying data from
one place to another.  Not bad, but not ideal.  Is there no way to avoid
copying the data?

BTW, we have tried changing the following options to no avail :(

set pig.splitCombination false;

& a few other 'dfs' options given below:

mapreduce.min.split.size
mapreduce.max.split.size

Thanks.

On Mon, Feb 11, 2013 at 10:29 AM, David LaBarbera <
davidlabarbera@localresponse.com> wrote:

> You could store your data in smaller block sizes. Do something like
> hadoop fs HADOOP_OPTS="-Ddfs.block.size=1048576
> -Dfs.local.block.size=1048576" -cp /org-input /small-block-input
> You might only need one of those parameters. You can verify the block size
> with
> hadoop fsck /small-block-input
>
> In your pig script, you'll probably need to set
> pig.maxCombinedSplitSize
> to something around the block size
>
> David
>
> On Feb 11, 2013, at 1:24 PM, Something Something <ma...@gmail.com>
> wrote:
>
> > Sorry.. Moving 'hbase' mailing list to BCC 'cause this is not related to
> > HBase.  Adding 'hadoop' user group.
> >
> > On Mon, Feb 11, 2013 at 10:22 AM, Something Something <
> > mailinglists19@gmail.com> wrote:
> >
> >> Hello,
> >>
> >> We are running into performance issues with Pig/Hadoop because our input
> >> files are small.  Everything goes to only 1 Mapper.  To get around
> this, we
> >> are trying to use our own Loader like this:
> >>
> >> 1)  Extend PigStorage:
> >>
> >> public class SmallFileStorage extends PigStorage {
> >>
> >>    public SmallFileStorage(String delimiter) {
> >>        super(delimiter);
> >>    }
> >>
> >>    @Override
> >>    public InputFormat getInputFormat() {
> >>        return new NLineInputFormat();
> >>    }
> >> }
> >>
> >>
> >>
> >> 2)  Add command line argument to the Pig command as follows:
> >>
> >> -Dmapreduce.input.lineinputformat.linespermap=500000
> >>
> >>
> >>
> >> 3)  Use SmallFileStorage in the Pig script as follows:
> >>
> >> USING com.xxx.yyy.SmallFileStorage ('\t')
> >>
> >>
> >> But this doesn't seem to work.  We still see that everything is going to
> >> one mapper.  Before we spend any more time on this, I am wondering if
> this
> >> is a good approach – OR – if there's a better approach?  Please let me
> >> know.  Thanks.
> >>
> >>
> >>
>
>
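
For reference, a self-contained sketch of the SmallFileStorage loader quoted above, with the missing imports filled in. It assumes the Hadoop new-API NLineInputFormat and the Pig 0.10-era LoadFunc interface; the 500,000 lines-per-map value only mirrors the flag used in the thread, and, as tried above, split combination may need to stay disabled (set pig.splitCombination false;) so the extra splits are not merged back together.

import java.io.IOException;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.pig.builtin.PigStorage;

public class SmallFileStorage extends PigStorage {

    public SmallFileStorage(String delimiter) {
        super(delimiter);
    }

    @Override
    public void setLocation(String location, Job job) throws IOException {
        super.setLocation(location, job);
        // Programmatic equivalent of the
        // -Dmapreduce.input.lineinputformat.linespermap=500000 flag used in
        // the thread; 500000 is simply the value quoted there.
        NLineInputFormat.setNumLinesPerSplit(job, 500000);
    }

    // One split per N lines rather than one split per file or block.
    @Override
    public InputFormat getInputFormat() {
        return new NLineInputFormat();
    }
}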

Re: Loader for small files

Posted by David LaBarbera <da...@localresponse.com>.
You could store your data in smaller block sizes. Do something like
hadoop fs -Ddfs.block.size=1048576 -Dfs.local.block.size=1048576 -cp /org-input /small-block-input
You might only need one of those parameters. You can verify the block size with
hadoop fsck /small-block-input

In your pig script, you'll probably need to set
pig.maxCombinedSplitSize 
to something around the block size

David

On Feb 11, 2013, at 1:24 PM, Something Something <ma...@gmail.com> wrote:

> Sorry.. Moving 'hbase' mailing list to BCC 'cause this is not related to
> HBase.  Adding 'hadoop' user group.
> 
> On Mon, Feb 11, 2013 at 10:22 AM, Something Something <
> mailinglists19@gmail.com> wrote:
> 
>> Hello,
>> 
>> We are running into performance issues with Pig/Hadoop because our input
>> files are small.  Everything goes to only 1 Mapper.  To get around this, we
>> are trying to use our own Loader like this:
>> 
>> 1)  Extend PigStorage:
>> 
>> public class SmallFileStorage extends PigStorage {
>> 
>>    public SmallFileStorage(String delimiter) {
>>        super(delimiter);
>>    }
>> 
>>    @Override
>>    public InputFormat getInputFormat() {
>>        return new NLineInputFormat();
>>    }
>> }
>> 
>> 
>> 
>> 2)  Add command line argument to the Pig command as follows:
>> 
>> -Dmapreduce.input.lineinputformat.linespermap=500000
>> 
>> 
>> 
>> 3)  Use SmallFileStorage in the Pig script as follows:
>> 
>> USING com.xxx.yyy.SmallFileStorage ('\t')
>> 
>> 
>> But this doesn't seem to work.  We still see that everything is going to
>> one mapper.  Before we spend any more time on this, I am wondering if this
>> is a good approach – OR – if there's a better approach?  Please let me
>> know.  Thanks.
>> 
>> 
>> 
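
For reference, the block-size verification can also be done through the Java FileSystem API rather than fsck; a small sketch, reusing the /small-block-input path from the message above.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeCheck {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // List every file under the copied directory and print the block
        // size each one was written with.
        for (FileStatus status : fs.listStatus(new Path("/small-block-input"))) {
            if (!status.isDir()) {
                System.out.println(status.getPath() + " -> "
                        + status.getBlockSize() + " bytes per block");
            }
        }
    }
}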

