Posted to user@pig.apache.org by Vadim Zaliva <lo...@codeminders.com> on 2009/02/10 19:39:24 UTC

controlling split

I am processing about 4GB of data on a 4-node Hadoop cluster using Pig.
The first MAP job it executes generates 80K map tasks. I wonder if this
number is a bit excessive. Does such task granularity make sense? Can I
adjust this without writing custom loaders?


Sincerely,
Vadim

--
"La perfection est atteinte non quand il ne reste rien a ajouter, mais
quand il ne reste rien a enlever."  (Antoine de Saint-Exupery)




Re: controlling split

Posted by Alan Gates <ga...@yahoo-inc.com>.
By default Pig creates one map task per HDFS block.  This is generally
what you want, since Hadoop is good at scheduling that map task on or near
a node that has the data.  But if your job is creating 80k maps for
4GB of data, that means it thinks it has a block size of ~50KB, which
would be really small (I think the default is 64MB, which means you
should get around 64 maps).  You should check the configuration file on the
machine you're launching Pig from to see if the block size is specified
and, if so, what it's set to.
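
A quick way to check this from code (a minimal sketch; the input path is
just an example, and the real answer is in your Hadoop config files):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Prints the configured default block size and the block size recorded
    // for each input file.  The path below is illustrative.
    public class BlockSizeCheck {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();  // picks up the cluster config on the classpath
            FileSystem fs = FileSystem.get(conf);
            System.out.println("default block size: " + fs.getDefaultBlockSize());
            for (FileStatus status : fs.listStatus(new Path("/dfs/data"))) {
                System.out.println(status.getPath() + "  len=" + status.getLen()
                        + "  blockSize=" + status.getBlockSize());
            }
        }
    }

Lots of tiny files would show up here as files whose length is far below
the block size, which is what produces one map per file.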

Alan.

On Feb 10, 2009, at 10:39 AM, Vadim Zaliva wrote:

> I am processing about 4GB of data on a 4-node Hadoop cluster using
> Pig. The first MAP job it executes generates 80K map tasks. I wonder
> if this number is a bit excessive. Does such task granularity make
> sense? Can I adjust this without writing custom loaders?
>
>
> Sincerely,
> Vadim
>
> --
> "La perfection est atteinte non quand il ne reste rien a ajouter, mais
> quand il ne reste rien a enlever."  (Antoine de Saint-Exupery)
>
>
>


Re: controlling split

Posted by Vadim Zaliva <lo...@codeminders.com>.
On Feb 11, 2009, at 23:38 , Kevin Weil wrote:

> Perhaps also consider putting something between HDFS and whatever is
> writing all your small files that can aggregate them into larger
> chunks.  You can separate the individual messages with a separator GUID
> or similar (tuning its length according to your error tolerance) and
> keep appending until the file reaches a decent size (on the order of
> your block size is desirable, right?  So 64-128MB).  Then rotate the
> files and begin packing your smaller messages into a new block.  We do
> this prior to putting the data in HDFS, so no MR is required.

This is pretty much what I did. After aggregating the files prior to
adding them to HDFS, I got an almost 10x performance increase.

Vadim

--
"La perfection est atteinte non quand il ne reste rien a ajouter, mais
quand il ne reste rien a enlever."  (Antoine de Saint-Exupery)




Re: controlling split

Posted by Kevin Weil <ke...@gmail.com>.
Perhaps also consider putting something between HDFS and whatever is
writing all your small files that can aggregate them into larger chunks.
You can separate the individual messages with a separator GUID or similar
(tuning its length according to your error tolerance) and keep appending
until the file reaches a decent size (on the order of your block size is
desirable, right?  So 64-128MB).  Then rotate the files and begin packing
your smaller messages into a new block.  We do this prior to putting the
data in HDFS, so no MR is required.  You end up with block-sized files,
and it's pretty straightforward to write a custom LoadFunc in Pig to deal
with your new packed files.  If you use a good stream-searching algorithm
like Knuth-Morris-Pratt to move between separators, this approach ends up
having pretty good performance.  Certainly better than having 80k maps!
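
A minimal sketch of such a packer, run on the files before they are
uploaded (class names, paths, and the target size here are illustrative,
not a fixed recipe):

    import java.io.*;
    import java.util.UUID;

    // Packs many small message files into roughly block-sized chunks,
    // writing a GUID line between messages so readers can find boundaries.
    public class MessagePacker {
        private static final long TARGET_BYTES = 64L * 1024 * 1024;  // about one HDFS block
        private static final String SEPARATOR = UUID.randomUUID().toString();

        public static void main(String[] args) throws IOException {
            File inDir = new File(args[0]);
            File outDir = new File(args[1]);
            File[] files = inDir.listFiles();
            if (files == null) return;
            int chunk = 0;
            long written = 0;
            OutputStream out = open(outDir, chunk);
            for (File f : files) {
                if (!f.isFile()) continue;
                byte[] payload = readFully(f);
                if (written > 0 && written + payload.length > TARGET_BYTES) {
                    out.close();                   // rotate once the chunk is block-sized
                    out = open(outDir, ++chunk);
                    written = 0;
                }
                out.write(payload);
                byte[] sep = ("\n" + SEPARATOR + "\n").getBytes("UTF-8");
                out.write(sep);                    // GUID marker between messages
                written += payload.length + sep.length;
            }
            out.close();
        }

        private static OutputStream open(File dir, int n) throws IOException {
            return new BufferedOutputStream(new FileOutputStream(new File(dir, "packed-" + n)));
        }

        private static byte[] readFully(File f) throws IOException {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            InputStream in = new FileInputStream(f);
            try {
                byte[] b = new byte[8192];
                int r;
                while ((r = in.read(b)) != -1) buf.write(b, 0, r);
            } finally {
                in.close();
            }
            return buf.toByteArray();
        }
    }

The rotated chunks can then be copied into HDFS with hadoop fs -put, so
each upload is roughly one block.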

You can avoid writing your own slicer if, in bindTo, you always walk
slightly beyond the file segment the LoadFunc is given and end the read at
the first separator following the boundary.
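
For the separator scanning itself, a small Knuth-Morris-Pratt matcher over
a stream is enough.  A rough, self-contained sketch, not tied to any
particular LoadFunc API:

    import java.io.IOException;
    import java.io.InputStream;

    // Scans a stream for the next occurrence of a separator using KMP.
    public class SeparatorScanner {
        private final byte[] pattern;
        private final int[] failure;   // KMP failure function

        public SeparatorScanner(byte[] pattern) {
            this.pattern = pattern;
            this.failure = new int[pattern.length];
            for (int i = 1, k = 0; i < pattern.length; i++) {
                while (k > 0 && pattern[k] != pattern[i]) k = failure[k - 1];
                if (pattern[k] == pattern[i]) k++;
                failure[i] = k;
            }
        }

        // Returns the number of bytes consumed up to and including the next
        // separator, or -1 if the stream ends first.
        public long skipToNextSeparator(InputStream in) throws IOException {
            long consumed = 0;
            int matched = 0;
            int b;
            while ((b = in.read()) != -1) {
                consumed++;
                while (matched > 0 && pattern[matched] != (byte) b) matched = failure[matched - 1];
                if (pattern[matched] == (byte) b) matched++;
                if (matched == pattern.length) return consumed;  // full separator matched
            }
            return -1;
        }
    }

A loader could call skipToNextSeparator once at the start of its assigned
segment to align on a message boundary, then read message bytes up to each
subsequent separator.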

Kevin

On Wed, Feb 11, 2009 at 9:25 AM, Alan Gates <ga...@yahoo-inc.com> wrote:

> You're correct that the issue with aggregating in the slicer is that your
> reads won't be efficient, because they'll mostly go over the network.  The
> downside of having 80k maps is that it will take Hadoop several minutes to
> start the job and possibly run your JobTracker out of memory.
>
> I'm not an ops guy, so this may be a totally wrong answer.  But my thinking
> would be that if you're going to read this data in consolidated form many
> times, you should run one MapReduce job just to consolidate it and then run
> your many Pig jobs over the consolidated data.  You'll have to write a Java
> MapReduce job to do the consolidation, since Pig won't force an empty reduce
> job (which you'll need to redistribute the data).  This also assumes you
> have enough disk space to store your data twice.
>
> Alan.
>
>
>
> On Feb 10, 2009, at 11:28 AM, Vadim Zaliva wrote:
>
>
>> On Feb 10, 2009, at 11:07 , Olga Natkovich wrote:
>>
>>> By default, files don't get consolidated. You would need a custom slicer
>>> to accomplish this.
>>
>>
>>
>> I am not sure that would be very efficient, since combining files which
>> are potentially located on different cluster machines would involve a
>> lot of copying.
>>
>> What is the best way to do this? I have data coming in automatically
>> every hour. Should I try to consolidate it on import to DFS, or would a
>> custom slicer be just as good a solution?
>>
>> Vadim
>>
>>
>> --
>> "La perfection est atteinte non quand il ne reste rien a ajouter, mais
>> quand il ne reste rien a enlever."  (Antoine de Saint-Exupery)
>>
>>
>>
>>
>

Re: controlling split

Posted by Alan Gates <ga...@yahoo-inc.com>.
You're correct that the issue with aggregating in the slicer is that
your reads won't be efficient, because they'll mostly go over the
network.  The downside of having 80k maps is that it will take Hadoop
several minutes to start the job and possibly run your JobTracker out
of memory.

I'm not an ops guy, so this may be a totally wrong answer.  But my
thinking would be that if you're going to read this data in
consolidated form many times, you should run one MapReduce job just
to consolidate it and then run your many Pig jobs over the
consolidated data.  You'll have to write a Java MapReduce job to do
the consolidation, since Pig won't force an empty reduce job (which
you'll need to redistribute the data).  This also assumes you have
enough disk space to store your data twice.
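
A rough sketch of what such a consolidation job could look like with the
old org.apache.hadoop.mapred API (class names and the reducer count are
just placeholders):

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.Random;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    // Repacks many small text files into a handful of large ones by pushing
    // every line through a reduce phase.  Line order is not preserved, which
    // is fine when the only goal is to redistribute data into bigger files.
    public class Consolidate {

        public static class ScatterMapper extends MapReduceBase
                implements Mapper<LongWritable, Text, IntWritable, Text> {
            private final Random rand = new Random();
            public void map(LongWritable offset, Text line,
                            OutputCollector<IntWritable, Text> out, Reporter reporter)
                    throws IOException {
                // Random key spreads the lines evenly across the reducers.
                out.collect(new IntWritable(rand.nextInt(Integer.MAX_VALUE)), line);
            }
        }

        public static class DropKeyReducer extends MapReduceBase
                implements Reducer<IntWritable, Text, NullWritable, Text> {
            public void reduce(IntWritable key, Iterator<Text> values,
                               OutputCollector<NullWritable, Text> out, Reporter reporter)
                    throws IOException {
                // NullWritable keys are omitted by TextOutputFormat, so only
                // the original lines end up in the output files.
                while (values.hasNext()) out.collect(NullWritable.get(), values.next());
            }
        }

        public static void main(String[] args) throws IOException {
            JobConf conf = new JobConf(Consolidate.class);
            conf.setJobName("consolidate-small-files");
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));
            conf.setMapperClass(ScatterMapper.class);
            conf.setReducerClass(DropKeyReducer.class);
            conf.setMapOutputKeyClass(IntWritable.class);
            conf.setMapOutputValueClass(Text.class);
            conf.setOutputKeyClass(NullWritable.class);
            conf.setOutputValueClass(Text.class);
            conf.setNumReduceTasks(8);   // one output file per reducer; pick to taste
            JobClient.runJob(conf);
        }
    }

Each reducer writes one output file, so eight reducers would turn the 80k
inputs into eight large files that Pig can then read with far fewer maps.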

Alan.


On Feb 10, 2009, at 11:28 AM, Vadim Zaliva wrote:

>
> On Feb 10, 2009, at 11:07 , Olga Natkovich wrote:
>
>> By default, files don't get consolidated. You would need a custom  
>> slicer
>> to accomplish this.
>
>
> I am not sure that would be very efficient, since combining files which
> are potentially located on different cluster machines would involve a
> lot of copying.
>
> What is the best way to do this? I have data coming in automatically
> every hour. Should I try to consolidate it on import to DFS, or would a
> custom slicer be just as good a solution?
>
> Vadim
>
>
> --
> "La perfection est atteinte non quand il ne reste rien a ajouter, mais
> quand il ne reste rien a enlever."  (Antoine de Saint-Exupery)
>
>
>


Re: controlling split

Posted by Vadim Zaliva <lo...@codeminders.com>.
On Feb 10, 2009, at 11:07 , Olga Natkovich wrote:

> By default, files don't get consolidated. You would need a custom  
> slicer
> to accomplish this.


I am not sure that would be very efficient, since combining files which
are potentially located on different cluster machines would involve a
lot of copying.

What is the best way to do this? I have data coming in automatically
every hour. Should I try to consolidate it on import to DFS, or would a
custom slicer be just as good a solution?

Vadim


--
"La perfection est atteinte non quand il ne reste rien a ajouter, mais
quand il ne reste rien a enlever."  (Antoine de Saint-Exupery)




RE: controlling split

Posted by Olga Natkovich <ol...@yahoo-inc.com>.
By default, files don't get consolidated. You would need a custom slicer
to accomplish this.

Olga 

> -----Original Message-----
> From: Vadim Zaliva [mailto:lord@codeminders.com] 
> Sent: Tuesday, February 10, 2009 10:59 AM
> To: pig-user@hadoop.apache.org
> Subject: Re: controlling split
> 
> 
> On Feb 10, 2009, at 10:44 , Yiping Han wrote:
> 
> > Can you check what is the block size of your input data?
> > For 4GB data, 80K map tasks do not make sense.
> 
> 
> I have a lot of small files (about 80K), which I am loading using a
> wildcard:
> 
> raw = LOAD '/dfs/data/*' USING PigStorage('\t') ...
> 
> I was assuming it would consolidate them and treat them as a single
> dataset? I guess not.
> 
> Is there a way to remedy this?
> 
> Vadim
> 
> --
> "La perfection est atteinte non quand il ne reste rien a 
> ajouter, mais quand il ne reste rien a enlever."  (Antoine de 
> Saint-Exupery)
> 
> 
> 
> 

Re: controlling split

Posted by Vadim Zaliva <lo...@codeminders.com>.
On Feb 10, 2009, at 10:44 , Yiping Han wrote:

> Can you check what is the block size of your input data? For 4GB
> data, 80K map tasks do not make sense.


I have a lot of small files (about 80K), which I am loading using a
wildcard:

raw = LOAD '/dfs/data/*' USING PigStorage('\t') ...

I was assuming it would consolidate them and treat them as a single
dataset? I guess not.

Is there a way to remedy this?

Vadim

--
"La perfection est atteinte non quand il ne reste rien a ajouter, mais
quand il ne reste rien a enlever."  (Antoine de Saint-Exupery)




Re: controlling split

Posted by Yiping Han <yh...@yahoo-inc.com>.
Vadim,

Can you check what is the block size of your input data? For 4GB data, 80K
map tasks do not make sense.

--Yiping


On 2/10/09 10:39 AM, "Vadim Zaliva" <lo...@codeminders.com> wrote:

> I am processing about 4GB of data on a 4-node Hadoop cluster using Pig.
> The first MAP job it executes generates 80K map tasks. I wonder if this
> number is a bit excessive. Does such task granularity make sense? Can I
> adjust this without writing custom loaders?
> 
> 
> Sincerely,
> Vadim
> 
> --
> "La perfection est atteinte non quand il ne reste rien a ajouter, mais
> quand il ne reste rien a enlever."  (Antoine de Saint-Exupery)
> 
> 
> 

----------------------
Yiping Han
2MC 8127
2811 Mission College Blvd.,
Santa Clara, CA 95054
(408)349-4403
yhan@yahoo-inc.com