Posted to common-user@hadoop.apache.org by Peter Skomoroch <pe...@gmail.com> on 2009/03/28 09:51:14 UTC

Hadoop streaming performance: elements vs. vectors

Hadoop streaming question: If I am forming a matrix M by summing a number of
elements generated on different mappers, is it better to emit tons of lines
from the mappers with small key,value pairs for each element, or should I
group them into row vectors before sending to the reducers?

For example, say I'm summing frequency count matrices M for each user on a
different map task, and the reducer combines the resulting sparse user count
matrices for use in another calculation.

Should I emit the individual elements:

i (j, Mij) \n
3 (1, 3.4) \n
3 (2, 3.4) \n
3 (3, 3.4) \n
4 (1, 2.3) \n
4 (2, 5.2) \n

Or posting-list-style vectors?

3 ((1, 3.4), (2, 3.4), (3, 3.4)) \n
4 ((1, 2.3), (2, 5.2)) \n

Using vectors will at least save some message space, but are there any other
benefits to this approach in terms of Hadoop streaming overhead (sorts
etc.)?  I think buffering issues will not be a huge concern, since the
vectors have a reasonable upper bound on length and will be stored in a
sparse format...


-- 
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskomoroch
http://twitter.com/peteskomoroch

Re: Hadoop streaming performance: elements vs. vectors

Posted by Peter Skomoroch <pe...@gmail.com>.
Amareshwari,

Thanks for the suggestion. Can you show a streaming jobconf that uses
"mapred.job.classpath.archives" to add a custom combiner to the classpath?

I've tried several variations, but the jar doesn't seem to get added to the
classpath properly...

-Pete

On Mon, Apr 6, 2009 at 12:17 AM, Amareshwari Sriramadasu <
amarsri@yahoo-inc.com> wrote:

> You can add your jar to the distributed cache and add it to the classpath by
> passing it in the configuration property "mapred.job.classpath.archives".
>
> -Amareshwari


-- 
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskomoroch
http://twitter.com/peteskomoroch

Re: Hadoop streaming performance: elements vs. vectors

Posted by Amareshwari Sriramadasu <am...@yahoo-inc.com>.
You can add your jar to the distributed cache and add it to the classpath by
passing it in the configuration property "mapred.job.classpath.archives".
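For example, an invocation along these lines might do it (the jar path, the
combiner class, and the script names below are illustrative placeholders, not
tested values; 0.18.x streaming takes configuration via -jobconf):

```shell
# Illustrative sketch only: every path, jar name, and class name is a placeholder.
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.18.3-streaming.jar \
  -jobconf mapred.job.classpath.archives=/user/pete/lib/mycombiner.jar \
  -input  /user/pete/matrix_input \
  -output /user/pete/matrix_output \
  -mapper mapper.py \
  -combiner com.example.MyCombiner \
  -reducer reducer.py \
  -file mapper.py -file reducer.py
```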

-Amareshwari
Peter Skomoroch wrote:
> If I need to use a custom streaming combiner jar in Hadoop 18.3, is there a
> way to add it to the classpath without the following patch?
>
> https://issues.apache.org/jira/browse/HADOOP-3570
>
> http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200809.mbox/%3C48CF78E3.10807@yahoo-inc.com%3E


Re: Hadoop streaming performance: elements vs. vectors

Posted by Peter Skomoroch <pe...@gmail.com>.
If I need to use a custom streaming combiner jar in Hadoop 18.3, is there a
way to add it to the classpath without the following patch?

https://issues.apache.org/jira/browse/HADOOP-3570

http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200809.mbox/%3C48CF78E3.10807@yahoo-inc.com%3E




-- 
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskomoroch
http://twitter.com/peteskomoroch

Re: Hadoop streaming performance: elements vs. vectors

Posted by Peter Skomoroch <pe...@gmail.com>.
Paco,

Thanks, good ideas on the combiner.  I'm going to tweak things a bit as you
suggest and report back later...

-Pete




-- 
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskomoroch
http://twitter.com/peteskomoroch

Re: Hadoop streaming performance: elements vs. vectors

Posted by Paco NATHAN <ce...@gmail.com>.
hi peter,
thinking aloud on this -

trade-offs may depend on:

   * how much grouping would be possible (tracking a PDF would be
interesting for metrics)
   * locality of key/value pairs (distributed among mapper and reducer tasks)

to that point, will there be much time spent in the shuffle?  if so,
it's probably cheaper to shuffle/sort the grouped row vectors than the
many small key,value pairs

in any case, when i had a similar situation on a large data set (2-3
TB shuffle) a good pattern to follow was:

   * mapper emitted small key,value pairs
   * combiner grouped into row vectors

that combiner may get invoked both at the end of the map phase and at
the beginning of the reduce phase (more benefit)

also, representing values as byte arrays where possible can save a
lot of shuffle time

best,
paco

