Posted to common-user@hadoop.apache.org by Peter Skomoroch <pe...@gmail.com> on 2009/04/06 04:36:17 UTC

Re: Hadoop streaming performance: elements vs. vectors

If I need to use a custom streaming combiner jar in Hadoop 0.18.3, is there a
way to add it to the classpath without the following patch?

https://issues.apache.org/jira/browse/HADOOP-3570

http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200809.mbox/%3C48CF78E3.10807@yahoo-inc.com%3E

On Sat, Mar 28, 2009 at 2:28 PM, Peter Skomoroch
<pe...@gmail.com>wrote:

> Paco,
>
> Thanks, good ideas on the combiner.  I'm going to tweak things a bit as you
> suggest and report back later...
>
> -Pete
>
>
> On Sat, Mar 28, 2009 at 11:43 AM, Paco NATHAN <ce...@gmail.com> wrote:
>
>> hi peter,
>> thinking aloud on this -
>>
>> trade-offs may depend on:
>>
>>   * how much grouping would be possible (tracking a PDF would be
>> interesting for metrics)
>>   * locality of key/value pairs (distributed among mapper and reducer
>> tasks)
>>
>> to that point, will there be much time spent in the shuffle?  if so,
>> it's probably cheaper to shuffle/sort the grouped row vectors than the
>> many small key,value pairs
>>
>> in any case, when i had a similar situation on a large data set (2-3
>> TB shuffle) a good pattern to follow was:
>>
>>   * mapper emitted small key,value pairs
>>   * combiner grouped into row vectors
>>
>> that combiner may get invoked both at the end of the map phase and at
>> the beginning of the reduce phase (more benefit)
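
To make the pattern above concrete, here is a rough sketch (mine, not code from the thread) of the combiner-side grouping step in Python; the tab-separated "i<TAB>j,value" input format and the ";"-joined posting-list output are assumptions for illustration:

```python
# Sketch of a streaming-style combiner that groups sorted element lines
# ("i<TAB>j,value") into one posting-list row vector per key i.
# Line formats here are illustrative, not from the original thread.
from itertools import groupby

def group_rows(lines):
    """Group sorted element lines into one row-vector line per key."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, elems in groupby(parsed, key=lambda kv: kv[0]):
        # Join the (j, value) entries into a single posting-list value.
        yield "%s\t%s" % (key, ";".join(value for _, value in elems))

# In a real streaming job this would read sys.stdin and write sys.stdout:
#   for row in group_rows(sys.stdin):
#       print(row)
```

Because Hadoop delivers the map output to the combiner sorted by key, consecutive grouping with `groupby` is sufficient; no dictionary buffering is needed.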
>>
>> also, representing values as byte arrays where possible can save
>> much shuffle time
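
As an illustration of the byte-array idea (my sketch, not Paco's code): packing each (j, value) pair into fixed-width binary and hex-encoding it keeps values compact while staying line-safe for streaming.

```python
# Illustrative sketch only: pack (j, value) pairs as a 4-byte big-endian int
# plus an 8-byte double each, then hex-encode so the value stays line-oriented.
import binascii
import struct

def pack_row(elems):
    """Pack a list of (j, value) pairs into a hex string (12 bytes per pair)."""
    raw = b"".join(struct.pack(">id", j, v) for j, v in elems)
    return binascii.hexlify(raw).decode("ascii")

def unpack_row(hexstr):
    """Recover the list of (j, value) pairs from a packed hex string."""
    raw = binascii.unhexlify(hexstr)
    return [struct.unpack_from(">id", raw, off) for off in range(0, len(raw), 12)]
```

Whether this beats a plain text encoding depends on value sizes; for short decimal values the hex overhead can cancel the gain, so it is worth measuring on real data.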
>>
>> best,
>> paco
>>
>>
>> On Sat, Mar 28, 2009 at 01:51, Peter Skomoroch
>> <pe...@gmail.com> wrote:
>> > Hadoop streaming question: If I am forming a matrix M by summing a
>> > number of elements generated on different mappers, is it better to
>> > emit tons of lines from the mappers with small key,value pairs for
>> > each element, or should I group them into row vectors before sending
>> > to the reducers?
>> >
>> > For example, say I'm summing frequency count matrices M for each user
>> > on a different map task, and the reducer combines the resulting sparse
>> > user count matrices for use in another calculation.
>> >
>> > Should I emit the individual elements:
>> >
>> > i (j, Mij) \n
>> > 3 (1, 3.4) \n
>> > 3 (2, 3.4) \n
>> > 3 (3, 3.4) \n
>> > 4 (1, 2.3) \n
>> > 4 (2, 5.2) \n
>> >
>> > Or posting list style vectors?
>> >
>> > 3 ((1, 3.4), (2, 3.4), (3, 3.4)) \n
>> > 4 ((1, 2.3), (2, 5.2)) \n
>> >
>> > Using vectors will at least save some message space, but are there any
>> > other benefits to this approach in terms of Hadoop streaming overhead
>> > (sorts etc.)?  I think buffering issues will not be a huge concern,
>> > since the length of the vectors has a reasonable upper bound and they
>> > will be in a sparse format...
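
The two emission styles being compared can be sketched as follows (my code, not Peter's; the exact line formats are assumptions based on the examples above):

```python
# Sketch of the two mapper emission styles for a sparse row {j: value}
# belonging to user i. Line formats are illustrative only.
import sys

def emit_elements(i, row, out=sys.stdout):
    """One small key,value pair per nonzero element: "i<TAB>j,value"."""
    for j, v in sorted(row.items()):
        out.write("%d\t%d,%s\n" % (i, j, v))

def emit_vector(i, row, out=sys.stdout):
    """Posting-list style: one line carrying the whole row vector."""
    pairs = ";".join("%d,%s" % (j, v) for j, v in sorted(row.items()))
    out.write("%d\t%s\n" % (i, pairs))
```

The element style leaves all grouping to the shuffle; the vector style does part of that work in the mapper, at the cost of holding one row in memory at a time.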
>> >
>> >
>> > --
>> > Peter N. Skomoroch
>> > 617.285.8348
>> > http://www.datawrangling.com
>> > http://delicious.com/pskomoroch
>> > http://twitter.com/peteskomoroch
>> >
>>
>
>
>
> --
> Peter N. Skomoroch
> 617.285.8348
> http://www.datawrangling.com
> http://delicious.com/pskomoroch
> http://twitter.com/peteskomoroch
>



-- 
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskomoroch
http://twitter.com/peteskomoroch

Re: Hadoop streaming performance: elements vs. vectors

Posted by Peter Skomoroch <pe...@gmail.com>.
Amareshwari,

Thanks for the suggestion. Can you show a streaming jobconf that uses
"mapred.job.classpath.archives" to add a custom combiner to the classpath?

I've tried several variations, but the jar doesn't seem to get added to the
classpath properly...

-Pete

On Mon, Apr 6, 2009 at 12:17 AM, Amareshwari Sriramadasu <
amarsri@yahoo-inc.com> wrote:

> You can add your jar to the distributed cache and add it to the classpath by
> passing it in the configuration property "mapred.job.classpath.archives".
>
> -Amareshwari


-- 
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskomoroch
http://twitter.com/peteskomoroch

Re: Hadoop streaming performance: elements vs. vectors

Posted by Amareshwari Sriramadasu <am...@yahoo-inc.com>.
You can add your jar to the distributed cache and add it to the classpath by
passing it in the configuration property "mapred.job.classpath.archives".

-Amareshwari
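
A sketch of how that property might be passed to a streaming job (untested; the paths, class name, and jar name are placeholders, and elsewhere in this thread Peter reports that his attempts with this property did not work, so treat it as a starting point only):

```shell
# Untested sketch; jar path, combiner class, and file names are placeholders.
# Assumes the combiner jar has already been copied into HDFS.
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.18.3-streaming.jar \
  -input input/ -output output/ \
  -mapper mapper.py -reducer reducer.py \
  -combiner org.example.MyCombiner \
  -cacheArchive hdfs:///user/peter/lib/mycombiner.jar#mycombiner.jar \
  -jobconf mapred.job.classpath.archives=/user/peter/lib/mycombiner.jar
```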