Posted to mapreduce-user@hadoop.apache.org by Steve Lewis <lo...@gmail.com> on 2011/06/23 00:16:02 UTC

Algorithm for cross product

Assume I have two data sources A and B
Assume I have an input format and can generate key values for both A and B
I want an algorithm which will generate the cross product of all values in
A having the key K and all values in B having the key K.
Currently I use a mapper to generate key values for A and  have the reducer
get all values in B with key K and hold them in memory.
It works but might not scale.
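For concreteness, the current reduce-side approach can be simulated in plain Python (a toy sketch, not Hadoop code; the record values are invented):

```python
from collections import defaultdict

# Stand-in for the current approach: gather all of B's values for each key
# in memory, then cross each A value with the B values sharing its key.
# The per-key list of B values is exactly what may not fit in reducer memory.
def cross_product_per_key(a_records, b_records):
    b_by_key = defaultdict(list)
    for key, value in b_records:
        b_by_key[key].append(value)   # all B values for a key held in memory

    for key, a_value in a_records:
        for b_value in b_by_key[key]:
            yield key, (a_value, b_value)

A = [("k1", "a1"), ("k1", "a2"), ("k2", "a3")]
B = [("k1", "b1"), ("k1", "b2"), ("k2", "b3")]
pairs = list(cross_product_per_key(A, B))
# k1 contributes 2 x 2 = 4 pairs, k2 contributes 1 x 1 = 1 pair
```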

Any bright ideas?

-- 
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com

Re: Algorithm for cross product

Posted by Lance Norskog <go...@gmail.com>.
If you have scaling problems, check out the Mahout project. They are
all about distributed scalable linear algebra & more.
http://mahout.apache.org

Lance




-- 
Lance Norskog
goksron@gmail.com

Re: Algorithm for cross product

Posted by Jason <ur...@gmail.com>.
> One elegant solution would involve an ability to restart one of the input splits and replay the input data from set A several times until the mapper had generated all pairs of the form <key, (a_i, b_j)>

By replaying A several times, did you actually mean reading set A once per record of B?
Because in that case you may end up with a lot of IO, depending on the sizes of your sets and of a single record. With the partitioning approach, IO reads are naturally reduced by choosing a larger partition size.
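As a back-of-envelope comparison of the two approaches (all sizes below are made up purely for illustration):

```python
# Rough IO comparison under illustrative, invented sizes.
size_A = 10          # GB read per full scan of A
size_B = 40          # GB read per full scan of B
records_B = 1_000    # number of records in B (replay rescans A per B record)
partitions = 8       # number of A partitions in the partitioned approach

# Replay approach: A is re-read once per record of B, plus one scan of B.
replay_io = records_B * size_A + size_B

# Partitioned approach: A is read once; B is duplicated to every partition.
partitioned_io = size_A + partitions * size_B

# Fewer, larger partitions shrink the duplicated-B term, which is the
# point made above about choosing the partition size larger.
```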

Sent from my iPhone


Re: Algorithm for cross product

Posted by Steve Lewis <lo...@gmail.com>.
The approach you suggest is similar to what I am currently doing, but it
requires you to size the partitions to the memory available on the reducer.
That is a non-trivial task and is not guaranteed to scale. It is true that
the simplest approach is to break one of the sets into partitions small
enough to hold in memory and then generate the Cartesian product, but it is
a hack and makes assumptions about partition size.
One elegant solution would involve an ability to restart one of the input
splits and replay the input data from set A several times until the
mapper had generated all pairs of the form <key, (a_i, b_j)>
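That replay idea might look roughly like this in plain Python (a toy sketch, not Hadoop code; `read_a` is a hypothetical stand-in for a restartable reader over set A's input split):

```python
# For every record of B, restart the reader over A and emit the pairs
# whose keys match. Memory use stays tiny, but A is re-read once per
# record of B, which can dominate IO when B is large.
def replayed_cross_product(read_a, b_records):
    for key_b, b_value in b_records:
        for key_a, a_value in read_a():   # replay A from the start
            if key_a == key_b:
                yield key_a, (a_value, b_value)

A = [("k1", "a1"), ("k1", "a2"), ("k2", "a3")]
B = [("k1", "b1"), ("k2", "b2")]
pairs = list(replayed_cross_product(lambda: iter(A), B))
# each matching (a_i, b_j) pair under a shared key appears exactly once
```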



Re: Algorithm for cross product

Posted by Jason <ur...@gmail.com>.
I remember I had a similar problem.
The way I approached it was by partitioning one of the data sets. At a high level these are the steps:

Suppose you decide to partition set A.

Each partition represents a subset/range of the A keys and must be small enough for its records to fit in memory.

Each partition gets sent to a separate reducer by the mapper and partitioner logic.

The second data set, B, is then *duplicated* to each of the reducers, again using some trivial logic in the mapper and partitioner.

This assumes that the reducers can process records from both the A and B sets. It also requires that all records from A precede those from B, which is trivially arranged by the sort comparator.

When a reducer receives a record from set A, it stores it in memory.
When a record from set B arrives, the cross product is computed with all A records already in memory and the results are emitted.

The job should scale in space as long as you have enough reducers assigned, and will scale in time with more reducer machines.
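The steps above can be sketched in plain Python (a toy simulation, not Hadoop code; the partitioner and tags are invented for illustration):

```python
from collections import defaultdict

def partition_of(key, num_partitions):
    return hash(key) % num_partitions   # stand-in for a Partitioner

def partitioned_cross_product(a_records, b_records, num_partitions=2):
    # Map side: each A record goes to exactly one partition; every B record
    # is *duplicated* to all partitions.
    buckets = defaultdict(list)
    for key, value in a_records:
        buckets[partition_of(key, num_partitions)].append(("A", key, value))
    for key, value in b_records:
        for p in range(num_partitions):
            buckets[p].append(("B", key, value))

    # Reduce side: sort so all A records precede B records (the sort
    # comparator trick), cache A in memory, cross with each arriving B.
    for p, records in buckets.items():
        records.sort(key=lambda r: r[0])   # "A" sorts before "B"
        cached_a = []
        for tag, key, value in records:
            if tag == "A":
                cached_a.append((key, value))
            else:
                for a_key, a_value in cached_a:
                    if a_key == key:
                        yield key, (a_value, value)

A = [("k1", "a1"), ("k1", "a2"), ("k2", "a3")]
B = [("k1", "b1"), ("k1", "b2"), ("k2", "b3")]
pairs = sorted(partitioned_cross_product(A, B))
# each (a, b) pair is produced once, in the partition holding a's key
```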


Sent from my iPhone


Re: Algorithm for cross product

Posted by John Armstrong <jo...@ccri.com>.

I was just thinking about a more general version of this problem.

If we think of a mapper as saying "for all <foo> in <set A> do
<something>", then what if we have two inputs?  That is, what if we want to
map over the Cartesian product of two input sets?  "for all <(foo, bar)> in
<set A x set B> do <something>"

I'm thinking this should be a new InputFormat, which takes two
InputFormats (with their configuration information) as parameters.  Then an
InputSplit for the product input is an InputSplit from A "times" an
InputSplit from B.  I haven't worked out the details, but this should be
the basic idea.

Thoughts?
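For what it's worth, the product-of-inputs idea could be sketched in plain Python (names and interfaces invented for the sketch; these are not actual Hadoop APIs):

```python
from itertools import product

# Splits of the product input are pairs (split_a, split_b); reading one such
# pair yields the cartesian product of the records in its two member splits.
def product_splits(splits_a, splits_b):
    return list(product(splits_a, splits_b))

def read_product_split(split_pair, read_split):
    split_a, split_b = split_pair
    for foo in read_split(split_a):
        for bar in read_split(split_b):   # re-read B's split for each foo
            yield (foo, bar)

splits_a = [["a1", "a2"], ["a3"]]         # two splits of input A
splits_b = [["b1"], ["b2", "b3"]]         # two splits of input B
pairs = [
    rec
    for sp in product_splits(splits_a, splits_b)
    for rec in read_product_split(sp, iter)
]
# every (ai, bj) combination appears exactly once across the product splits
```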