Posted to common-user@hadoop.apache.org by zheyi rong <zh...@gmail.com> on 2013/04/18 11:47:48 UTC

Cartesian product in hadoop

Dear all,

I am writing to ask for ideas on computing the Cartesian product of two
datasets in Hadoop. Specifically, I have two datasets, each containing
20 million lines, and I want to compare their lines pairwise.

Most of the comparison results are filtered out by a function (we do not
output the whole Cartesian product, only a small part of it).

I guess one good way is to pass one block from dataset1 and one block
from dataset2 to each mapper, and let the mapper compute the product in
memory to avoid I/O.

Any suggestions?
Thank you very much.

Regards,
Zheyi Rong
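
For scale, a quick back-of-the-envelope count (the per-core throughput
below is purely an illustrative assumption):

\[
20{,}000{,}000 \times 20{,}000{,}000 = 4 \times 10^{14} \ \text{pairs}.
\]

At an assumed \(10^{7}\) comparisons per second per core, the comparisons
alone need

\[
\frac{4 \times 10^{14}}{10^{7}\ \text{s}^{-1}} = 4 \times 10^{7}\ \text{core-seconds} \approx 460 \ \text{core-days},
\]

which is why the replies below concentrate on pruning the product rather
than computing it exhaustively.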

Re: Cartesian product in hadoop

Posted by zheyi rong <zh...@gmail.com>.
Hi Ted Dunning,

Could you please tell me some keywords so that I can google it myself?

Regards,
Zheyi Rong


On Thu, Apr 18, 2013 at 8:52 PM, Ted Dunning <td...@maprtech.com> wrote:

> It is rarely practical to do exhaustive comparisons on datasets of this
> size.
>
> The method used is to heuristically prune the cartesian product set and
> only examine pairs that have a high likelihood of being near.
>
> This can be done in many ways.  Your suggestion of doing a map-side join
> is a reasonable one, but it will be much slower than methods where you can
> prune the comparisons.
>
>
>
> On Thu, Apr 18, 2013 at 9:47 AM, zheyi rong <zh...@gmail.com> wrote:
>
>> Dear all,
>>
>> I am writing to ask for ideas on computing the Cartesian product of two
>> datasets in Hadoop. Specifically, I have two datasets, each containing
>> 20 million lines, and I want to compare their lines pairwise.
>>
>> Most of the comparison results are filtered out by a function (we do not
>> output the whole Cartesian product, only a small part of it).
>>
>> I guess one good way is to pass one block from dataset1 and one block
>> from dataset2 to each mapper, and let the mapper compute the product in
>> memory to avoid I/O.
>>
>> Any suggestions?
>> Thank you very much.
>>
>> Regards,
>> Zheyi Rong
>>
>
>


Re: Cartesian product in hadoop

Posted by Ted Dunning <td...@maprtech.com>.
It is rarely practical to do exhaustive comparisons on datasets of this
size.

The method used is to heuristically prune the cartesian product set and
only examine pairs that have a high likelihood of being near.

This can be done in many ways.  Your suggestion of doing a map-side join is
a reasonable one, but it will be much slower than methods where you can
prune the comparisons.



On Thu, Apr 18, 2013 at 9:47 AM, zheyi rong <zh...@gmail.com> wrote:

> Dear all,
>
> I am writing to kindly ask for ideas of doing cartesian product in hadoop.
> Specifically, now I have two datasets, each of which contains 20million
> lines.
> I want to do cartesian product on these two datasets, comparing lines
> pairwisely.
>
> The output of each comparison can be mostly filtered by a function ( we do
> not output the
> whole result of this cartesian product, but only a small part).
>
> I guess one good way is to pass one block from dataset1 and another block
> from dataset2
> to a mapper, then let the mappers do the product in memory to avoid IO.
>
> Any suggestions?
> Thank you very much.
>
> Regards,
> Zheyi Rong
>
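
Ted does not name a specific pruning method; one common family consistent
with his advice is blocking (the idea behind similarity joins and
locality-sensitive hashing): derive a coarse key from each record and
compare only records that share a key. A minimal Java sketch, in which the
key function is a made-up placeholder:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (blockingKey, record) so that the shuffle groups likely-near
// records together; records sharing no block are never compared.
public class BlockingMapper extends Mapper<LongWritable, Text, Text, Text> {
  private final Text outKey = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    // Placeholder blocking key: the first tab-separated field. A real job
    // would use a domain-specific signature (n-grams, LSH buckets, ...).
    outKey.set(line.toString().split("\t", 2)[0]);
    context.write(outKey, line);
  }
}

A reducer then compares only the pairs within each group, so the candidate
count falls from m*n to the sum of the per-block products.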


Re: Cartesian product in hadoop

Posted by zheyi rong <zh...@gmail.com>.
Hi Ajay Srivastava,

Thank you for your explanation.



Regards,
Zheyi Rong


On Thu, Apr 18, 2013 at 5:18 PM, Ajay Srivastava <Ajay.Srivastava@guavus.com
> wrote:

>
>  The approach I proposed needs m+n I/O for reading the datasets, not
> (m + n + m*n); but the further I/O due to spills, and due to the reducer
> reading mapper output, will be larger, since the number of tuples coming
> out of the mapper is (m + m*n).
>
>
>  Regards,
> Ajay Srivastava
>
>
>  On 18-Apr-2013, at 5:40 PM, zheyi rong wrote:
>
>  Thank you, now I get your point.
>
>  But I wonder whether this approach would be slower than implementing a
> custom InputFormat which, each time, provides a pair of lines to the
> mappers, with the product then done in the mappers?
>
>  Since your approach would need (m + n + m*n) I/O on the mapper side and
> (2*m*n) I/O on the reducer side, while with a custom InputFormat the I/O
> is (m*n).
>
>  I am asking because I have implemented the custom InputFormat, but the
> running time is still intolerable on our small cluster.
>
>
>  Regards,
> Zheyi Rong
>
>
> On Thu, Apr 18, 2013 at 1:45 PM, Ajay Srivastava <
> Ajay.Srivastava@guavus.com> wrote:
>
>>  Yes, that's a crucial part.
>>
>>  Write a class which extends WritableComparator and overrides the compare
>> method.
>> You need to set this class in the job client as
>> job.setGroupingComparatorClass(<your grouping comparator class>).
>>
>>  This will make sure that records having the same Ki are grouped together
>> and go to the same invocation of reduce.
>> I forgot to mention in my previous post that you also need a partitioner
>> which partitions the data on the first part of the key.
>>
>>
>>  Regards,
>> Ajay Srivastava
>>
>>
>>  On 18-Apr-2013, at 4:42 PM, zheyi rong wrote:
>>
>>  Hi Ajay Srivastava,
>>
>>  Thank you for your reply.
>>
>>  Could you please explain a little bit more about "Write a grouping
>> comparator which groups records on the first part of the key, i.e. Ki"?
>> I guess it is a crucial part, which could filter some pairs before
>> passing them to the reducer.
>>
>>
>>  Regards,
>> Zheyi Rong
>>
>>
>> On Thu, Apr 18, 2013 at 12:50 PM, Ajay Srivastava <
>> Ajay.Srivastava@guavus.com> wrote:
>>
>>>  Hi Rong,
>>> You can use the following simple method.
>>>
>>>  Let's say dataset1 has m records, and when you emit these records from
>>> the mapper the keys are K1, K2, ..., Km, one per record. Also add an
>>> identifier for the dataset from which each record is emitted.
>>> So if R1 is a record in dataset1, the mapper will emit key (K1,
>>> DATASET1) and value R1.
>>>
>>>  For dataset2, which has n records, emit m records for each record, with
>>> keys K1, K2, ..., Km and the identifier DATASET2.
>>> So if R1' is a record from dataset2, emit m records with key (Ki,
>>> DATASET2) and value R1', where i runs from 1 to m.
>>>
>>>  Write a grouping comparator which groups records on the first part of
>>> the key, i.e. Ki.
>>>
>>>  In the reducer, each invocation of reduce will see one record from
>>> dataset1 and n records from dataset2. Compute the Cartesian product,
>>> apply the filter, and then output.
>>>
>>>  Note -- you may not know the keys (K1, K2, ..., Km) beforehand. If so,
>>> you need one more pass over dataset1 to identify the keys and store them
>>> for use with dataset2.
>>>
>>>
>>>  Regards,
>>> Ajay Srivastava
>>>
>>>
>>>  On 18-Apr-2013, at 3:51 PM, Azuryy Yu wrote:
>>>
>>>  This is not suitable for his large dataset.
>>>
>>> --Send from my Sony mobile.
>>> On Apr 18, 2013 5:58 PM, "Jagat Singh" <ja...@gmail.com> wrote:
>>>
>>>>  Hi,
>>>>
>>>> Can you have a look at
>>>>
>>>> http://pig.apache.org/docs/r0.11.1/basic.html#cross
>>>>
>>>>  Thanks
>>>>
>>>>
>>>> On Thu, Apr 18, 2013 at 7:47 PM, zheyi rong <zh...@gmail.com> wrote:
>>>>
>>>>> Dear all,
>>>>>
>>>>>  I am writing to ask for ideas on computing the Cartesian product of
>>>>> two datasets in Hadoop. Specifically, I have two datasets, each
>>>>> containing 20 million lines, and I want to compare their lines
>>>>> pairwise.
>>>>>
>>>>>  Most of the comparison results are filtered out by a function (we do
>>>>> not output the whole Cartesian product, only a small part of it).
>>>>>
>>>>>  I guess one good way is to pass one block from dataset1 and one block
>>>>> from dataset2 to each mapper, and let the mapper compute the product in
>>>>> memory to avoid I/O.
>>>>>
>>>>>  Any suggestions?
>>>>> Thank you very much.
>>>>>
>>>>>  Regards,
>>>>> Zheyi Rong
>>>>>
>>>>
>>>>
>>>
>>
>>
>
>
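
A minimal Java sketch of the replicated join Ajay describes in the quoted
message above. Everything here is illustrative: the class names, the
"Ki<TAB>tag" composite key encoding, and the placeholder deriveKey/keep
functions are assumptions, not code from the thread. The Ki-based
partitioner and grouping comparator it relies on are sketched at the end
of the thread.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Composite Text keys are "Ki<TAB>tag", where tag "1" marks dataset1 and
// "2" marks dataset2, so the default sort puts the dataset1 record first
// within each reduce group.
public class ReplicatedCrossJoin {

  // Wired to dataset1 via MultipleInputs in the driver (see note below).
  public static class Dataset1Mapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable off, Text line, Context ctx)
        throws IOException, InterruptedException {
      String ki = deriveKey(line);                       // this record's key Ki
      ctx.write(new Text(ki + "\t1"), new Text("1\t" + line));
    }
  }

  // Wired to dataset2. Replicates every record under all m dataset1 keys.
  public static class Dataset2Mapper extends Mapper<LongWritable, Text, Text, Text> {
    private final List<String> allKeys = new ArrayList<>();

    @Override
    protected void setup(Context ctx) {
      // Assumption: the key list K1..Km from the extra pass over dataset1
      // has been shipped to each task (e.g. distributed cache); load omitted.
    }

    @Override
    protected void map(LongWritable off, Text line, Context ctx)
        throws IOException, InterruptedException {
      for (String ki : allKeys) {                        // m emits per record
        ctx.write(new Text(ki + "\t2"), new Text("2\t" + line));
      }
    }
  }

  public static class CrossReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      String left = null;                                // the dataset1 record
      for (Text v : values) {
        String[] parts = v.toString().split("\t", 2);
        if (parts[0].equals("1")) {
          left = parts[1];                               // first, by sort order
        } else if (left != null && keep(left, parts[1])) {
          ctx.write(new Text(left), new Text(parts[1])); // filtered output only
        }
      }
    }
  }

  // Placeholders for the per-record key and the thread's filter function.
  static String deriveKey(Text line) { return line.toString(); }
  static boolean keep(String a, String b) { return false; }
}

In the driver, the two mappers would be attached with
MultipleInputs.addInputPath(job, path1, TextInputFormat.class,
Dataset1Mapper.class), and likewise for dataset2.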


Re: Cartesian product in hadoop

Posted by Ajay Srivastava <Aj...@guavus.com>.
The approach I proposed needs m+n I/O for reading the datasets, not (m + n + m*n); but the further I/O due to spills, and due to the reducer reading mapper output, will be larger, since the number of tuples coming out of the mapper is (m + m*n).


Regards,
Ajay Srivastava


On 18-Apr-2013, at 5:40 PM, zheyi rong wrote:

Thank you, now I get your point.

But I wonder whether this approach would be slower than implementing a custom InputFormat which, each time, provides a pair of lines to the mappers, with the product then done in the mappers?

Since your approach would need (m + n + m*n) I/O on the mapper side and (2*m*n) I/O on the reducer side, while with a custom InputFormat the I/O is (m*n).

I am asking because I have implemented the custom InputFormat, but the running time is still intolerable on our small cluster.


Regards,
Zheyi Rong


On Thu, Apr 18, 2013 at 1:45 PM, Ajay Srivastava <Aj...@guavus.com> wrote:
Yes, that's a crucial part.

Write a class which extends WritableComparator and overrides the compare method.
You need to set this class in the job client as
job.setGroupingComparatorClass(<your grouping comparator class>).

This will make sure that records having the same Ki are grouped together and go to the same invocation of reduce.
I forgot to mention in my previous post that you also need a partitioner which partitions the data on the first part of the key.


Regards,
Ajay Srivastava


On 18-Apr-2013, at 4:42 PM, zheyi rong wrote:

Hi Ajay Srivastava,

Thank you for your reply.

Could you please explain a little bit more about "Write a grouping comparator which groups records on the first part of the key, i.e. Ki"?
I guess it is a crucial part, which could filter some pairs before passing them to the reducer.


Regards,
Zheyi Rong


On Thu, Apr 18, 2013 at 12:50 PM, Ajay Srivastava <Aj...@guavus.com> wrote:
Hi Rong,
You can use the following simple method.

Let's say dataset1 has m records, and when you emit these records from the mapper the keys are K1, K2, ..., Km, one per record. Also add an identifier for the dataset from which each record is emitted.
So if R1 is a record in dataset1, the mapper will emit key (K1, DATASET1) and value R1.

For dataset2, which has n records, emit m records for each record, with keys K1, K2, ..., Km and the identifier DATASET2.
So if R1' is a record from dataset2, emit m records with key (Ki, DATASET2) and value R1', where i runs from 1 to m.

Write a grouping comparator which groups records on the first part of the key, i.e. Ki.

In the reducer, each invocation of reduce will see one record from dataset1 and n records from dataset2. Compute the Cartesian product, apply the filter, and then output.

Note -- you may not know the keys (K1, K2, ..., Km) beforehand. If so, you need one more pass over dataset1 to identify the keys and store them for use with dataset2.


Regards,
Ajay Srivastava


On 18-Apr-2013, at 3:51 PM, Azuryy Yu wrote:


This is not suitable for his large dataset.

--Send from my Sony mobile.

On Apr 18, 2013 5:58 PM, "Jagat Singh" <ja...@gmail.com> wrote:
Hi,

Can you have a look at

http://pig.apache.org/docs/r0.11.1/basic.html#cross

Thanks


On Thu, Apr 18, 2013 at 7:47 PM, zheyi rong <zh...@gmail.com> wrote:
Dear all,

I am writing to ask for ideas on computing the Cartesian product of two datasets in Hadoop.
Specifically, I have two datasets, each containing 20 million lines, and I want to compare their lines pairwise.

Most of the comparison results are filtered out by a function (we do not output the whole Cartesian product, only a small part of it).

I guess one good way is to pass one block from dataset1 and one block from dataset2 to each mapper, and let the mapper compute the product in memory to avoid I/O.

Any suggestions?
Thank you very much.

Regards,
Zheyi Rong
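
In symbols, Ajay's corrected accounting for the replicated join, with m
records in dataset1 and n in dataset2 (tuple counts, not bytes):

\[
\text{read I/O: } m + n, \qquad \text{map output} = \text{reduce input: } m + m n,
\]

versus roughly \(m n\) pair reads (and no shuffle) for zheyi's pairing
InputFormat. With \(m = n = 2 \times 10^{7}\), the \(m n\) term dominates
either way, which is Ted Dunning's point about pruning.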


Re: Cartesian product in hadoop

Posted by zheyi rong <zh...@gmail.com>.
Thank you, now I get your point.

But I wonder whether this approach would be slower than implementing a
custom InputFormat which, each time, provides a pair of lines to the
mappers, with the product then done in the mappers?

Since your approach would need (m + n + m*n) I/O on the mapper side and
(2*m*n) I/O on the reducer side, while with a custom InputFormat the I/O
is (m*n).

I am asking because I have implemented the custom InputFormat, but the
running time is still intolerable on our small cluster.


Regards,
Zheyi Rong


On Thu, Apr 18, 2013 at 1:45 PM, Ajay Srivastava <Ajay.Srivastava@guavus.com
> wrote:

>  Yes, that's a crucial part.
>
>  Write a class which extends WritableComparator and overrides the compare
> method.
> You need to set this class in the job client as
> job.setGroupingComparatorClass(<your grouping comparator class>).
>
>  This will make sure that records having the same Ki are grouped together
> and go to the same invocation of reduce.
> I forgot to mention in my previous post that you also need a partitioner
> which partitions the data on the first part of the key.
>
>
>  Regards,
> Ajay Srivastava
>
>
>  On 18-Apr-2013, at 4:42 PM, zheyi rong wrote:
>
>  Hi Ajay Srivastava,
>
>  Thank you for your reply.
>
>  Could you please explain a little bit more about "Write a grouping
> comparator which groups records on the first part of the key, i.e. Ki"?
> I guess it is a crucial part, which could filter some pairs before passing
> them to the reducer.
>
>
>  Regards,
> Zheyi Rong
>
>
> On Thu, Apr 18, 2013 at 12:50 PM, Ajay Srivastava <
> Ajay.Srivastava@guavus.com> wrote:
>
>>  Hi Rong,
>> You can use the following simple method.
>>
>>  Let's say dataset1 has m records, and when you emit these records from
>> the mapper the keys are K1, K2, ..., Km, one per record. Also add an
>> identifier for the dataset from which each record is emitted.
>> So if R1 is a record in dataset1, the mapper will emit key (K1,
>> DATASET1) and value R1.
>>
>>  For dataset2, which has n records, emit m records for each record, with
>> keys K1, K2, ..., Km and the identifier DATASET2.
>> So if R1' is a record from dataset2, emit m records with key (Ki,
>> DATASET2) and value R1', where i runs from 1 to m.
>>
>>  Write a grouping comparator which groups records on the first part of
>> the key, i.e. Ki.
>>
>>  In the reducer, each invocation of reduce will see one record from
>> dataset1 and n records from dataset2. Compute the Cartesian product,
>> apply the filter, and then output.
>>
>>  Note -- you may not know the keys (K1, K2, ..., Km) beforehand. If so,
>> you need one more pass over dataset1 to identify the keys and store them
>> for use with dataset2.
>>
>>
>>  Regards,
>> Ajay Srivastava
>>
>>
>>  On 18-Apr-2013, at 3:51 PM, Azuryy Yu wrote:
>>
>>  This is not suitable for his large dataset.
>>
>> --Send from my Sony mobile.
>> On Apr 18, 2013 5:58 PM, "Jagat Singh" <ja...@gmail.com> wrote:
>>
>>>  Hi,
>>>
>>> Can you have a look at
>>>
>>> http://pig.apache.org/docs/r0.11.1/basic.html#cross
>>>
>>>  Thanks
>>>
>>>
>>> On Thu, Apr 18, 2013 at 7:47 PM, zheyi rong <zh...@gmail.com> wrote:
>>>
>>>> Dear all,
>>>>
>>>>  I am writing to ask for ideas on computing the Cartesian product of
>>>> two datasets in Hadoop. Specifically, I have two datasets, each
>>>> containing 20 million lines, and I want to compare their lines
>>>> pairwise.
>>>>
>>>>  Most of the comparison results are filtered out by a function (we do
>>>> not output the whole Cartesian product, only a small part of it).
>>>>
>>>>  I guess one good way is to pass one block from dataset1 and one block
>>>> from dataset2 to each mapper, and let the mapper compute the product in
>>>> memory to avoid I/O.
>>>>
>>>>  Any suggestions?
>>>> Thank you very much.
>>>>
>>>>  Regards,
>>>> Zheyi Rong
>>>>
>>>
>>>
>>
>
>
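
For comparison, zheyi's in-memory block idea from the original question can
also be sketched without a custom InputFormat, as a map-only block nested
loop join: run the job over dataset1 so each map task owns one split,
buffer that split, and stream dataset2 past it. The configuration key, the
single-file dataset2 layout, and the keep() filter below are assumptions:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-only block nested loop join: the job's input is dataset1, so each map
// task receives one split (block) of it; cleanup() then streams dataset2
// once and compares it against the buffered block entirely in memory.
public class BlockProductMapper
    extends Mapper<LongWritable, Text, Text, NullWritable> {

  private final List<String> block = new ArrayList<>();

  @Override
  protected void map(LongWritable off, Text line, Context ctx) {
    block.add(line.toString());              // buffer this split of dataset1
  }

  @Override
  protected void cleanup(Context ctx) throws IOException, InterruptedException {
    // Assumed config key and a single dataset2 file; both are illustrative.
    Path d2 = new Path(ctx.getConfiguration().get("cross.dataset2.path"));
    FileSystem fs = d2.getFileSystem(ctx.getConfiguration());
    try (BufferedReader r = new BufferedReader(new InputStreamReader(fs.open(d2)))) {
      String right;
      while ((right = r.readLine()) != null) {
        for (String left : block) {          // in-memory product for the block
          if (keep(left, right)) {           // the thread's filter (placeholder)
            ctx.write(new Text(left + "\t" + right), NullWritable.get());
          }
        }
      }
    }
  }

  private static boolean keep(String a, String b) { return false; }
}

Each map task re-reads dataset2 in full, so total read I/O is m plus
(number of map tasks) * n lines; nothing is shuffled, but as discussed
above the m*n comparisons themselves still dominate.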


Re: Cartesian product in hadoop

Posted by Ajay Srivastava <Aj...@guavus.com>.
Yes, that's a crucial part.

Write a class which extends WritableComparator and overrides the compare method.
You need to set this class in the job client:
job.setGroupingComparatorClass(<your grouping comparator class>);

This will make sure that records having the same Ki are grouped together and go to the same invocation of reduce.
I forgot to mention in my previous post that you should also write a partitioner which partitions data on the first part of the key.
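
As a minimal sketch (assuming, for illustration only, that the composite
key is serialized as a single Text of the form "Ki|DATASETn"; your real key
class may differ), the grouping comparator and the companion partitioner
could look like this:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Partitioner;

// Compares only the Ki part, so (Ki, DATASET1) and (Ki, DATASET2)
// fall into the same reduce call.
public class KiGroupingComparator extends WritableComparator {
  public KiGroupingComparator() {
    super(Text.class, true); // true: create key instances for deserialization
  }

  @Override
  public int compare(WritableComparable a, WritableComparable b) {
    String ka = ((Text) a).toString().split("\\|", 2)[0]; // first part: Ki
    String kb = ((Text) b).toString().split("\\|", 2)[0];
    return ka.compareTo(kb);
  }
}

// Partitions on Ki only, so all records sharing a Ki reach the same
// reducer regardless of their dataset tag.
class KiPartitioner extends Partitioner<Text, Text> {
  @Override
  public int getPartition(Text key, Text value, int numPartitions) {
    String ki = key.toString().split("\\|", 2)[0];
    return (ki.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}

Then, in the driver:

job.setGroupingComparatorClass(KiGroupingComparator.class);
job.setPartitionerClass(KiPartitioner.class);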


Regards,
Ajay Srivastava


On 18-Apr-2013, at 4:42 PM, zheyi rong wrote:

Hi Ajay Srivastava,

Thank your for your reply.

Could you please explain a little bit more on "Write a grouping comparator which group records on first part of key i.e. Ki."  ?
I guess it is a crucial part, which could filter some pairs before passing them to the reducer.


Regards,
Zheyi Rong


On Thu, Apr 18, 2013 at 12:50 PM, Ajay Srivastava <Aj...@guavus.com>> wrote:
Hi Rong,
You can use following simple method.

Lets say dataset1 has m records and when you emit these records from mapper, keys are K1,K2 ….., Km for each respective record. Also add an identifier to identify dataset from where records is being emitted.
So if R1 is a record in dataset1, the mapper will emit key as (K1, DATASET1) and value as R1.

For dataset2 having n records, emit m records for each record with keys K1, K2, …., Km and identifier as DATASET2.
So if R1' is a record from dataset2, emit m records with key as  (Ki, DATASET2) and value R1' where i is from 1 to m.


Write a grouping comparator which group records on first part of key i.e. Ki.

In reducer, for each iteration of reduce there will be one record from dataset1 and n records from dataset2. Get the cartesian product, apply filter and then output.


Note -- You may not know keys (K1, K2, … , Km) before hand. If yes, then you need one more pass of dataset1 to identify the keys and store it to use for dataset2.


Regards,
Ajay Srivastava


On 18-Apr-2013, at 3:51 PM, Azuryy Yu wrote:


This is not suitable for his large dataset.

--Send from my Sony mobile.

On Apr 18, 2013 5:58 PM, "Jagat Singh" <ja...@gmail.com>> wrote:
Hi,

Can you have a look at

http://pig.apache.org/docs/r0.11.1/basic.html#cross

Thanks


On Thu, Apr 18, 2013 at 7:47 PM, zheyi rong <zh...@gmail.com>> wrote:
Dear all,

I am writing to kindly ask for ideas of doing cartesian product in hadoop.
Specifically, now I have two datasets, each of which contains 20million lines.
I want to do cartesian product on these two datasets, comparing lines pairwisely.

The output of each comparison can be mostly filtered by a function ( we do not output the
whole result of this cartesian product, but only a small part).

I guess one good way is to pass one block from dataset1 and another block from dataset2
to a mapper, then let the mappers do the product in memory to avoid IO.

Any suggestions?
Thank you very much.

Regards,
Zheyi Rong





Re: Cartesian product in hadoop

Posted by zheyi rong <zh...@gmail.com>.
Hi Ajay Srivastava,

Thank you for your reply.

Could you please explain a little bit more about "Write a grouping
comparator which group records on first part of key i.e. Ki."?
I guess it is a crucial part, which could filter out some pairs before
they are passed to the reducer.


Regards,
Zheyi Rong


On Thu, Apr 18, 2013 at 12:50 PM, Ajay Srivastava <
Ajay.Srivastava@guavus.com> wrote:

>  Hi Rong,
> You can use following simple method.
>
>  Lets say dataset1 has m records and when you emit these records from
> mapper, keys are K1,K2 ….., Km for each respective record. Also add an
> identifier to identify dataset from where records is being emitted.
> So if R1 is a record in dataset1, the mapper will emit key as (K1,
> DATASET1) and value as R1.
>
>  For dataset2 having n records, emit m records for each record with keys
> K1, K2, …., Km and identifier as DATASET2.
> So if R1' is a record from dataset2, emit m records with key as  (Ki,
> DATASET2) and value R1' where i is from 1 to m.
>
>
>  Write a grouping comparator which group records on first part of key
> i.e. Ki.
>
>  In reducer, for each iteration of reduce there will be one record from
> dataset1 and n records from dataset2. Get the cartesian product, apply
> filter and then output.
>
>
>  Note -- You may not know keys (K1, K2, … , Km) before hand. If yes, then
> you need one more pass of dataset1 to identify the keys and store it to use
> for dataset2.
>
>
>  Regards,
> Ajay Srivastava
>
>
>  On 18-Apr-2013, at 3:51 PM, Azuryy Yu wrote:
>
>  This is not suitable for his large dataset.
>
> --Send from my Sony mobile.
> On Apr 18, 2013 5:58 PM, "Jagat Singh" <ja...@gmail.com> wrote:
>
>>  Hi,
>>
>> Can you have a look at
>>
>> http://pig.apache.org/docs/r0.11.1/basic.html#cross
>>
>>  Thanks
>>
>>
>> On Thu, Apr 18, 2013 at 7:47 PM, zheyi rong <zh...@gmail.com> wrote:
>>
>>> Dear all,
>>>
>>>  I am writing to kindly ask for ideas of doing cartesian product in
>>> hadoop.
>>> Specifically, now I have two datasets, each of which contains 20million
>>> lines.
>>> I want to do cartesian product on these two datasets, comparing lines
>>> pairwisely.
>>>
>>>  The output of each comparison can be mostly filtered by a function (
>>> we do not output the
>>> whole result of this cartesian product, but only a small part).
>>>
>>>  I guess one good way is to pass one block from dataset1 and another
>>> block from dataset2
>>> to a mapper, then let the mappers do the product in memory to avoid IO.
>>>
>>>  Any suggestions?
>>> Thank you very much.
>>>
>>>  Regards,
>>> Zheyi Rong
>>>
>>
>>
>

Re: Cartesian product in hadoop

Posted by Ajay Srivastava <Aj...@guavus.com>.
Hi Rong,
You can use the following simple method.

Let's say dataset1 has m records, and when you emit these records from the mapper, the keys are K1, K2, ..., Km for the respective records. Also add an identifier that tells which dataset each record is being emitted from.
So if R1 is a record in dataset1, the mapper will emit the key (K1, DATASET1) with value R1.

For dataset2, which has n records, emit m records for each record, with keys K1, K2, ..., Km and the identifier DATASET2.
So if R1' is a record from dataset2, emit m records with key (Ki, DATASET2) and value R1', where i runs from 1 to m.


Write a grouping comparator which groups records on the first part of the key, i.e. Ki.

In the reducer, each invocation of reduce will see one record from dataset1 and n records from dataset2. Form the cartesian product, apply the filter, and then output.


Note -- You may not know the keys (K1, K2, ..., Km) beforehand. If so, you need one extra pass over dataset1 to collect the keys and store them for use with dataset2.
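
A hedged end-to-end sketch of this scheme in Java (the "Ki|TAG" key
encoding, the class names, and shipping the collected keys through the job
configuration under "dataset1.keys" are all illustrative assumptions, not
part of the recipe above; the dataset1 mapper, which simply emits each
record once under "Ki|DATASET1", is omitted):

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper for dataset2: replicate each record once per dataset1 key Ki.
public class Dataset2Mapper extends Mapper<LongWritable, Text, Text, Text> {
  private String[] keys;

  @Override
  protected void setup(Context context) {
    // Keys K1..Km collected in the extra pass, shipped as a comma-separated list.
    keys = context.getConfiguration().get("dataset1.keys").split(",");
  }

  @Override
  protected void map(LongWritable offset, Text record, Context context)
      throws IOException, InterruptedException {
    for (String ki : keys) {
      context.write(new Text(ki + "|DATASET2"), record); // tag with dataset id
    }
  }
}

// Reducer: with grouping on Ki and the full key sorting DATASET1 before
// DATASET2, the single dataset1 record arrives first in each group.
class CrossReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    Iterator<Text> it = values.iterator();
    String left = it.next().toString();      // the one dataset1 record
    while (it.hasNext()) {
      String right = it.next().toString();   // one dataset2 record at a time
      if (keep(left, right)) {               // stand-in for the filter function
        context.write(new Text(left), new Text(right));
      }
    }
  }

  // Placeholder filter; the real one is whatever comparison you need.
  private boolean keep(String a, String b) { return !a.equals(b); }
}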


Regards,
Ajay Srivastava


On 18-Apr-2013, at 3:51 PM, Azuryy Yu wrote:


This is not suitable for his large dataset.

--Send from my Sony mobile.

On Apr 18, 2013 5:58 PM, "Jagat Singh" <ja...@gmail.com>> wrote:
Hi,

Can you have a look at

http://pig.apache.org/docs/r0.11.1/basic.html#cross

Thanks


On Thu, Apr 18, 2013 at 7:47 PM, zheyi rong <zh...@gmail.com>> wrote:
Dear all,

I am writing to kindly ask for ideas of doing cartesian product in hadoop.
Specifically, now I have two datasets, each of which contains 20million lines.
I want to do cartesian product on these two datasets, comparing lines pairwisely.

The output of each comparison can be mostly filtered by a function ( we do not output the
whole result of this cartesian product, but only a small part).

I guess one good way is to pass one block from dataset1 and another block from dataset2
to a mapper, then let the mappers do the product in memory to avoid IO.

Any suggestions?
Thank you very much.

Regards,
Zheyi Rong



Re: Cartesian product in hadoop

Posted by Azuryy Yu <az...@gmail.com>.
This is not suitable for his large dataset.

--Send from my Sony mobile.
On Apr 18, 2013 5:58 PM, "Jagat Singh" <ja...@gmail.com> wrote:

> Hi,
>
> Can you have a look at
>
> http://pig.apache.org/docs/r0.11.1/basic.html#cross
>
> Thanks
>
>
> On Thu, Apr 18, 2013 at 7:47 PM, zheyi rong <zh...@gmail.com> wrote:
>
>> Dear all,
>>
>> I am writing to kindly ask for ideas of doing cartesian product in hadoop.
>> Specifically, now I have two datasets, each of which contains 20million
>> lines.
>> I want to do cartesian product on these two datasets, comparing lines
>> pairwisely.
>>
>> The output of each comparison can be mostly filtered by a function ( we
>> do not output the
>> whole result of this cartesian product, but only a small part).
>>
>> I guess one good way is to pass one block from dataset1 and another block
>> from dataset2
>> to a mapper, then let the mappers do the product in memory to avoid IO.
>>
>> Any suggestions?
>> Thank you very much.
>>
>> Regards,
>> Zheyi Rong
>>
>
>

Re: Cartesian product in hadoop

Posted by Jagat Singh <ja...@gmail.com>.
Hi,

Can you have a look at

http://pig.apache.org/docs/r0.11.1/basic.html#cross

Thanks


On Thu, Apr 18, 2013 at 7:47 PM, zheyi rong <zh...@gmail.com> wrote:

> Dear all,
>
> I am writing to kindly ask for ideas of doing cartesian product in hadoop.
> Specifically, now I have two datasets, each of which contains 20million
> lines.
> I want to do cartesian product on these two datasets, comparing lines
> pairwisely.
>
> The output of each comparison can be mostly filtered by a function ( we do
> not output the
> whole result of this cartesian product, but only a small part).
>
> I guess one good way is to pass one block from dataset1 and another block
> from dataset2
> to a mapper, then let the mappers do the product in memory to avoid IO.
>
> Any suggestions?
> Thank you very much.
>
> Regards,
> Zheyi Rong
>

Re: Cartesian product in hadoop

Posted by Ted Dunning <td...@maprtech.com>.
It is rarely practical to do exhaustive comparisons on datasets of this
size.

The method used is to heuristically prune the cartesian product set and
only examine pairs that have a high likelihood of being near.

This can be done in many ways.  Your suggestion of doing a map-side join is
a reasonable one, but it will be much slower than methods where you can
prune the comparisons.
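
For illustration only (this is one example of the "many ways", not a
prescription): a crude blocking pass, where each record is keyed by a
coarse signature and pairs are later formed only within a block. The
blockOf() function here is a hypothetical stand-in; realistic choices
include sorted-token prefixes, canopies, or a locality-sensitive hash.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (block key, record); downstream, candidate pairs are formed only
// within a block, shrinking the m*n product to a sum of per-block products.
public class BlockingMapper extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void map(LongWritable offset, Text record, Context context)
      throws IOException, InterruptedException {
    context.write(new Text(blockOf(record.toString())), record);
  }

  // Hypothetical signature function: here, just a four-character prefix.
  private String blockOf(String line) {
    return line.length() < 4 ? line : line.substring(0, 4);
  }
}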



On Thu, Apr 18, 2013 at 9:47 AM, zheyi rong <zh...@gmail.com> wrote:

> Dear all,
>
> I am writing to kindly ask for ideas of doing cartesian product in hadoop.
> Specifically, now I have two datasets, each of which contains 20million
> lines.
> I want to do cartesian product on these two datasets, comparing lines
> pairwisely.
>
> The output of each comparison can be mostly filtered by a function ( we do
> not output the
> whole result of this cartesian product, but only a small part).
>
> I guess one good way is to pass one block from dataset1 and another block
> from dataset2
> to a mapper, then let the mappers do the product in memory to avoid IO.
>
> Any suggestions?
> Thank you very much.
>
> Regards,
> Zheyi Rong
>
