Posted to user@pig.apache.org by "KALLURI, RAJESH K (AG/1000)" <ra...@monsanto.com> on 2013/04/19 06:17:49 UTC

Effective way to cross two relations

I have a relation of about 50,000 tuples that I want to join to itself, either with the CROSS operator or something similar. I would then do a pairwise computation over half of the matrix (skipping self-comparisons and duplicate pairs).

I was wondering what the most effective way to do this is. Below is some pseudo Pig Latin.


-- About 50,000 - 70,000 entries
a = LOAD 'part-r-00000.txt' USING PigStorage()
AS (id:long,  x:int, y:int);
-- Same as a, about 50,000 - 70,000 entries
b = LOAD 'part-r-00000.txt' USING PigStorage()
AS (id:long,  x:int, y:int);

-- a join on id only matches rows with equal ids, so cross is used to form all pairs
jnd = cross a, b;
-- filter comparisons to self and duplicates from the matrix
-- end up with 50000 X (50000-1)/2 entries
filter_self = filter jnd by a::id > b::id;

raw = foreach filter_self generate a::id as id1, b::id as id2, TOBAG(a::x, b::y) as z;
-- group pairs for comparison
grpd = group raw by (id1, id2);
-- calculate similarity between id1 and id2 based on a udf
prjctd = foreach grpd generate flatten(group), UDF(raw.z);
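
To make the size of the problem concrete, here is a minimal Python sketch (not Pig, and not part of the original script) of the half-matrix idea: keeping only pairs with id1 > id2 after a full cross yields each unordered pair exactly once, for n * (n - 1) / 2 pairs in total. The function names are hypothetical and the sketch assumes the ids are unique.

```python
from itertools import combinations, product


def half_matrix_pairs(ids):
    """Yield each unordered pair once as (larger_id, smaller_id).

    Assumes ids are unique; self-pairs are never produced.
    """
    for lo, hi in combinations(sorted(ids), 2):
        yield hi, lo


def cross_then_filter(ids):
    """Mimic a full cross followed by a filter on id1 > id2."""
    ids = list(ids)
    return [(i, j) for i, j in product(ids, repeat=2) if i > j]


def pair_count(n):
    """Number of entries in the half matrix for n unique ids."""
    return n * (n - 1) // 2


if __name__ == "__main__":
    sample = [3, 1, 4, 2]
    # Both approaches produce the same set of pairs on a small sample.
    assert sorted(half_matrix_pairs(sample)) == sorted(cross_then_filter(sample))
    print(pair_count(50000))  # 1249975000
    print(pair_count(70000))  # 2449965000
```

At 50,000 rows the filtered result is already about 1.25 billion pairs, and the unfiltered cross is twice that plus the self-pairs, so the cost is dominated by the cross itself rather than by the later grouping.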

This e-mail message may contain privileged and/or confidential information, and is intended to be received only by persons entitled
to receive such information. If you have received this e-mail in error, please notify the sender immediately. Please delete it and
all attachments from any servers, hard drives or any other media. Other use of this e-mail by you is strictly prohibited.

All e-mails and attachments sent and received are subject to monitoring, reading and archival by Monsanto, including its
subsidiaries. The recipient of this e-mail is solely responsible for checking for the presence of "Viruses" or other "Malware".
Monsanto, along with its subsidiaries, accepts no liability for any damage caused by any such code transmitted by or accompanying
this e-mail or any attachment.


The information contained in this email may be subject to the export control laws and regulations of the United States, potentially
including but not limited to the Export Administration Regulations (EAR) and sanctions regulations issued by the U.S. Department of
Treasury, Office of Foreign Asset Controls (OFAC).  As a recipient of this information you are obligated to comply with all
applicable U.S. export laws and regulations.

Re: Effective way to cross two relations

Posted by Sergey Goder <se...@gmail.com>.

I posted on this very same topic a few weeks ago and got no response. It is
still an unresolved issue for me, so if anyone has any ideas, they would be
greatly appreciated.

Interestingly enough, I ran into issues at about the same size you are
dealing with (50k rows), so I wonder whether the problem lies in how Pig
handles this. I'd recommend tuning some of the parameters that I mention in
my post (below), as it may help you complete the job.

http://search-hadoop.com/m/kJghFzruCA1/nested+cross&subj=Moving+Cross+of+Large+Data+to+be+Nested

