Posted to user@pig.apache.org by paul green <be...@hotmail.com> on 2015/07/27 22:00:36 UTC

Equijoin 1 field in one dataset to 2 fields in another dataset using OR

Hello,

I use Pig at home (currently version 0.13.0) regularly on data sets that vary between tens of megabytes and tens of gigabytes. I want to be able to join two data sets together (ideally with filtering). The main problem I am having, and for which I have not found an easy solution, is this: I want to join data set 1 to data set 2 as shown below.

data1.txt

    id, name, job
    0001,john, manager
    0002,phil, deputy

data2.txt

    id1, id2, id3, label
    0001,0002,0001,useful
    0005,0001,0001,useful
    0000,0010,0009,not useful

Code proposal

    datasetA = LOAD 'data1.txt' USING PigStorage(',') AS (fieldA1, fieldA2, fieldA3);
    datasetB = LOAD 'data2.txt' USING PigStorage(',') AS (fieldB1, fieldB2, fieldB3, fieldB4);
    joined = JOIN
                 datasetA BY fieldA1,
                 datasetB BY (fieldB1 OR fieldB2 OR fieldB3);
    DUMP joined;

So essentially I want to join one column in the first data set to n columns in the second data set wherever the values are equal. I am not after a partial match but an exact one. Is there a feature already in the language to do this? If not, would it be possible to request such a feature?

Thanks.

Re: Equijoin 1 field in one dataset to 2 fields in another dataset using OR

Posted by Arvind S <ar...@gmail.com>.
Suggestion: you can create a join for each column individually and then
union the results.

http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#UNION
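
A minimal sketch of this suggestion, reusing the LOAD statements and field names from the original question. The relation names join1, join2, join3, unioned and joined are made up for illustration, and the DISTINCT step is an addition not mentioned above: a row of data2.txt whose id matches in more than one column (for example 0001,0002,0001,useful) would otherwise appear once per matching column. It also assumes the files contain no header line, or that a header line matching nothing is acceptable.

```pig
-- Load both files exactly as in the question (untyped fields default to bytearray).
datasetA = LOAD 'data1.txt' USING PigStorage(',') AS (fieldA1, fieldA2, fieldA3);
datasetB = LOAD 'data2.txt' USING PigStorage(',') AS (fieldB1, fieldB2, fieldB3, fieldB4);

-- One equijoin per candidate column in datasetB.
join1 = JOIN datasetA BY fieldA1, datasetB BY fieldB1;
join2 = JOIN datasetA BY fieldA1, datasetB BY fieldB2;
join3 = JOIN datasetA BY fieldA1, datasetB BY fieldB3;

-- All three joins produce the same schema, so they can be combined directly.
unioned = UNION join1, join2, join3;

-- Drop pairs that matched on more than one column.
joined = DISTINCT unioned;

DUMP joined;
```

With the three sample rows of data2.txt, the union holds five pairs and DISTINCT reduces them to three. Each extra candidate column costs one more join, so this approach is probably best suited to a small, fixed number of columns.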

*Cheers !!*
Arvind
