Posted to user@pig.apache.org by Thomas Bach <th...@students.uni-mainz.de> on 2012/12/14 11:11:43 UTC
Join Multiple Relations by Different Fields
Hi,
Say I have three files `data1`, `data2` and `assoc`:
$ cat data1
key1,foo
key2,bar
$ cat data2
key3,braz
key4,froz
$ cat assoc
key1,key3
key2,key4
I load these files via
$ pig -b -p debug=WARN -x local
Warning: $HADOOP_HOME is deprecated.
Apache Pig version 0.10.0 (r1328203) compiled Apr 19 2012, 22:54:12
Logging error messages to: /home/vince/tmp/pig_1355407390166.log
Connecting to hadoop file system at: file:///
grunt> data1 = load 'data1' as (key: chararray, val: chararray);
grunt> data2 = load 'data2' as (key: chararray, val: chararray);
grunt> assoc = load 'assoc' as (key1: chararray, key2: chararray);
What I want is a relation that looks like:
(foo, braz)
(bar, froz)
That is
data1_val, data1_key <-> assoc_key1, assoc_key2 <-> data2_key, data2_val
So my first idea was to join data1 with assoc and then join the
resulting relation with data2. However, running
A = join data1 by key, assoc by key1;
dump A;
doesn't yield any results. Is this a bug, or am I doing something
conceptually wrong?
Regards,
Thomas Bach.
Re: Join Multiple Relations by Different Fields
Posted by Jonathan Coveney <jc...@gmail.com>.
It's a little confusing, but the following is a tuple: (key1,foo,)
It's just not the tuple you want. It is a tuple where the first field is
"key1,foo" and the second field is null. The printing makes this ambiguous.
2012/12/14 Thomas Bach <th...@students.uni-mainz.de>
> (key1,foo,)
Re: Join Multiple Relations by Different Fields
Posted by Thomas Bach <th...@students.uni-mainz.de>.
Hi all,
I got a hint via StackOverflow[1]: the problem was the missing
delimiter definition.
On Fri, Dec 14, 2012 at 11:11:43AM +0100, Thomas Bach wrote:
> grunt> data1 = load 'data1' as (key: chararray, val: chararray);
> grunt> data2 = load 'data2' as (key: chararray, val: chararray);
> grunt> assoc = load 'assoc' as (key1: chararray, key2: chararray);
This should read:
data1 = load 'data1' using PigStorage(',') as (key: chararray, val: chararray);
data2 = load 'data2' using PigStorage(',') as (key: chararray, val: chararray);
assoc = load 'assoc' using PigStorage(',') as (key1: chararray, key2: chararray);
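With the delimiter set, the two-step join described in the original question can be sketched as follows. This is only an illustration, not a tested script: the aliases `B` and `result` and the output names `data1_val` and `data2_val` are made up here, and the final projection uses positional references so that it does not depend on how the nested field names are prefixed after the second join.

```pig
-- Load all three files with an explicit comma delimiter
data1 = load 'data1' using PigStorage(',') as (key: chararray, val: chararray);
data2 = load 'data2' using PigStorage(',') as (key: chararray, val: chararray);
assoc = load 'assoc' using PigStorage(',') as (key1: chararray, key2: chararray);

-- Step 1: attach the association row to each data1 row
-- A has four fields: data1::key, data1::val, assoc::key1, assoc::key2
A = join data1 by key, assoc by key1;

-- Step 2: follow the second key of the association into data2
-- B has six fields: the four from A, then data2::key, data2::val
B = join A by assoc::key2, data2 by key;

-- Keep only the two value columns ($1 is data1's val, $5 is data2's val)
result = foreach B generate $1 as data1_val, $5 as data2_val;
dump result;
```

For the sample files above, the dump should contain (foo,braz) and (bar,froz); the order of the rows is not guaranteed.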
I got confused because the original statement yielded
grunt> dump data1;
(key1,foo,)
(key2,bar,)
So I took for granted that this is a tuple…
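The dump output alone cannot show where one field ends and the next begins. A minimal sketch for checking this in the grunt shell, assuming the original load without a delimiter (PigStorage then splits on tabs by default); the aliases `first_field` and `null_val` are hypothetical names chosen for illustration:

```pig
-- Original load: no delimiter given, so the default PigStorage splits on tabs
data1 = load 'data1' as (key: chararray, val: chararray);

-- Project only the first field; if the comma shows up here, it is part of the field
first_field = foreach data1 generate key;
dump first_field;

-- Keep rows whose second field is null; with the load above this should be every row
null_val = filter data1 by val is null;
dump null_val;
```

If the first dump prints (key1,foo) and (key2,bar), the whole line went into `key`. That would also explain the empty join: `key1,foo` in data1 never equals `key1,key3` in assoc.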
Sorry for the noise,
Thomas Bach.
Footnotes:
[1] http://stackoverflow.com/questions/13861570/join-multiple-relations-by-different-fields