Posted to user@pig.apache.org by Anze <an...@volja.net> on 2010/11/04 10:42:03 UTC

dissecting tuples

Hi all!

I have a problem that I can't find a solution to... Hope someone can shed some 
light. :)

-----
grunt> dump X;
(1,a-b-c)
(2,d-a)
(3,c)
-----
(where $1 is a chararray)

I would like to generate this relation from it:
-----
(1,a)
(1,b)
(1,c)
(2,d)
(2,a)
(3,c)
-----

Can this be done?

More background: what I would actually like to do is an inner join on two 
relations, one as specified above (relation X) and the other with 
'a','b','c','d'... as its key values:
-----
grunt> dump Y;
(a,aaa)
(b,bbb)
(c,ccc)
...
-----
So this is the end result I am looking for:
(1,a,a,aaa)
(1,b,b,bbb)
(1,c,c,ccc)
(2,d,d,ddd)
(2,a,a,aaa)
(3,c,c,ccc)

One idea: I could make a cross join and keep only records (by using filter + 
matches) where Y.$0 is contained in X.$1. But that seems very inefficient to 
me. Is there a better way?

Thanks for any pointers,

Anze

Re: dissecting tuples

Posted by Anze <an...@volja.net>.

Doh... just found TOKENIZE() a minute after posting... :(

---
grunt> dump X;
(1,a-b-c)
(2,d-a)
(3,c)
grunt> A = foreach X generate $0, TOKENIZE(REPLACE($1,'-','*'));
grunt> B = foreach A generate $0, flatten($1);
grunt> dump B;
(1,a)
(1,b)
(1,c)
(2,d)
(2,a)
(3,c)
-----

Of course, I need to use PiggyBank because of REPLACE(), but that's OK. If 
there's a better solution, however, please let me know.

Thanks for listening and sorry for the noise - hope it helps someone else too. 
:)

Anze
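
For anyone reading this later: a possible way to skip the REPLACE() step, assuming
a Pig release whose built-in TOKENIZE accepts an optional delimiter argument (check
the function reference for your version; this is not true of every release). The
input path 'x.tsv' and the field names id, tags and tag are hypothetical and only
serve the illustration.

```pig
-- Sketch only: assumes the built-in TOKENIZE takes an optional delimiter.
-- 'x.tsv' is a hypothetical tab-separated file holding rows like: 1<TAB>a-b-c
X = LOAD 'x.tsv' AS (id:int, tags:chararray);

-- TOKENIZE returns a bag of tokens; FLATTEN emits one row per token.
B = FOREACH X GENERATE id, FLATTEN(TOKENIZE(tags, '-')) AS tag;

DUMP B;
```

If the delimiter argument is not available, the REPLACE() + TOKENIZE() combination
above remains the fallback.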


Re: dissecting tuples

Posted by "Ankur C. Goel" <ga...@yahoo-inc.com>.

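-- PiggyBank supplies the REPLACE function used below
register piggybank.jar;

-- Turn 'a-b-c' into 'a,b,c'; TOKENIZE splits on commas by default
X1 = FOREACH X GENERATE $0 as f1, org.apache.pig.piggybank.evaluation.string.REPLACE($1,'-',',') as temp;
-- FLATTEN emits one row per token
X2 = FOREACH X1 GENERATE f1, FLATTEN(TOKENIZE(temp)) as (f2);
Y2 = FOREACH Y GENERATE $0 as f1, $1 as f2;
-- Join each token against the key field of Y2
Joined = JOIN X2 BY f2, Y2 BY f1 PARALLEL <your-parallel-value>;
Final = FOREACH Joined GENERATE
            X2::f1 as f1,
            X2::f2 as f2,
            Y2::f2 as f3;
Dump Final;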

-@nkur
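
A small variation, in case the four-column shape from the original question
(1,a,a,aaa) is wanted rather than three columns: a Pig JOIN keeps the fields of
both inputs, so the join key from each side can be projected explicitly. This is
only a sketch; it reuses the X2 and Y2 relations from the script above, and the
output names id, token, key and value are made up for the illustration.

```pig
-- Sketch only: X2 (f1, f2) and Y2 (f1, f2) come from the script above.
Joined4 = JOIN X2 BY f2, Y2 BY f1;

-- Keep both copies of the join key, as in the result asked for in the question.
Final4 = FOREACH Joined4 GENERATE
             X2::f1 AS id,
             X2::f2 AS token,
             Y2::f1 AS key,
             Y2::f2 AS value;

DUMP Final4;
```

Row order in the dump is not guaranteed unless an ORDER BY is added.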
