Posted to user@pig.apache.org by Anze <an...@volja.net> on 2010/11/04 10:42:03 UTC
dissecting tuples
Hi all!
I have a problem that I can't find a solution to... Hope someone can shed some
light. :)
-----
grunt> dump X;
(1,a-b-c)
(2,d-a)
(3,c)
-----
(where $1 is a chararray)
I would like to generate this relation from it:
-----
(1,a)
(1,b)
(1,c)
(2,d)
(2,a)
(3,c)
-----
Can this be done?
More background: what I would actually like to do is inner join on two
relations, one as specified above (relation X) and the other that has
'a','b','c','d'... as values:
-----
grunt> dump Y;
(a,aaa)
(b,bbb)
(c,ccc)
...
-----
So this is the end result I am looking for:
(1,a,a,aaa)
(1,b,b,bbb)
(1,c,c,ccc)
(2,d,d,ddd)
(2,a,a,aaa)
(3,c,c,ccc)
One idea: I could make a cross join and keep only records (by using filter +
matches) where Y.$0 is contained in X.$1. But that seems very inefficient to
me. Is there a better way?
Thanks for any pointers,
Anze
Re: dissecting tuples
Posted by Anze <an...@volja.net>.
Doh... just found TOKENIZE() a minute after posting... :(
---
grunt> dump X;
(1,a-b-c)
(2,d-a)
(3,c)
grunt> A = foreach X generate $0, TOKENIZE(REPLACE($1,'-','*'));
grunt> B = foreach A generate $0, flatten($1);
grunt> dump B;
(1,a)
(1,b)
(1,c)
(2,d)
(2,a)
(3,c)
-----
Of course, I need to use PiggyBank because of REPLACE(), but that's ok. If
there's a better solution, however, please let me know.
Thanks for listening and sorry for the noise - hope it helps someone else too.
:)
Anze
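On the question of a better solution, one possible variant skips the REPLACE() step entirely. This is only a sketch, and it assumes a newer Pig release in which the built-in TOKENIZE accepts an optional delimiter argument. That is an assumption about the installed version, not something stated in this thread. X is the relation from the post above.

```pig
-- Sketch only. Assumes a Pig release where the built-in TOKENIZE takes an
-- optional second argument naming the delimiter; X is the relation above.
B = FOREACH X GENERATE $0, FLATTEN(TOKENIZE($1, '-'));
DUMP B;
```

If the installed TOKENIZE does not take a delimiter, the REPLACE() approach above remains the fallback.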
Re: dissecting tuples
Posted by "Ankur C. Goel" <ga...@yahoo-inc.com>.
register piggybank.jar;
-- Turn '-' into ',' so that TOKENIZE will split on it.
X1 = FOREACH X GENERATE $0 as f1, org.apache.pig.piggybank.evaluation.string.REPLACE($1,'-',',') as temp;
-- One output row per token.
X2 = FOREACH X1 GENERATE f1, FLATTEN(TOKENIZE(temp)) as (f2);
Y2 = FOREACH Y GENERATE $0 as f1, $1 as f2;
-- Join each token against the lookup relation.
Joined = JOIN X2 BY f2, Y2 BY f1 PARALLEL <your-parallel-value>;
Final = FOREACH Joined GENERATE
    X2::f1 as f1,
    X2::f2 as f2,
    Y2::f2 as f3;
Dump Final;
-@nkur
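The script above yields three fields per row, whereas the original question asked for four, for example (1,a,a,aaa). A minimal sketch of a variant that keeps both join keys might look like the following. It assumes piggybank.jar is on the path as in the reply above, that X and Y are the relations from the original post, and that the alias name Final4 is made up for illustration.

```pig
-- Sketch only: same split-and-join idea, but projecting both join keys
-- so each row has four fields. Final4 is a hypothetical alias name.
register piggybank.jar;
X2 = FOREACH X GENERATE $0 AS f1,
     FLATTEN(TOKENIZE(org.apache.pig.piggybank.evaluation.string.REPLACE($1, '-', ','))) AS f2;
Y2 = FOREACH Y GENERATE $0 AS f1, $1 AS f2;
Joined = JOIN X2 BY f2, Y2 BY f1;
Final4 = FOREACH Joined GENERATE X2::f1, X2::f2, Y2::f1, Y2::f2;
DUMP Final4;
```

Because this is an inner join, tokens with no matching row in Y are dropped from the result.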