You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@pig.apache.org by Christian <ch...@gmail.com> on 2010/06/12 15:20:44 UTC

How to find all possible permutations from a bag

Hello, this is my first contact with Pig and its community ;-)

I need to generate all the possible permutations from a bag.

Let me explain it with examples:

A = LOAD 'data' AS f1:chararray;

DUMP A;
('A')
('B')
('C')

I can have all the possible combinations easily with CROSS:

B = FOREACH A GENERATE $0 AS v1;
C = FOREACH A GENERATE $0 AS v2;

D = CROSS B, C;
DUMP D;
('A', 'A')
('A', 'B')
('A', 'C')
('B', 'A')
('B', 'B')
('B', 'C')
('C', 'A')
('C', 'B')
('C', 'C')

But what I need are the permutations. The result I want to obtain is
something like:

DUMP R;
('A', 'A')
('A', 'B')
('A', 'C')
('B', 'B')
('B', 'C')
('C', 'C')

My first idea to solve that was to generate de CROSS and then FILTER like:

R = FILTER D BY $0 < $1;

It works but I would like to know if there is a better way to do this
without having to use string comparison and assume that only one field is
used. For example a real scenario would look like:

DUMP A;
('A1', 'A2')
('B1', 'B2')

DUMP R;
('A1', 'A2', 'A1', 'A2')
('A1', 'A2', 'B1', 'B2')
('B1', 'B2', 'B1', 'B2')

Thank you in advance.
Christian

Re: How to find all possible permutations from a bag

Posted by hc busy <hc...@gmail.com>.
heh, I want n*(n-1)/2 too... Maybe someone out there has an UDF that does
this after a group.

:)

On Wed, Jun 16, 2010 at 8:30 AM, Christian <ch...@gmail.com> wrote:

> Thanks hc busy,
>
> E = foreach D generate (v1<v2?v1:v2) as v1, (v1<v2?v2:v1) as v2;
>
> F = distinct E;
>
>
> Interesting, I didn't think about that.
>
> However, I think I can see a problem with this as well. If all 'A's are not
> > distinct, then you might need to generate unique Id for each row
> >
>
> Luckly this is not my case. I'm only correlating different terms.
>
> In any case, what I see is that the CROSS is too much expensive, I don't
> know if the following filtering to the CROSS is coupled to the same MR or
> first all the crosses are generated and after that in other MR the data is
> filtered. It has big implications in the performance, the CROSS generates
> n^2 tuples but the permutations that I want after the filtering are n * (n
> -
> 1) / 2. I think I should expose this in more detail in other thread later
> on, I'm having some problems with JOIN's too because of that.
>
> Thanks again.
>
>
> >
> > On Sat, Jun 12, 2010 at 6:20 AM, Christian <ch...@gmail.com> wrote:
> >
> > > Hello, this is my first contact with Pig and its community ;-)
> > >
> > > I need to generate all the possible permutations from a bag.
> > >
> > > Let me explain it with examples:
> > >
> > > A = LOAD 'data' AS f1:chararray;
> > >
> > > DUMP A;
> > > ('A')
> > > ('B')
> > > ('C')
> > >
> > > I can have all the possible combinations easily with CROSS:
> > >
> > > B = FOREACH A GENERATE $0 AS v1;
> > > C = FOREACH A GENERATE $0 AS v2;
> > >
> > > D = CROSS B, C;
> > > DUMP D;
> > > ('A', 'A')
> > > ('A', 'B')
> > > ('A', 'C')
> > > ('B', 'A')
> > > ('B', 'B')
> > > ('B', 'C')
> > > ('C', 'A')
> > > ('C', 'B')
> > > ('C', 'C')
> > >
> > > But what I need are the permutations. The result I want to obtain is
> > > something like:
> > >
> > > DUMP R;
> > > ('A', 'A')
> > > ('A', 'B')
> > > ('A', 'C')
> > > ('B', 'B')
> > > ('B', 'C')
> > > ('C', 'C')
> > >
> > > My first idea to solve that was to generate de CROSS and then FILTER
> > like:
> > >
> > > R = FILTER D BY $0 < $1;
> > >
> > > It works but I would like to know if there is a better way to do this
> > > without having to use string comparison and assume that only one field
> is
> > > used. For example a real scenario would look like:
> > >
> > > DUMP A;
> > > ('A1', 'A2')
> > > ('B1', 'B2')
> > >
> > > DUMP R;
> > > ('A1', 'A2', 'A1', 'A2')
> > > ('A1', 'A2', 'B1', 'B2')
> > > ('B1', 'B2', 'B1', 'B2')
> > >
> > > Thank you in advance.
> > > Christian
> > >
> >
>

Re: How to find all possible permutations from a bag

Posted by Christian <ch...@gmail.com>.
Thanks hc busy,

E = foreach D generate (v1<v2?v1:v2) as v1, (v1<v2?v2:v1) as v2;

F = distinct E;


Interesting, I didn't think about that.

However, I think I can see a problem with this as well. If all 'A's are not
> distinct, then you might need to generate unique Id for each row
>

Luckly this is not my case. I'm only correlating different terms.

In any case, what I see is that the CROSS is too much expensive, I don't
know if the following filtering to the CROSS is coupled to the same MR or
first all the crosses are generated and after that in other MR the data is
filtered. It has big implications in the performance, the CROSS generates
n^2 tuples but the permutations that I want after the filtering are n * (n -
1) / 2. I think I should expose this in more detail in other thread later
on, I'm having some problems with JOIN's too because of that.

Thanks again.


>
> On Sat, Jun 12, 2010 at 6:20 AM, Christian <ch...@gmail.com> wrote:
>
> > Hello, this is my first contact with Pig and its community ;-)
> >
> > I need to generate all the possible permutations from a bag.
> >
> > Let me explain it with examples:
> >
> > A = LOAD 'data' AS f1:chararray;
> >
> > DUMP A;
> > ('A')
> > ('B')
> > ('C')
> >
> > I can have all the possible combinations easily with CROSS:
> >
> > B = FOREACH A GENERATE $0 AS v1;
> > C = FOREACH A GENERATE $0 AS v2;
> >
> > D = CROSS B, C;
> > DUMP D;
> > ('A', 'A')
> > ('A', 'B')
> > ('A', 'C')
> > ('B', 'A')
> > ('B', 'B')
> > ('B', 'C')
> > ('C', 'A')
> > ('C', 'B')
> > ('C', 'C')
> >
> > But what I need are the permutations. The result I want to obtain is
> > something like:
> >
> > DUMP R;
> > ('A', 'A')
> > ('A', 'B')
> > ('A', 'C')
> > ('B', 'B')
> > ('B', 'C')
> > ('C', 'C')
> >
> > My first idea to solve that was to generate de CROSS and then FILTER
> like:
> >
> > R = FILTER D BY $0 < $1;
> >
> > It works but I would like to know if there is a better way to do this
> > without having to use string comparison and assume that only one field is
> > used. For example a real scenario would look like:
> >
> > DUMP A;
> > ('A1', 'A2')
> > ('B1', 'B2')
> >
> > DUMP R;
> > ('A1', 'A2', 'A1', 'A2')
> > ('A1', 'A2', 'B1', 'B2')
> > ('B1', 'B2', 'B1', 'B2')
> >
> > Thank you in advance.
> > Christian
> >
>

Re: How to find all possible permutations from a bag

Posted by hc busy <hc...@gmail.com>.
You can't really get away from string comparisons:


B = FOREACH A GENERATE $0 AS v1;
C = FOREACH A GENERATE $0 AS v2;
D = CROSS B, C;
dump D; -- generates all permutations of the two key fields.
E = foreach D generate (v1<v2?v1:v2) as v1, (v1<v2?v2:v1) as v2;
F = distinct E;
dump F; -- results in combinations

However, I think I can see a problem with this as well. If all 'A's are not
distinct, then you might need to generate unique Id for each row


B = FOREACH A GENERATE $0 AS v1, sequential() as extra1;
C = FOREACH A GENERATE $0 AS v2, sequential() as extra2;
D = CROSS B, C;
D = filter D by extra==extra2;
E = foreach D generate (v1<v2?v1:v2) as v1, (v1<v2?v2:v1) as v2;
F = distinct E;

This gives the actual results if you are solving the combinatoric problem of
5 "A's" 6 "B's" and 7 "C's" how many combinations and permutations.



On Sat, Jun 12, 2010 at 6:20 AM, Christian <ch...@gmail.com> wrote:

> Hello, this is my first contact with Pig and its community ;-)
>
> I need to generate all the possible permutations from a bag.
>
> Let me explain it with examples:
>
> A = LOAD 'data' AS f1:chararray;
>
> DUMP A;
> ('A')
> ('B')
> ('C')
>
> I can have all the possible combinations easily with CROSS:
>
> B = FOREACH A GENERATE $0 AS v1;
> C = FOREACH A GENERATE $0 AS v2;
>
> D = CROSS B, C;
> DUMP D;
> ('A', 'A')
> ('A', 'B')
> ('A', 'C')
> ('B', 'A')
> ('B', 'B')
> ('B', 'C')
> ('C', 'A')
> ('C', 'B')
> ('C', 'C')
>
> But what I need are the permutations. The result I want to obtain is
> something like:
>
> DUMP R;
> ('A', 'A')
> ('A', 'B')
> ('A', 'C')
> ('B', 'B')
> ('B', 'C')
> ('C', 'C')
>
> My first idea to solve that was to generate de CROSS and then FILTER like:
>
> R = FILTER D BY $0 < $1;
>
> It works but I would like to know if there is a better way to do this
> without having to use string comparison and assume that only one field is
> used. For example a real scenario would look like:
>
> DUMP A;
> ('A1', 'A2')
> ('B1', 'B2')
>
> DUMP R;
> ('A1', 'A2', 'A1', 'A2')
> ('A1', 'A2', 'B1', 'B2')
> ('B1', 'B2', 'B1', 'B2')
>
> Thank you in advance.
> Christian
>