You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@pig.apache.org by Iván de Prado <iv...@gmail.com> on 2008/06/11 16:16:20 UTC

Outer joins

Lets suppose 1.txt is:

1       1       1
2       1       1
3       1       1
4       1       1
5       1       1

And 2.txt is:

4       2       2
5       2       2
6       2       2
7       2       2

The script: 

A = LOAD 'ivan/1.txt' USING PigStorage();
B = LOAD 'ivan/2.txt' USING PigStorage();
C = COGROUP A by $0, B by $0;
D = FOREACH C GENERATE flatten(A), flatten(B);

dump C:

(1, {(1, 1, 1)}, {})
(2, {(2, 1, 1)}, {})
(3, {(3, 1, 1)}, {})
(4, {(4, 1, 1)}, {(4, 2, 2)})
(5, {(5, 1, 1)}, {(5, 2, 2)})
(6, {}, {(6, 2, 2)})
(7, {}, {(7, 2, 2)})
(8, {}, {(8, 2, 2)})

dump D;

(4, 1, 1, 4, 2, 2)
(5, 1, 1, 5, 2, 2)

But this is not the result that I expected. I would like to obtain this
result:

(1, 1, 1, 1, '', '', '')
(2, 2, 1, 1, '', '', '')
(2, 3, 1, 1, '', '', '')
(4, 1, 1, 4, 2, 2)
(5, 1, 1, 5, 2, 2)
('','','',6, 2, 2)
('','','',7, 2, 2)
('','','',8, 2, 2)

This is the expected result if you do an outer join in SQL. How can I
modify the script to get this result?  (apart of doing 3 FILTERs over C
and then a UNION)

Thanks and regards,
Iván de Prado
www.ivanprado.es


Re: Outer joins

Posted by Iván de Prado <iv...@gmail.com>.
I think the reason is that the flatten is a cross product. So {} crossed
with something is nothing. 

Iván de Prado
www.ivanprado.es 

El mié, 11-06-2008 a las 08:25 -0700, Prashanth Pappu escribió:
> But why doesn't
> 
> D = FOREACH C GENERATE FLATTEN(A), FLATTEN(B);
> dump D;
> 
> give
> 
> (1, 1, 1 )
> (2, 1, 1 )
> (3, 1, 1 )
> (4, 1, 1, 4, 2, 2)
> (5, 1, 1, 5, 2, 2)
> (6, 2, 2)
> (7, 2, 2)
> 
> FLATTEN({(1,1,1)}), FLATTEN({}) = {()} ?
> 
> Prashanth
> 
> 
> 
> 
> On Wed, Jun 11, 2008 at 7:39 AM, Alan Gates <ga...@yahoo-inc.com> wrote:
> 
> > 3 foreachs and a union should get you what you want (I think that's what
> > you meant instead of filter).  It's a little nasty but I don't know of a way
> > around it.
> >
> > As for why pig does this, in order to realize SQL outer join semantics, it
> > is necessary to know what ought to be there for the table that doesn't have
> > a match.  So in your example, pig would need to know that both 1.txt and
> > 2.txt have 3 fields.  But pig, by default, does not have metadata.  So it
> > does not know.
> >
> > And it cannot figure it out by looking at existing rows.  This wouldn't
> > work in the case where one file was empty.  Also, unlike SQL, pig does not
> > place a constraint on your data that all tuples in a file be the same.  So
> > it would be legal for 1.txt to have a last line that was just '1   2'.  That
> > means pig can't guess what null columns it should append in the outer join.
> >
> > All that said, we are adding various ways for pig to interface with
> > metadata when it is available (though not require it).  Once we have that I
> > think we should support SQL outer join semantics when that is what the user
> > wants.
> >
> > Alan.
> >
> >
> > Iván de Prado wrote:
> >
> >> Lets suppose 1.txt is:
> >>
> >> 1       1       1
> >> 2       1       1
> >> 3       1       1
> >> 4       1       1
> >> 5       1       1
> >>
> >> And 2.txt is:
> >>
> >> 4       2       2
> >> 5       2       2
> >> 6       2       2
> >> 7       2       2
> >>
> >> The script:
> >> A = LOAD 'ivan/1.txt' USING PigStorage();
> >> B = LOAD 'ivan/2.txt' USING PigStorage();
> >> C = COGROUP A by $0, B by $0;
> >> D = FOREACH C GENERATE flatten(A), flatten(B);
> >>
> >> dump C:
> >>
> >> (1, {(1, 1, 1)}, {})
> >> (2, {(2, 1, 1)}, {})
> >> (3, {(3, 1, 1)}, {})
> >> (4, {(4, 1, 1)}, {(4, 2, 2)})
> >> (5, {(5, 1, 1)}, {(5, 2, 2)})
> >> (6, {}, {(6, 2, 2)})
> >> (7, {}, {(7, 2, 2)})
> >> (8, {}, {(8, 2, 2)})
> >>
> >> dump D;
> >>
> >> (4, 1, 1, 4, 2, 2)
> >> (5, 1, 1, 5, 2, 2)
> >>
> >> But this is not the result that I expected. I would like to obtain this
> >> result:
> >>
> >> (1, 1, 1, 1, '', '', '')
> >> (2, 2, 1, 1, '', '', '')
> >> (2, 3, 1, 1, '', '', '')
> >> (4, 1, 1, 4, 2, 2)
> >> (5, 1, 1, 5, 2, 2)
> >> ('','','',6, 2, 2)
> >> ('','','',7, 2, 2)
> >> ('','','',8, 2, 2)
> >>
> >> This is the expected result if you do an outer join in SQL. How can I
> >> modify the script to get this result?  (apart of doing 3 FILTERs over C
> >> and then a UNION)
> >>
> >> Thanks and regards,
> >> Iván de Prado
> >> www.ivanprado.es
> >>
> >>
> >>
> >


Re: Outer joins

Posted by Prashanth Pappu <pr...@conviva.com>.
But why doesn't

D = FOREACH C GENERATE FLATTEN(A), FLATTEN(B);
dump D;

give

(1, 1, 1 )
(2, 1, 1 )
(3, 1, 1 )
(4, 1, 1, 4, 2, 2)
(5, 1, 1, 5, 2, 2)
(6, 2, 2)
(7, 2, 2)

FLATTEN({(1,1,1)}), FLATTEN({}) = {()} ?

Prashanth




On Wed, Jun 11, 2008 at 7:39 AM, Alan Gates <ga...@yahoo-inc.com> wrote:

> 3 foreachs and a union should get you what you want (I think that's what
> you meant instead of filter).  It's a little nasty but I don't know of a way
> around it.
>
> As for why pig does this, in order to realize SQL outer join semantics, it
> is necessary to know what ought to be there for the table that doesn't have
> a match.  So in your example, pig would need to know that both 1.txt and
> 2.txt have 3 fields.  But pig, by default, does not have metadata.  So it
> does not know.
>
> And it cannot figure it out by looking at existing rows.  This wouldn't
> work in the case where one file was empty.  Also, unlike SQL, pig does not
> place a constraint on your data that all tuples in a file be the same.  So
> it would be legal for 1.txt to have a last line that was just '1   2'.  That
> means pig can't guess what null columns it should append in the outer join.
>
> All that said, we are adding various ways for pig to interface with
> metadata when it is available (though not require it).  Once we have that I
> think we should support SQL outer join semantics when that is what the user
> wants.
>
> Alan.
>
>
> Iván de Prado wrote:
>
>> Lets suppose 1.txt is:
>>
>> 1       1       1
>> 2       1       1
>> 3       1       1
>> 4       1       1
>> 5       1       1
>>
>> And 2.txt is:
>>
>> 4       2       2
>> 5       2       2
>> 6       2       2
>> 7       2       2
>>
>> The script:
>> A = LOAD 'ivan/1.txt' USING PigStorage();
>> B = LOAD 'ivan/2.txt' USING PigStorage();
>> C = COGROUP A by $0, B by $0;
>> D = FOREACH C GENERATE flatten(A), flatten(B);
>>
>> dump C:
>>
>> (1, {(1, 1, 1)}, {})
>> (2, {(2, 1, 1)}, {})
>> (3, {(3, 1, 1)}, {})
>> (4, {(4, 1, 1)}, {(4, 2, 2)})
>> (5, {(5, 1, 1)}, {(5, 2, 2)})
>> (6, {}, {(6, 2, 2)})
>> (7, {}, {(7, 2, 2)})
>> (8, {}, {(8, 2, 2)})
>>
>> dump D;
>>
>> (4, 1, 1, 4, 2, 2)
>> (5, 1, 1, 5, 2, 2)
>>
>> But this is not the result that I expected. I would like to obtain this
>> result:
>>
>> (1, 1, 1, 1, '', '', '')
>> (2, 2, 1, 1, '', '', '')
>> (2, 3, 1, 1, '', '', '')
>> (4, 1, 1, 4, 2, 2)
>> (5, 1, 1, 5, 2, 2)
>> ('','','',6, 2, 2)
>> ('','','',7, 2, 2)
>> ('','','',8, 2, 2)
>>
>> This is the expected result if you do an outer join in SQL. How can I
>> modify the script to get this result?  (apart of doing 3 FILTERs over C
>> and then a UNION)
>>
>> Thanks and regards,
>> Iván de Prado
>> www.ivanprado.es
>>
>>
>>
>

Re: Outer joins

Posted by Iván de Prado <iv...@gmail.com>.
I understand your points. 

I was thinking in a posible solution that is not currently supported by
Pig:

D = FOREACH C GENERATE (ARITY(A) == 0 ? '','','' : flatten(A)), (ARITY(B) == 0 ? '','','' : flatten(B));

That is not currently supported by Pig. Have the proposal sense? Would
it be difficult to add this possibility to the BindCond? 

Thanks, 
Iván de Prado
www.ivanprado.es 

El mié, 11-06-2008 a las 07:39 -0700, Alan Gates escribió:
> 3 foreachs and a union should get you what you want (I think that's what 
> you meant instead of filter).  It's a little nasty but I don't know of a 
> way around it.
> 
> As for why pig does this, in order to realize SQL outer join semantics, 
> it is necessary to know what ought to be there for the table that 
> doesn't have a match.  So in your example, pig would need to know that 
> both 1.txt and 2.txt have 3 fields.  But pig, by default, does not have 
> metadata.  So it does not know.
> 
> And it cannot figure it out by looking at existing rows.  This wouldn't 
> work in the case where one file was empty.  Also, unlike SQL, pig does 
> not place a constraint on your data that all tuples in a file be the 
> same.  So it would be legal for 1.txt to have a last line that was just 
> '1   2'.  That means pig can't guess what null columns it should append 
> in the outer join.
> 
> All that said, we are adding various ways for pig to interface with 
> metadata when it is available (though not require it).  Once we have 
> that I think we should support SQL outer join semantics when that is 
> what the user wants.
> 
> Alan.
> 
> Iván de Prado wrote:
> > Lets suppose 1.txt is:
> >
> > 1       1       1
> > 2       1       1
> > 3       1       1
> > 4       1       1
> > 5       1       1
> >
> > And 2.txt is:
> >
> > 4       2       2
> > 5       2       2
> > 6       2       2
> > 7       2       2
> >
> > The script: 
> >
> > A = LOAD 'ivan/1.txt' USING PigStorage();
> > B = LOAD 'ivan/2.txt' USING PigStorage();
> > C = COGROUP A by $0, B by $0;
> > D = FOREACH C GENERATE flatten(A), flatten(B);
> >
> > dump C:
> >
> > (1, {(1, 1, 1)}, {})
> > (2, {(2, 1, 1)}, {})
> > (3, {(3, 1, 1)}, {})
> > (4, {(4, 1, 1)}, {(4, 2, 2)})
> > (5, {(5, 1, 1)}, {(5, 2, 2)})
> > (6, {}, {(6, 2, 2)})
> > (7, {}, {(7, 2, 2)})
> > (8, {}, {(8, 2, 2)})
> >
> > dump D;
> >
> > (4, 1, 1, 4, 2, 2)
> > (5, 1, 1, 5, 2, 2)
> >
> > But this is not the result that I expected. I would like to obtain this
> > result:
> >
> > (1, 1, 1, 1, '', '', '')
> > (2, 2, 1, 1, '', '', '')
> > (2, 3, 1, 1, '', '', '')
> > (4, 1, 1, 4, 2, 2)
> > (5, 1, 1, 5, 2, 2)
> > ('','','',6, 2, 2)
> > ('','','',7, 2, 2)
> > ('','','',8, 2, 2)
> >
> > This is the expected result if you do an outer join in SQL. How can I
> > modify the script to get this result?  (apart of doing 3 FILTERs over C
> > and then a UNION)
> >
> > Thanks and regards,
> > Iván de Prado
> > www.ivanprado.es
> >
> >   


Re: Outer joins

Posted by Alan Gates <ga...@yahoo-inc.com>.
3 foreachs and a union should get you what you want (I think that's what 
you meant instead of filter).  It's a little nasty but I don't know of a 
way around it.

As for why pig does this, in order to realize SQL outer join semantics, 
it is necessary to know what ought to be there for the table that 
doesn't have a match.  So in your example, pig would need to know that 
both 1.txt and 2.txt have 3 fields.  But pig, by default, does not have 
metadata.  So it does not know.

And it cannot figure it out by looking at existing rows.  This wouldn't 
work in the case where one file was empty.  Also, unlike SQL, pig does 
not place a constraint on your data that all tuples in a file be the 
same.  So it would be legal for 1.txt to have a last line that was just 
'1   2'.  That means pig can't guess what null columns it should append 
in the outer join.

All that said, we are adding various ways for pig to interface with 
metadata when it is available (though not require it).  Once we have 
that I think we should support SQL outer join semantics when that is 
what the user wants.

Alan.

Iván de Prado wrote:
> Lets suppose 1.txt is:
>
> 1       1       1
> 2       1       1
> 3       1       1
> 4       1       1
> 5       1       1
>
> And 2.txt is:
>
> 4       2       2
> 5       2       2
> 6       2       2
> 7       2       2
>
> The script: 
>
> A = LOAD 'ivan/1.txt' USING PigStorage();
> B = LOAD 'ivan/2.txt' USING PigStorage();
> C = COGROUP A by $0, B by $0;
> D = FOREACH C GENERATE flatten(A), flatten(B);
>
> dump C:
>
> (1, {(1, 1, 1)}, {})
> (2, {(2, 1, 1)}, {})
> (3, {(3, 1, 1)}, {})
> (4, {(4, 1, 1)}, {(4, 2, 2)})
> (5, {(5, 1, 1)}, {(5, 2, 2)})
> (6, {}, {(6, 2, 2)})
> (7, {}, {(7, 2, 2)})
> (8, {}, {(8, 2, 2)})
>
> dump D;
>
> (4, 1, 1, 4, 2, 2)
> (5, 1, 1, 5, 2, 2)
>
> But this is not the result that I expected. I would like to obtain this
> result:
>
> (1, 1, 1, 1, '', '', '')
> (2, 2, 1, 1, '', '', '')
> (2, 3, 1, 1, '', '', '')
> (4, 1, 1, 4, 2, 2)
> (5, 1, 1, 5, 2, 2)
> ('','','',6, 2, 2)
> ('','','',7, 2, 2)
> ('','','',8, 2, 2)
>
> This is the expected result if you do an outer join in SQL. How can I
> modify the script to get this result?  (apart of doing 3 FILTERs over C
> and then a UNION)
>
> Thanks and regards,
> Iván de Prado
> www.ivanprado.es
>
>   

Re: Outer joins

Posted by Iván de Prado <iv...@gmail.com>.
Thanks!! That works and is enough for solving this issue. 

Another thing. I have notice that if you have UDFs whose package name
have some reserved words (like com.mycompany.mipackage.eval.Func) Grunt
fails. I had had to change two times my package names because of that.
One because the word 'eval' and the other because the word 'load'. 

Thanks, 
Iván de Prado
www.ivanprado.es


El mié, 11-06-2008 a las 10:36 -0500, Tanton Gibbs escribió:
> I created a simple UDF that defines an "Identity" tuple to allow me to
> use constants.  If there is a built-in way you can use that, instead.
> 
> Here is the UDF:
> import org.apache.pig.EvalFunc;
> import org.apache.pig.data.Tuple;
> import java.io.IOException;
> 
> public class IdentityTuple extends EvalFunc<Tuple> {
> 
>   @Override
>   public void exec(Tuple input, Tuple output) throws IOException {
>     output.copyFrom(input);
>   }
> }
> 
> Here is the script:
> register Identity.jar
> A = LOAD 'mytestA.txt' USING PigStorage();
> B = LOAD 'mytestB.txt' USING PigStorage();
> C = COGROUP A BY $0, B BY $0;
> D = FOREACH C GENERATE flatten(((COUNT(A) == '0') ? IdentityTuple('',
> '', '') : A)), flatten(((COUNT(B) == '0') ? IdentityTuple('', '', '')
> : B));
> dump D;
> 
> Here is the output:
> (1, 1, 1, , , )
> (2, 1, 1, , , )
> (3, 1, 1, , , )
> (4, 1, 1, 4, 2, 2)
> (5, 1, 1, 5, 2, 2)
> (, , , 6, 2, 2)
> (, , , 7, 2, 2)
> 
> On Wed, Jun 11, 2008 at 10:07 AM, Tanton Gibbs <ta...@gmail.com> wrote:
> > You can almost do it, but I can't seem to figure out how to generate a
> > constant tuple.
> >
> > Here is code that works and gets close to what you want, but not quite:
> >
> > $ cat pigscript.test
> > A = LOAD 'mytestA.txt' USING PigStorage();
> > B = LOAD 'mytestB.txt' USING PigStorage();
> > C = COGROUP A BY $0, B BY $0;
> > D = FOREACH C GENERATE flatten(((COUNT(A) == '0') ? '' : A)),
> > flatten(((COUNT(B) == '0') ? '' : B));
> > dump D;
> >
> > (1, 1, 1, )
> > (2, 1, 1, )
> > (3, 1, 1, )
> > (4, 1, 1, 4, 2, 2)
> > (5, 1, 1, 5, 2, 2)
> > (, 6, 2, 2)
> > (, 7, 2, 2)
> >
> > Now, if  I could figure out how to use a constant ('', '', '') instead
> > of the single '', then I would have what you are looking for, but I
> > can't seem to get that to work.
> >
> > Ideas?
> >
> > On Wed, Jun 11, 2008 at 9:16 AM, Iván de Prado
> > <iv...@gmail.com> wrote:
> >> Lets suppose 1.txt is:
> >>
> >> 1       1       1
> >> 2       1       1
> >> 3       1       1
> >> 4       1       1
> >> 5       1       1
> >>
> >> And 2.txt is:
> >>
> >> 4       2       2
> >> 5       2       2
> >> 6       2       2
> >> 7       2       2
> >>
> >> The script:
> >>
> >> A = LOAD 'ivan/1.txt' USING PigStorage();
> >> B = LOAD 'ivan/2.txt' USING PigStorage();
> >> C = COGROUP A by $0, B by $0;
> >> D = FOREACH C GENERATE flatten(A), flatten(B);
> >>
> >> dump C:
> >>
> >> (1, {(1, 1, 1)}, {})
> >> (2, {(2, 1, 1)}, {})
> >> (3, {(3, 1, 1)}, {})
> >> (4, {(4, 1, 1)}, {(4, 2, 2)})
> >> (5, {(5, 1, 1)}, {(5, 2, 2)})
> >> (6, {}, {(6, 2, 2)})
> >> (7, {}, {(7, 2, 2)})
> >> (8, {}, {(8, 2, 2)})
> >>
> >> dump D;
> >>
> >> (4, 1, 1, 4, 2, 2)
> >> (5, 1, 1, 5, 2, 2)
> >>
> >> But this is not the result that I expected. I would like to obtain this
> >> result:
> >>
> >> (1, 1, 1, 1, '', '', '')
> >> (2, 2, 1, 1, '', '', '')
> >> (2, 3, 1, 1, '', '', '')
> >> (4, 1, 1, 4, 2, 2)
> >> (5, 1, 1, 5, 2, 2)
> >> ('','','',6, 2, 2)
> >> ('','','',7, 2, 2)
> >> ('','','',8, 2, 2)
> >>
> >> This is the expected result if you do an outer join in SQL. How can I
> >> modify the script to get this result?  (apart of doing 3 FILTERs over C
> >> and then a UNION)
> >>
> >> Thanks and regards,
> >> Iván de Prado
> >> www.ivanprado.es
> >>
> >>
> >


Re: Outer joins

Posted by Tanton Gibbs <ta...@gmail.com>.
I created a simple UDF that defines an "Identity" tuple to allow me to
use constants.  If there is a built-in way you can use that, instead.

Here is the UDF:
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import java.io.IOException;

public class IdentityTuple extends EvalFunc<Tuple> {

  @Override
  public void exec(Tuple input, Tuple output) throws IOException {
    output.copyFrom(input);
  }
}

Here is the script:
register Identity.jar
A = LOAD 'mytestA.txt' USING PigStorage();
B = LOAD 'mytestB.txt' USING PigStorage();
C = COGROUP A BY $0, B BY $0;
D = FOREACH C GENERATE flatten(((COUNT(A) == '0') ? IdentityTuple('',
'', '') : A)), flatten(((COUNT(B) == '0') ? IdentityTuple('', '', '')
: B));
dump D;

Here is the output:
(1, 1, 1, , , )
(2, 1, 1, , , )
(3, 1, 1, , , )
(4, 1, 1, 4, 2, 2)
(5, 1, 1, 5, 2, 2)
(, , , 6, 2, 2)
(, , , 7, 2, 2)

On Wed, Jun 11, 2008 at 10:07 AM, Tanton Gibbs <ta...@gmail.com> wrote:
> You can almost do it, but I can't seem to figure out how to generate a
> constant tuple.
>
> Here is code that works and gets close to what you want, but not quite:
>
> $ cat pigscript.test
> A = LOAD 'mytestA.txt' USING PigStorage();
> B = LOAD 'mytestB.txt' USING PigStorage();
> C = COGROUP A BY $0, B BY $0;
> D = FOREACH C GENERATE flatten(((COUNT(A) == '0') ? '' : A)),
> flatten(((COUNT(B) == '0') ? '' : B));
> dump D;
>
> (1, 1, 1, )
> (2, 1, 1, )
> (3, 1, 1, )
> (4, 1, 1, 4, 2, 2)
> (5, 1, 1, 5, 2, 2)
> (, 6, 2, 2)
> (, 7, 2, 2)
>
> Now, if  I could figure out how to use a constant ('', '', '') instead
> of the single '', then I would have what you are looking for, but I
> can't seem to get that to work.
>
> Ideas?
>
> On Wed, Jun 11, 2008 at 9:16 AM, Iván de Prado
> <iv...@gmail.com> wrote:
>> Lets suppose 1.txt is:
>>
>> 1       1       1
>> 2       1       1
>> 3       1       1
>> 4       1       1
>> 5       1       1
>>
>> And 2.txt is:
>>
>> 4       2       2
>> 5       2       2
>> 6       2       2
>> 7       2       2
>>
>> The script:
>>
>> A = LOAD 'ivan/1.txt' USING PigStorage();
>> B = LOAD 'ivan/2.txt' USING PigStorage();
>> C = COGROUP A by $0, B by $0;
>> D = FOREACH C GENERATE flatten(A), flatten(B);
>>
>> dump C:
>>
>> (1, {(1, 1, 1)}, {})
>> (2, {(2, 1, 1)}, {})
>> (3, {(3, 1, 1)}, {})
>> (4, {(4, 1, 1)}, {(4, 2, 2)})
>> (5, {(5, 1, 1)}, {(5, 2, 2)})
>> (6, {}, {(6, 2, 2)})
>> (7, {}, {(7, 2, 2)})
>> (8, {}, {(8, 2, 2)})
>>
>> dump D;
>>
>> (4, 1, 1, 4, 2, 2)
>> (5, 1, 1, 5, 2, 2)
>>
>> But this is not the result that I expected. I would like to obtain this
>> result:
>>
>> (1, 1, 1, 1, '', '', '')
>> (2, 2, 1, 1, '', '', '')
>> (2, 3, 1, 1, '', '', '')
>> (4, 1, 1, 4, 2, 2)
>> (5, 1, 1, 5, 2, 2)
>> ('','','',6, 2, 2)
>> ('','','',7, 2, 2)
>> ('','','',8, 2, 2)
>>
>> This is the expected result if you do an outer join in SQL. How can I
>> modify the script to get this result?  (apart of doing 3 FILTERs over C
>> and then a UNION)
>>
>> Thanks and regards,
>> Iván de Prado
>> www.ivanprado.es
>>
>>
>

Re: Outer joins

Posted by Tanton Gibbs <ta...@gmail.com>.
You can almost do it, but I can't seem to figure out how to generate a
constant tuple.

Here is code that works and gets close to what you want, but not quite:

$ cat pigscript.test
A = LOAD 'mytestA.txt' USING PigStorage();
B = LOAD 'mytestB.txt' USING PigStorage();
C = COGROUP A BY $0, B BY $0;
D = FOREACH C GENERATE flatten(((COUNT(A) == '0') ? '' : A)),
flatten(((COUNT(B) == '0') ? '' : B));
dump D;

(1, 1, 1, )
(2, 1, 1, )
(3, 1, 1, )
(4, 1, 1, 4, 2, 2)
(5, 1, 1, 5, 2, 2)
(, 6, 2, 2)
(, 7, 2, 2)

Now, if  I could figure out how to use a constant ('', '', '') instead
of the single '', then I would have what you are looking for, but I
can't seem to get that to work.

Ideas?

On Wed, Jun 11, 2008 at 9:16 AM, Iván de Prado
<iv...@gmail.com> wrote:
> Lets suppose 1.txt is:
>
> 1       1       1
> 2       1       1
> 3       1       1
> 4       1       1
> 5       1       1
>
> And 2.txt is:
>
> 4       2       2
> 5       2       2
> 6       2       2
> 7       2       2
>
> The script:
>
> A = LOAD 'ivan/1.txt' USING PigStorage();
> B = LOAD 'ivan/2.txt' USING PigStorage();
> C = COGROUP A by $0, B by $0;
> D = FOREACH C GENERATE flatten(A), flatten(B);
>
> dump C:
>
> (1, {(1, 1, 1)}, {})
> (2, {(2, 1, 1)}, {})
> (3, {(3, 1, 1)}, {})
> (4, {(4, 1, 1)}, {(4, 2, 2)})
> (5, {(5, 1, 1)}, {(5, 2, 2)})
> (6, {}, {(6, 2, 2)})
> (7, {}, {(7, 2, 2)})
> (8, {}, {(8, 2, 2)})
>
> dump D;
>
> (4, 1, 1, 4, 2, 2)
> (5, 1, 1, 5, 2, 2)
>
> But this is not the result that I expected. I would like to obtain this
> result:
>
> (1, 1, 1, 1, '', '', '')
> (2, 2, 1, 1, '', '', '')
> (2, 3, 1, 1, '', '', '')
> (4, 1, 1, 4, 2, 2)
> (5, 1, 1, 5, 2, 2)
> ('','','',6, 2, 2)
> ('','','',7, 2, 2)
> ('','','',8, 2, 2)
>
> This is the expected result if you do an outer join in SQL. How can I
> modify the script to get this result?  (apart of doing 3 FILTERs over C
> and then a UNION)
>
> Thanks and regards,
> Iván de Prado
> www.ivanprado.es
>
>