You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@pig.apache.org by Florian Zumkeller-Quast <fl...@adition.com> on 2012/07/25 11:48:32 UTC

Access only data from LEFT OUTER JOIN side of joined data without projection prefix

Hello,
I got the following code:

A = LOAD '§file1' USING AvroStorage();
B = LOAD '$file2' USING AvroStorage();
C = JOIN A BY id LEFT OUTER, B BY id;
SPLIT C INTO D IF B::id IS NULL, E OTHERWISE;

DESCRIBE shows the following data structure

D: {A::id: long,A::time: int,B::id: long,B::time: int}
E: {A::id: long,A::time: int,B::id: long,B::time: int}

But i can't store D and E using AvroStorage because the filed names contain 
"::" which is not an allowed character.

I need  structure like
F: {id: long,time: int}
where id = E::A::id and time = E::A::time.

The problem is: The number, name and type of fields may vary.

So E might looks like
E: {A::id: long,A::time: int,A::fieldN1,B::id: long,B::time: int,B::fieldN1 int}

Thus I can't use

F = FOREACH … GENERATE …;

because i don't want to write code for each filetype as long as I don't really 
need to.

Can someone give me an advice how to get the result I need?

Thanks!

With kind regards
Florian Zumkeller-Quast
-- 
Developer
________________________________________________________

ADITION technologies AG
Schwarzwaldstraße 78b
79117 Freiburg
 
http://www.adition.com
 
T +49 / (0)761 / 88147 - 30
F +49 / (0)761 / 88147 - 77
SUPPORT +49  / (0)1805 - ADITION
 
(Festnetzpreis 14 ct/min; Mobilfunkpreise maximal 42 ct/min)
 
Eingetragen beim Amtsgericht Düsseldorf unter HRB 54076
Vorstände: Andreas Kleiser, Jörg Klekamp, Tihomir Perkovic, Marcus Schlüter
Aufsichtsratsvorsitzender: Rechtsanwalt Daniel Raimer
UStIDNr.: DE 218 858 434

Re: Access only data from LEFT OUTER JOIN side of joined data without projection prefix

Posted by Alex Rovner <al...@gmail.com>.
We have two options here:
1. Fail in this case with appropriate error message
2. Or simply replace "::" with an allowable character. This way the relation is preserved and everyone is happy. 

Sent from my iPhone

On Jul 26, 2012, at 11:01 AM, Alan Gates <ga...@hortonworks.com> wrote:

> How will you handle ambiguities when there is an A::b and B::b?
> 
> Alan.
> 
> On Jul 26, 2012, at 6:54 AM, Alex Rovner wrote:
> 
>> I am proposing to patch avrostorage to have an option of storing field names without their relation name. A::b will be saved as "b". 
>> 
>> Thoughts?
>> 
>> Sent from my iPhone
>> 
>> On Jul 25, 2012, at 5:48 AM, "Florian Zumkeller-Quast" <fl...@adition.com> wrote:
>> 
>>> Hello,
>>> I got the following code:
>>> 
>>> A = LOAD '§file1' USING AvroStorage();
>>> B = LOAD '$file2' USING AvroStorage();
>>> C = JOIN A BY id LEFT OUTER, B BY id;
>>> SPLIT C INTO D IF B::id IS NULL, E OTHERWISE;
>>> 
>>> DESCRIBE shows the following data structure
>>> 
>>> D: {A::id: long,A::time: int,B::id: long,B::time: int}
>>> E: {A::id: long,A::time: int,B::id: long,B::time: int}
>>> 
>>> But i can't store D and E using AvroStorage because the filed names contain 
>>> "::" which is not an allowed character.
>>> 
>>> I need  structure like
>>> F: {id: long,time: int}
>>> where id = E::A::id and time = E::A::time.
>>> 
>>> The problem is: The number, name and type of fields may vary.
>>> 
>>> So E might looks like
>>> E: {A::id: long,A::time: int,A::fieldN1,B::id: long,B::time: int,B::fieldN1 int}
>>> 
>>> Thus I can't use
>>> 
>>> F = FOREACH … GENERATE …;
>>> 
>>> because i don't want to write code for each filetype as long as I don't really 
>>> need to.
>>> 
>>> Can someone give me an advice how to get the result I need?
>>> 
>>> Thanks!
>>> 
>>> With kind regards
>>> Florian Zumkeller-Quast
>>> -- 
>>> Developer
>>> ________________________________________________________
>>> 
>>> ADITION technologies AG
>>> Schwarzwaldstraße 78b
>>> 79117 Freiburg
>>> 
>>> http://www.adition.com
>>> 
>>> T +49 / (0)761 / 88147 - 30
>>> F +49 / (0)761 / 88147 - 77
>>> SUPPORT +49  / (0)1805 - ADITION
>>> 
>>> (Festnetzpreis 14 ct/min; Mobilfunkpreise maximal 42 ct/min)
>>> 
>>> Eingetragen beim Amtsgericht Düsseldorf unter HRB 54076
>>> Vorstände: Andreas Kleiser, Jörg Klekamp, Tihomir Perkovic, Marcus Schlüter
>>> Aufsichtsratsvorsitzender: Rechtsanwalt Daniel Raimer
>>> UStIDNr.: DE 218 858 434
> 

Re: Access only data from LEFT OUTER JOIN side of joined data without projection prefix

Posted by Russell Jurney <ru...@gmail.com>.
I transform my schemas to not have the Avro invalid character, ':' in
them, before I store. For example:

>>> D: {A::id: long,A::time: int,B::id: long,B::time: int}


D = foreach D generate A::id as a_id, A::time as a_time, B::id as
b_id, B::time as b_time;

You might try creating tuples for A and B, then you could access the
field names as A.id, A.time, B.id, B.time. For example:

D = foreach D generate ToTuple(A::id, A::time) as A, ToTuple(B::id,
B::time) as B;

That will store too, and should be scriptable with a macro?

Russell Jurney http://datasyndrome.com

On Jul 26, 2012, at 8:01 AM, Alan Gates <ga...@hortonworks.com> wrote:

> How will you handle ambiguities when there is an A::b and B::b?
>
> Alan.
>
> On Jul 26, 2012, at 6:54 AM, Alex Rovner wrote:
>
>> I am proposing to patch avrostorage to have an option of storing field names without their relation name. A::b will be saved as "b".
>>
>> Thoughts?
>>
>> Sent from my iPhone
>>
>> On Jul 25, 2012, at 5:48 AM, "Florian Zumkeller-Quast" <fl...@adition.com> wrote:
>>
>>> Hello,
>>> I got the following code:
>>>
>>> A = LOAD '§file1' USING AvroStorage();
>>> B = LOAD '$file2' USING AvroStorage();
>>> C = JOIN A BY id LEFT OUTER, B BY id;
>>> SPLIT C INTO D IF B::id IS NULL, E OTHERWISE;
>>>
>>> DESCRIBE shows the following data structure
>>>
>>> D: {A::id: long,A::time: int,B::id: long,B::time: int}
>>> E: {A::id: long,A::time: int,B::id: long,B::time: int}
>>>
>>> But i can't store D and E using AvroStorage because the filed names contain
>>> "::" which is not an allowed character.
>>>
>>> I need  structure like
>>> F: {id: long,time: int}
>>> where id = E::A::id and time = E::A::time.
>>>
>>> The problem is: The number, name and type of fields may vary.
>>>
>>> So E might looks like
>>> E: {A::id: long,A::time: int,A::fieldN1,B::id: long,B::time: int,B::fieldN1 int}
>>>
>>> Thus I can't use
>>>
>>> F = FOREACH … GENERATE …;
>>>
>>> because i don't want to write code for each filetype as long as I don't really
>>> need to.
>>>
>>> Can someone give me an advice how to get the result I need?
>>>
>>> Thanks!
>>>
>>> With kind regards
>>> Florian Zumkeller-Quast
>>> --
>>> Developer
>>> ________________________________________________________
>>>
>>> ADITION technologies AG
>>> Schwarzwaldstraße 78b
>>> 79117 Freiburg
>>>
>>> http://www.adition.com
>>>
>>> T +49 / (0)761 / 88147 - 30
>>> F +49 / (0)761 / 88147 - 77
>>> SUPPORT +49  / (0)1805 - ADITION
>>>
>>> (Festnetzpreis 14 ct/min; Mobilfunkpreise maximal 42 ct/min)
>>>
>>> Eingetragen beim Amtsgericht Düsseldorf unter HRB 54076
>>> Vorstände: Andreas Kleiser, Jörg Klekamp, Tihomir Perkovic, Marcus Schlüter
>>> Aufsichtsratsvorsitzender: Rechtsanwalt Daniel Raimer
>>> UStIDNr.: DE 218 858 434
>

Re: Access only data from LEFT OUTER JOIN side of joined data without projection prefix

Posted by Alan Gates <ga...@hortonworks.com>.
How will you handle ambiguities when there is an A::b and B::b?

Alan.

On Jul 26, 2012, at 6:54 AM, Alex Rovner wrote:

> I am proposing to patch avrostorage to have an option of storing field names without their relation name. A::b will be saved as "b". 
> 
> Thoughts?
> 
> Sent from my iPhone
> 
> On Jul 25, 2012, at 5:48 AM, "Florian Zumkeller-Quast" <fl...@adition.com> wrote:
> 
>> Hello,
>> I got the following code:
>> 
>> A = LOAD '§file1' USING AvroStorage();
>> B = LOAD '$file2' USING AvroStorage();
>> C = JOIN A BY id LEFT OUTER, B BY id;
>> SPLIT C INTO D IF B::id IS NULL, E OTHERWISE;
>> 
>> DESCRIBE shows the following data structure
>> 
>> D: {A::id: long,A::time: int,B::id: long,B::time: int}
>> E: {A::id: long,A::time: int,B::id: long,B::time: int}
>> 
>> But i can't store D and E using AvroStorage because the filed names contain 
>> "::" which is not an allowed character.
>> 
>> I need  structure like
>> F: {id: long,time: int}
>> where id = E::A::id and time = E::A::time.
>> 
>> The problem is: The number, name and type of fields may vary.
>> 
>> So E might looks like
>> E: {A::id: long,A::time: int,A::fieldN1,B::id: long,B::time: int,B::fieldN1 int}
>> 
>> Thus I can't use
>> 
>> F = FOREACH … GENERATE …;
>> 
>> because i don't want to write code for each filetype as long as I don't really 
>> need to.
>> 
>> Can someone give me an advice how to get the result I need?
>> 
>> Thanks!
>> 
>> With kind regards
>> Florian Zumkeller-Quast
>> -- 
>> Developer
>> ________________________________________________________
>> 
>> ADITION technologies AG
>> Schwarzwaldstraße 78b
>> 79117 Freiburg
>> 
>> http://www.adition.com
>> 
>> T +49 / (0)761 / 88147 - 30
>> F +49 / (0)761 / 88147 - 77
>> SUPPORT +49  / (0)1805 - ADITION
>> 
>> (Festnetzpreis 14 ct/min; Mobilfunkpreise maximal 42 ct/min)
>> 
>> Eingetragen beim Amtsgericht Düsseldorf unter HRB 54076
>> Vorstände: Andreas Kleiser, Jörg Klekamp, Tihomir Perkovic, Marcus Schlüter
>> Aufsichtsratsvorsitzender: Rechtsanwalt Daniel Raimer
>> UStIDNr.: DE 218 858 434


Re: Access only data from LEFT OUTER JOIN side of joined data without projection prefix

Posted by Alex Rovner <al...@gmail.com>.
I am proposing to patch avrostorage to have an option of storing field names without their relation name. A::b will be saved as "b". 

Thoughts?

Sent from my iPhone

On Jul 25, 2012, at 5:48 AM, "Florian Zumkeller-Quast" <fl...@adition.com> wrote:

> Hello,
> I got the following code:
> 
> A = LOAD '§file1' USING AvroStorage();
> B = LOAD '$file2' USING AvroStorage();
> C = JOIN A BY id LEFT OUTER, B BY id;
> SPLIT C INTO D IF B::id IS NULL, E OTHERWISE;
> 
> DESCRIBE shows the following data structure
> 
> D: {A::id: long,A::time: int,B::id: long,B::time: int}
> E: {A::id: long,A::time: int,B::id: long,B::time: int}
> 
> But i can't store D and E using AvroStorage because the filed names contain 
> "::" which is not an allowed character.
> 
> I need  structure like
> F: {id: long,time: int}
> where id = E::A::id and time = E::A::time.
> 
> The problem is: The number, name and type of fields may vary.
> 
> So E might looks like
> E: {A::id: long,A::time: int,A::fieldN1,B::id: long,B::time: int,B::fieldN1 int}
> 
> Thus I can't use
> 
> F = FOREACH … GENERATE …;
> 
> because i don't want to write code for each filetype as long as I don't really 
> need to.
> 
> Can someone give me an advice how to get the result I need?
> 
> Thanks!
> 
> With kind regards
> Florian Zumkeller-Quast
> -- 
> Developer
> ________________________________________________________
> 
> ADITION technologies AG
> Schwarzwaldstraße 78b
> 79117 Freiburg
> 
> http://www.adition.com
> 
> T +49 / (0)761 / 88147 - 30
> F +49 / (0)761 / 88147 - 77
> SUPPORT +49  / (0)1805 - ADITION
> 
> (Festnetzpreis 14 ct/min; Mobilfunkpreise maximal 42 ct/min)
> 
> Eingetragen beim Amtsgericht Düsseldorf unter HRB 54076
> Vorstände: Andreas Kleiser, Jörg Klekamp, Tihomir Perkovic, Marcus Schlüter
> Aufsichtsratsvorsitzender: Rechtsanwalt Daniel Raimer
> UStIDNr.: DE 218 858 434

Re: Access only data from LEFT OUTER JOIN side of joined data without projection prefix

Posted by Florian Zumkeller-Quast <fl...@adition.com>.
Am Donnerstag, 26. Juli 2012, 17:55:48 schrieb Florian Zumkeller-Quast:
> Am Mittwoch, 25. Juli 2012, 16:22:01 schrieb Alan Gates:
> > Basically you need to transform the schema, not the data.  The easiest
> > way I can think of to do that is to use a UDF that has an outputSchema
> > function that renames columns.  The exec call can then be a simple pass
> > through.
> 
> How do I pass the schema to the UDF?

The solution:

D = FOREACH C GENERATE myUDF(*);

This passes an unknown amount of uknown field names to the UDF as Tuple.
I just have to return it as Tuple.

If i try to store the return Tuple, I get an avro schema like:
But because I need the schema to be the same as the input schema (and this one 
differs because of the "null" and because of the data stored as tuple per 
record.

["null",{"type":"record","name":"TUPLE_0","fields":[{"name":"id","type":
["null","long"],"doc":"autogenerated from Pig Field Schema"},
{"name":"time","type":["null","int"],"doc":"autogenerated from Pig Field 
Schema"}]}]

Dumped with Pig:

((id, time, …))

My idea was to flatten it.

D = FOREACH C GENERATE flatten(myUDF(*));

This doesn't work either.

When I am trying to store it, it get the old error again:

ERROR 2999: Unexpected internal error. Illegal character in: null::id

So I'm back at the beginning - with the same but slightly different error.

Can someone give me an advice how to trip the "null" namespace?

The modified Storage was already proposed in this thread. Because i got no 
duplicate names inside the tuple, that might be an solution. But is it the 
only one?

Has anybody here an advice for me?
How do I get the inner tuple data as record with the schema like it already 
is? Without knowing names and number of fields!


With king regards

Florian Zumkeller-Quast
-- 
Developer
________________________________________________________

ADITION technologies AG
Schwarzwaldstraße 78b
79117 Freiburg
 
http://www.adition.com
 
T +49 / (0)761 / 88147 - 30
F +49 / (0)761 / 88147 - 77
SUPPORT +49  / (0)1805 - ADITION
 
(Festnetzpreis 14 ct/min; Mobilfunkpreise maximal 42 ct/min)
 
Eingetragen beim Amtsgericht Düsseldorf unter HRB 54076
Vorstände: Andreas Kleiser, Jörg Klekamp, Tihomir Perkovic, Marcus Schlüter
Aufsichtsratsvorsitzender: Rechtsanwalt Daniel Raimer
UStIDNr.: DE 218 858 434

Re: Access only data from LEFT OUTER JOIN side of joined data without projection prefix

Posted by Florian Zumkeller-Quast <fl...@adition.com>.
Am Mittwoch, 25. Juli 2012, 16:22:01 schrieb Alan Gates:
> Basically you need to transform the schema, not the data.  The easiest way
> I can think of to do that is to use a UDF that has an outputSchema
> function that renames columns.  The exec call can then be a simple pass
> through.  

How do I pass the schema to the UDF?

It is clear how the UDF code could look like.
But how does the corresponding pig script code has to look like?

A = LOAD …
B = LOAD …
C= JOIN A, B …


D = myUDF( C::A);
does obviously not work because there is no alias C::A

D = myUDF( C, 'A');
doesn't work either (Syntax error)

The last shown variant shown here is the first i thought of.
It think it would be the most intuitive way 

D = FOREACH C GENERATE myUDF( C, 'A');

But this is also invalid:
 Invalid scalar projection: d : A column needs to be projected from a relation 
for it to be used as a scalar.

So, my question ist: How do I pass the schema/data to my UDF if i don't know 
the number or names of the parameters?


With kind regards

Florian Zumkeller-Quast
-- 
Developer
________________________________________________________

ADITION technologies AG
Schwarzwaldstraße 78b
79117 Freiburg
 
http://www.adition.com
 
T +49 / (0)761 / 88147 - 30
F +49 / (0)761 / 88147 - 77
SUPPORT +49  / (0)1805 - ADITION
 
(Festnetzpreis 14 ct/min; Mobilfunkpreise maximal 42 ct/min)
 
Eingetragen beim Amtsgericht Düsseldorf unter HRB 54076
Vorstände: Andreas Kleiser, Jörg Klekamp, Tihomir Perkovic, Marcus Schlüter
Aufsichtsratsvorsitzender: Rechtsanwalt Daniel Raimer
UStIDNr.: DE 218 858 434

Re: Access only data from LEFT OUTER JOIN side of joined data without projection prefix

Posted by Alan Gates <ga...@hortonworks.com>.
Basically you need to transform the schema, not the data.  The easiest way I can think of to do that is to use a UDF that has an outputSchema function that renames columns.  The exec call can then be a simple pass through.  

If you wanted to you could have it consolidate the join keys.  You imply you would like to consolidate other columns as well (A::E::time in your example), but that is not valid.  Since time is not a join key it will not necessarily be the same in A and E.

Alan.

On Jul 25, 2012, at 2:48 AM, Florian Zumkeller-Quast wrote:

> Hello,
> I got the following code:
> 
> A = LOAD '§file1' USING AvroStorage();
> B = LOAD '$file2' USING AvroStorage();
> C = JOIN A BY id LEFT OUTER, B BY id;
> SPLIT C INTO D IF B::id IS NULL, E OTHERWISE;
> 
> DESCRIBE shows the following data structure
> 
> D: {A::id: long,A::time: int,B::id: long,B::time: int}
> E: {A::id: long,A::time: int,B::id: long,B::time: int}
> 
> But i can't store D and E using AvroStorage because the filed names contain 
> "::" which is not an allowed character.
> 
> I need  structure like
> F: {id: long,time: int}
> where id = E::A::id and time = E::A::time.
> 
> The problem is: The number, name and type of fields may vary.
> 
> So E might looks like
> E: {A::id: long,A::time: int,A::fieldN1,B::id: long,B::time: int,B::fieldN1 int}
> 
> Thus I can't use
> 
> F = FOREACH … GENERATE …;
> 
> because i don't want to write code for each filetype as long as I don't really 
> need to.
> 
> Can someone give me an advice how to get the result I need?
> 
> Thanks!
> 
> With kind regards
> Florian Zumkeller-Quast
> -- 
> Developer
> ________________________________________________________
> 
> ADITION technologies AG
> Schwarzwaldstraße 78b
> 79117 Freiburg
> 
> http://www.adition.com
> 
> T +49 / (0)761 / 88147 - 30
> F +49 / (0)761 / 88147 - 77
> SUPPORT +49  / (0)1805 - ADITION
> 
> (Festnetzpreis 14 ct/min; Mobilfunkpreise maximal 42 ct/min)
> 
> Eingetragen beim Amtsgericht Düsseldorf unter HRB 54076
> Vorstände: Andreas Kleiser, Jörg Klekamp, Tihomir Perkovic, Marcus Schlüter
> Aufsichtsratsvorsitzender: Rechtsanwalt Daniel Raimer
> UStIDNr.: DE 218 858 434