You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@pig.apache.org by Florian Zumkeller-Quast <fl...@adition.com> on 2012/07/25 11:48:32 UTC
Access only data from LEFT OUTER JOIN side of joined data without projection prefix
Hello,
I got the following code:
A = LOAD '§file1' USING AvroStorage();
B = LOAD '$file2' USING AvroStorage();
C = JOIN A BY id LEFT OUTER, B BY id;
SPLIT C INTO D IF B::id IS NULL, E OTHERWISE;
DESCRIBE shows the following data structure
D: {A::id: long,A::time: int,B::id: long,B::time: int}
E: {A::id: long,A::time: int,B::id: long,B::time: int}
But i can't store D and E using AvroStorage because the filed names contain
"::" which is not an allowed character.
I need structure like
F: {id: long,time: int}
where id = E::A::id and time = E::A::time.
The problem is: The number, name and type of fields may vary.
So E might looks like
E: {A::id: long,A::time: int,A::fieldN1,B::id: long,B::time: int,B::fieldN1 int}
Thus I can't use
F = FOREACH … GENERATE …;
because i don't want to write code for each filetype as long as I don't really
need to.
Can someone give me an advice how to get the result I need?
Thanks!
With kind regards
Florian Zumkeller-Quast
--
Developer
________________________________________________________
ADITION technologies AG
Schwarzwaldstraße 78b
79117 Freiburg
http://www.adition.com
T +49 / (0)761 / 88147 - 30
F +49 / (0)761 / 88147 - 77
SUPPORT +49 / (0)1805 - ADITION
(Festnetzpreis 14 ct/min; Mobilfunkpreise maximal 42 ct/min)
Eingetragen beim Amtsgericht Düsseldorf unter HRB 54076
Vorstände: Andreas Kleiser, Jörg Klekamp, Tihomir Perkovic, Marcus Schlüter
Aufsichtsratsvorsitzender: Rechtsanwalt Daniel Raimer
UStIDNr.: DE 218 858 434
Re: Access only data from LEFT OUTER JOIN side of joined data without projection prefix
Posted by Alex Rovner <al...@gmail.com>.
We have two options here:
1. Fail in this case with appropriate error message
2. Or simply replace "::" with an allowable character. This way the relation is preserved and everyone is happy.
Sent from my iPhone
On Jul 26, 2012, at 11:01 AM, Alan Gates <ga...@hortonworks.com> wrote:
> How will you handle ambiguities when there is an A::b and B::b?
>
> Alan.
>
> On Jul 26, 2012, at 6:54 AM, Alex Rovner wrote:
>
>> I am proposing to patch avrostorage to have an option of storing field names without their relation name. A::b will be saved as "b".
>>
>> Thoughts?
>>
>> Sent from my iPhone
>>
>> On Jul 25, 2012, at 5:48 AM, "Florian Zumkeller-Quast" <fl...@adition.com> wrote:
>>
>>> Hello,
>>> I got the following code:
>>>
>>> A = LOAD '§file1' USING AvroStorage();
>>> B = LOAD '$file2' USING AvroStorage();
>>> C = JOIN A BY id LEFT OUTER, B BY id;
>>> SPLIT C INTO D IF B::id IS NULL, E OTHERWISE;
>>>
>>> DESCRIBE shows the following data structure
>>>
>>> D: {A::id: long,A::time: int,B::id: long,B::time: int}
>>> E: {A::id: long,A::time: int,B::id: long,B::time: int}
>>>
>>> But i can't store D and E using AvroStorage because the filed names contain
>>> "::" which is not an allowed character.
>>>
>>> I need structure like
>>> F: {id: long,time: int}
>>> where id = E::A::id and time = E::A::time.
>>>
>>> The problem is: The number, name and type of fields may vary.
>>>
>>> So E might looks like
>>> E: {A::id: long,A::time: int,A::fieldN1,B::id: long,B::time: int,B::fieldN1 int}
>>>
>>> Thus I can't use
>>>
>>> F = FOREACH … GENERATE …;
>>>
>>> because i don't want to write code for each filetype as long as I don't really
>>> need to.
>>>
>>> Can someone give me an advice how to get the result I need?
>>>
>>> Thanks!
>>>
>>> With kind regards
>>> Florian Zumkeller-Quast
>>> --
>>> Developer
>>> ________________________________________________________
>>>
>>> ADITION technologies AG
>>> Schwarzwaldstraße 78b
>>> 79117 Freiburg
>>>
>>> http://www.adition.com
>>>
>>> T +49 / (0)761 / 88147 - 30
>>> F +49 / (0)761 / 88147 - 77
>>> SUPPORT +49 / (0)1805 - ADITION
>>>
>>> (Festnetzpreis 14 ct/min; Mobilfunkpreise maximal 42 ct/min)
>>>
>>> Eingetragen beim Amtsgericht Düsseldorf unter HRB 54076
>>> Vorstände: Andreas Kleiser, Jörg Klekamp, Tihomir Perkovic, Marcus Schlüter
>>> Aufsichtsratsvorsitzender: Rechtsanwalt Daniel Raimer
>>> UStIDNr.: DE 218 858 434
>
Re: Access only data from LEFT OUTER JOIN side of joined data without
projection prefix
Posted by Russell Jurney <ru...@gmail.com>.
I transform my schemas to not have the Avro invalid character, ':' in
them, before I store. For example:
>>> D: {A::id: long,A::time: int,B::id: long,B::time: int}
D = foreach D generate A::id as a_id, A::time as a_time, B::id as
b_id, B::time as b_time;
You might try creating tuples for A and B, then you could access the
field names as A.id, A.time, B.id, B.time. For example:
D = foreach D generate ToTuple(A::id, A::time) as A, ToTuple(B::id,
B::time) as B;
That will store too, and should be scriptable with a macro?
Russell Jurney http://datasyndrome.com
On Jul 26, 2012, at 8:01 AM, Alan Gates <ga...@hortonworks.com> wrote:
> How will you handle ambiguities when there is an A::b and B::b?
>
> Alan.
>
> On Jul 26, 2012, at 6:54 AM, Alex Rovner wrote:
>
>> I am proposing to patch avrostorage to have an option of storing field names without their relation name. A::b will be saved as "b".
>>
>> Thoughts?
>>
>> Sent from my iPhone
>>
>> On Jul 25, 2012, at 5:48 AM, "Florian Zumkeller-Quast" <fl...@adition.com> wrote:
>>
>>> Hello,
>>> I got the following code:
>>>
>>> A = LOAD '§file1' USING AvroStorage();
>>> B = LOAD '$file2' USING AvroStorage();
>>> C = JOIN A BY id LEFT OUTER, B BY id;
>>> SPLIT C INTO D IF B::id IS NULL, E OTHERWISE;
>>>
>>> DESCRIBE shows the following data structure
>>>
>>> D: {A::id: long,A::time: int,B::id: long,B::time: int}
>>> E: {A::id: long,A::time: int,B::id: long,B::time: int}
>>>
>>> But i can't store D and E using AvroStorage because the filed names contain
>>> "::" which is not an allowed character.
>>>
>>> I need structure like
>>> F: {id: long,time: int}
>>> where id = E::A::id and time = E::A::time.
>>>
>>> The problem is: The number, name and type of fields may vary.
>>>
>>> So E might looks like
>>> E: {A::id: long,A::time: int,A::fieldN1,B::id: long,B::time: int,B::fieldN1 int}
>>>
>>> Thus I can't use
>>>
>>> F = FOREACH … GENERATE …;
>>>
>>> because i don't want to write code for each filetype as long as I don't really
>>> need to.
>>>
>>> Can someone give me an advice how to get the result I need?
>>>
>>> Thanks!
>>>
>>> With kind regards
>>> Florian Zumkeller-Quast
>>> --
>>> Developer
>>> ________________________________________________________
>>>
>>> ADITION technologies AG
>>> Schwarzwaldstraße 78b
>>> 79117 Freiburg
>>>
>>> http://www.adition.com
>>>
>>> T +49 / (0)761 / 88147 - 30
>>> F +49 / (0)761 / 88147 - 77
>>> SUPPORT +49 / (0)1805 - ADITION
>>>
>>> (Festnetzpreis 14 ct/min; Mobilfunkpreise maximal 42 ct/min)
>>>
>>> Eingetragen beim Amtsgericht Düsseldorf unter HRB 54076
>>> Vorstände: Andreas Kleiser, Jörg Klekamp, Tihomir Perkovic, Marcus Schlüter
>>> Aufsichtsratsvorsitzender: Rechtsanwalt Daniel Raimer
>>> UStIDNr.: DE 218 858 434
>
Re: Access only data from LEFT OUTER JOIN side of joined data without projection prefix
Posted by Alan Gates <ga...@hortonworks.com>.
How will you handle ambiguities when there is an A::b and B::b?
Alan.
On Jul 26, 2012, at 6:54 AM, Alex Rovner wrote:
> I am proposing to patch avrostorage to have an option of storing field names without their relation name. A::b will be saved as "b".
>
> Thoughts?
>
> Sent from my iPhone
>
> On Jul 25, 2012, at 5:48 AM, "Florian Zumkeller-Quast" <fl...@adition.com> wrote:
>
>> Hello,
>> I got the following code:
>>
>> A = LOAD '§file1' USING AvroStorage();
>> B = LOAD '$file2' USING AvroStorage();
>> C = JOIN A BY id LEFT OUTER, B BY id;
>> SPLIT C INTO D IF B::id IS NULL, E OTHERWISE;
>>
>> DESCRIBE shows the following data structure
>>
>> D: {A::id: long,A::time: int,B::id: long,B::time: int}
>> E: {A::id: long,A::time: int,B::id: long,B::time: int}
>>
>> But i can't store D and E using AvroStorage because the filed names contain
>> "::" which is not an allowed character.
>>
>> I need structure like
>> F: {id: long,time: int}
>> where id = E::A::id and time = E::A::time.
>>
>> The problem is: The number, name and type of fields may vary.
>>
>> So E might looks like
>> E: {A::id: long,A::time: int,A::fieldN1,B::id: long,B::time: int,B::fieldN1 int}
>>
>> Thus I can't use
>>
>> F = FOREACH … GENERATE …;
>>
>> because i don't want to write code for each filetype as long as I don't really
>> need to.
>>
>> Can someone give me an advice how to get the result I need?
>>
>> Thanks!
>>
>> With kind regards
>> Florian Zumkeller-Quast
>> --
>> Developer
>> ________________________________________________________
>>
>> ADITION technologies AG
>> Schwarzwaldstraße 78b
>> 79117 Freiburg
>>
>> http://www.adition.com
>>
>> T +49 / (0)761 / 88147 - 30
>> F +49 / (0)761 / 88147 - 77
>> SUPPORT +49 / (0)1805 - ADITION
>>
>> (Festnetzpreis 14 ct/min; Mobilfunkpreise maximal 42 ct/min)
>>
>> Eingetragen beim Amtsgericht Düsseldorf unter HRB 54076
>> Vorstände: Andreas Kleiser, Jörg Klekamp, Tihomir Perkovic, Marcus Schlüter
>> Aufsichtsratsvorsitzender: Rechtsanwalt Daniel Raimer
>> UStIDNr.: DE 218 858 434
Re: Access only data from LEFT OUTER JOIN side of joined data without projection prefix
Posted by Alex Rovner <al...@gmail.com>.
I am proposing to patch avrostorage to have an option of storing field names without their relation name. A::b will be saved as "b".
Thoughts?
Sent from my iPhone
On Jul 25, 2012, at 5:48 AM, "Florian Zumkeller-Quast" <fl...@adition.com> wrote:
> Hello,
> I got the following code:
>
> A = LOAD '§file1' USING AvroStorage();
> B = LOAD '$file2' USING AvroStorage();
> C = JOIN A BY id LEFT OUTER, B BY id;
> SPLIT C INTO D IF B::id IS NULL, E OTHERWISE;
>
> DESCRIBE shows the following data structure
>
> D: {A::id: long,A::time: int,B::id: long,B::time: int}
> E: {A::id: long,A::time: int,B::id: long,B::time: int}
>
> But i can't store D and E using AvroStorage because the filed names contain
> "::" which is not an allowed character.
>
> I need structure like
> F: {id: long,time: int}
> where id = E::A::id and time = E::A::time.
>
> The problem is: The number, name and type of fields may vary.
>
> So E might looks like
> E: {A::id: long,A::time: int,A::fieldN1,B::id: long,B::time: int,B::fieldN1 int}
>
> Thus I can't use
>
> F = FOREACH … GENERATE …;
>
> because i don't want to write code for each filetype as long as I don't really
> need to.
>
> Can someone give me an advice how to get the result I need?
>
> Thanks!
>
> With kind regards
> Florian Zumkeller-Quast
> --
> Developer
> ________________________________________________________
>
> ADITION technologies AG
> Schwarzwaldstraße 78b
> 79117 Freiburg
>
> http://www.adition.com
>
> T +49 / (0)761 / 88147 - 30
> F +49 / (0)761 / 88147 - 77
> SUPPORT +49 / (0)1805 - ADITION
>
> (Festnetzpreis 14 ct/min; Mobilfunkpreise maximal 42 ct/min)
>
> Eingetragen beim Amtsgericht Düsseldorf unter HRB 54076
> Vorstände: Andreas Kleiser, Jörg Klekamp, Tihomir Perkovic, Marcus Schlüter
> Aufsichtsratsvorsitzender: Rechtsanwalt Daniel Raimer
> UStIDNr.: DE 218 858 434
Re: Access only data from LEFT OUTER JOIN side of joined data without projection prefix
Posted by Florian Zumkeller-Quast <fl...@adition.com>.
Am Donnerstag, 26. Juli 2012, 17:55:48 schrieb Florian Zumkeller-Quast:
> Am Mittwoch, 25. Juli 2012, 16:22:01 schrieb Alan Gates:
> > Basically you need to transform the schema, not the data. The easiest
> > way I can think of to do that is to use a UDF that has an outputSchema
> > function that renames columns. The exec call can then be a simple pass
> > through.
>
> How do I pass the schema to the UDF?
The solution:
D = FOREACH C GENERATE myUDF(*);
This passes an unknown amount of uknown field names to the UDF as Tuple.
I just have to return it as Tuple.
If i try to store the return Tuple, I get an avro schema like:
But because I need the schema to be the same as the input schema (and this one
differs because of the "null" and because of the data stored as tuple per
record.
["null",{"type":"record","name":"TUPLE_0","fields":[{"name":"id","type":
["null","long"],"doc":"autogenerated from Pig Field Schema"},
{"name":"time","type":["null","int"],"doc":"autogenerated from Pig Field
Schema"}]}]
Dumped with Pig:
((id, time, …))
My idea was to flatten it.
D = FOREACH C GENERATE flatten(myUDF(*));
This doesn't work either.
When I am trying to store it, it get the old error again:
ERROR 2999: Unexpected internal error. Illegal character in: null::id
So I'm back at the beginning - with the same but slightly different error.
Can someone give me an advice how to trip the "null" namespace?
The modified Storage was already proposed in this thread. Because i got no
duplicate names inside the tuple, that might be an solution. But is it the
only one?
Has anybody here an advice for me?
How do I get the inner tuple data as record with the schema like it already
is? Without knowing names and number of fields!
With king regards
Florian Zumkeller-Quast
--
Developer
________________________________________________________
ADITION technologies AG
Schwarzwaldstraße 78b
79117 Freiburg
http://www.adition.com
T +49 / (0)761 / 88147 - 30
F +49 / (0)761 / 88147 - 77
SUPPORT +49 / (0)1805 - ADITION
(Festnetzpreis 14 ct/min; Mobilfunkpreise maximal 42 ct/min)
Eingetragen beim Amtsgericht Düsseldorf unter HRB 54076
Vorstände: Andreas Kleiser, Jörg Klekamp, Tihomir Perkovic, Marcus Schlüter
Aufsichtsratsvorsitzender: Rechtsanwalt Daniel Raimer
UStIDNr.: DE 218 858 434
Re: Access only data from LEFT OUTER JOIN side of joined data without projection prefix
Posted by Florian Zumkeller-Quast <fl...@adition.com>.
Am Mittwoch, 25. Juli 2012, 16:22:01 schrieb Alan Gates:
> Basically you need to transform the schema, not the data. The easiest way
> I can think of to do that is to use a UDF that has an outputSchema
> function that renames columns. The exec call can then be a simple pass
> through.
How do I pass the schema to the UDF?
It is clear how the UDF code could look like.
But how does the corresponding pig script code has to look like?
A = LOAD …
B = LOAD …
C= JOIN A, B …
D = myUDF( C::A);
does obviously not work because there is no alias C::A
D = myUDF( C, 'A');
doesn't work either (Syntax error)
The last shown variant shown here is the first i thought of.
It think it would be the most intuitive way
D = FOREACH C GENERATE myUDF( C, 'A');
But this is also invalid:
Invalid scalar projection: d : A column needs to be projected from a relation
for it to be used as a scalar.
So, my question ist: How do I pass the schema/data to my UDF if i don't know
the number or names of the parameters?
With kind regards
Florian Zumkeller-Quast
--
Developer
________________________________________________________
ADITION technologies AG
Schwarzwaldstraße 78b
79117 Freiburg
http://www.adition.com
T +49 / (0)761 / 88147 - 30
F +49 / (0)761 / 88147 - 77
SUPPORT +49 / (0)1805 - ADITION
(Festnetzpreis 14 ct/min; Mobilfunkpreise maximal 42 ct/min)
Eingetragen beim Amtsgericht Düsseldorf unter HRB 54076
Vorstände: Andreas Kleiser, Jörg Klekamp, Tihomir Perkovic, Marcus Schlüter
Aufsichtsratsvorsitzender: Rechtsanwalt Daniel Raimer
UStIDNr.: DE 218 858 434
Re: Access only data from LEFT OUTER JOIN side of joined data without projection prefix
Posted by Alan Gates <ga...@hortonworks.com>.
Basically you need to transform the schema, not the data. The easiest way I can think of to do that is to use a UDF that has an outputSchema function that renames columns. The exec call can then be a simple pass through.
If you wanted to you could have it consolidate the join keys. You imply you would like to consolidate other columns as well (A::E::time in your example), but that is not valid. Since time is not a join key it will not necessarily be the same in A and E.
Alan.
On Jul 25, 2012, at 2:48 AM, Florian Zumkeller-Quast wrote:
> Hello,
> I got the following code:
>
> A = LOAD '§file1' USING AvroStorage();
> B = LOAD '$file2' USING AvroStorage();
> C = JOIN A BY id LEFT OUTER, B BY id;
> SPLIT C INTO D IF B::id IS NULL, E OTHERWISE;
>
> DESCRIBE shows the following data structure
>
> D: {A::id: long,A::time: int,B::id: long,B::time: int}
> E: {A::id: long,A::time: int,B::id: long,B::time: int}
>
> But i can't store D and E using AvroStorage because the filed names contain
> "::" which is not an allowed character.
>
> I need structure like
> F: {id: long,time: int}
> where id = E::A::id and time = E::A::time.
>
> The problem is: The number, name and type of fields may vary.
>
> So E might looks like
> E: {A::id: long,A::time: int,A::fieldN1,B::id: long,B::time: int,B::fieldN1 int}
>
> Thus I can't use
>
> F = FOREACH … GENERATE …;
>
> because i don't want to write code for each filetype as long as I don't really
> need to.
>
> Can someone give me an advice how to get the result I need?
>
> Thanks!
>
> With kind regards
> Florian Zumkeller-Quast
> --
> Developer
> ________________________________________________________
>
> ADITION technologies AG
> Schwarzwaldstraße 78b
> 79117 Freiburg
>
> http://www.adition.com
>
> T +49 / (0)761 / 88147 - 30
> F +49 / (0)761 / 88147 - 77
> SUPPORT +49 / (0)1805 - ADITION
>
> (Festnetzpreis 14 ct/min; Mobilfunkpreise maximal 42 ct/min)
>
> Eingetragen beim Amtsgericht Düsseldorf unter HRB 54076
> Vorstände: Andreas Kleiser, Jörg Klekamp, Tihomir Perkovic, Marcus Schlüter
> Aufsichtsratsvorsitzender: Rechtsanwalt Daniel Raimer
> UStIDNr.: DE 218 858 434