You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@pig.apache.org by rob parker <ro...@gmail.com> on 2011/07/29 18:47:07 UTC
Understand Schema after a Join
Trying to join two sets and generate a set from the join and I am getting a
$ hadoop fs -cat DIM/\*
2011,01,31
2011,02,28
2011,03,31
2011,04,30
2011,05,31
2011,06,30
2011,07,31
2011,08,31
2011,09,30
2011,10,31
2011,11,30
2011,12,31
$ hadoop fs -cat ACCT/\*
2011,7,26,key1,23.25,2470.0
2011,7,26,key2,10.416666666666668,232274.08333333334
2011,7,26,key3,82.83333333333333,541377.25
2011,7,26,key4,78.5,492823.33333333326
2011,7,26,key5,110.83333333333334,729811.9166666667
2011,7,26,key6,102.16666666666666,675941.25
2011,7,26,key7,118.91666666666666,770896.75
grunt> DIM = LOAD 'DIM' USING PigStorage(',') AS (year:int, month:int,
days:int);
grunt> ACCT = LOAD 'ACCT' USING PigStorage(',') AS (year:int, month:int,
day: int, account:chararray, metric1:double, metric2:double);
grunt> AjD = JOIN ACCT BY (year,month), DIM BY (year,month) USING
'replicated';
grunt> dump AjD;
...
(2011,7,26,key1,23.25,2470.0,2011,7,31)
(2011,7,26,key2,10.416666666666668,232274.08333333334,2011,7,31)
(2011,7,26,key3,82.83333333333333,541377.25,2011,7,31)
(2011,7,26,key4,78.5,492823.33333333326,2011,7,31)
(2011,7,26,key5,110.83333333333334,729811.9166666667,2011,7,31)
(2011,7,26,key6,102.16666666666666,675941.25,2011,7,31)
(2011,7,26,key7,118.91666666666666,770896.75,2011,7,31)
grunt> describe AjD;
AjD: {ACCT::year: int,ACCT::month: int,ACCT::day: int,ACCT::account:
chararray,ACCT::metric1: double,ACCT::metric2: double,DIM::year:
int,DIM::month: int,DIM::days: int}
grunt> FINAL = FOREACH AjD
>> GENERATE ACCT.year, ACCT.month, ACCT.account, (ACCT.metric2 / DIM.days);
grunt> dump FINAL;
...
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator
for alias FINAL. Backend error : Scalar has more than one row in the output.
1st : (2011,7,26,key1,23.25,2470.0), 2nd
:(2011,7,26,key2,10.416666666666668,232274.08333333334)
However if I store it and reload it to shed the "join" schema:
grunt> STORE AjD INTO 'AjD' using PigStorage(',');
grunt> AjD2 = LOAD 'AjD' USING PigStorage(',') AS (year:int, month:int,
day:int, account:chararray, metric1:double, metric2:double, year2:int,
month2:int, days:int);
grunt> FINAL = FOREACH AjD2
>> GENERATE year, month, account, (metric2 /days);
grunt> dump FINAL;
...
(2011,7,key1,79.6774193548387)
(2011,7,key2,7492.712365591398)
(2011,7,key3,17463.782258064515)
(2011,7,key4,15897.526881720427)
(2011,7,key5,23542.319892473122)
(2011,7,key6,21804.5564516129)
(2011,7,key7,24867.637096774193)
What am I missing to make this work without storing and reloading?
Thanks,
Rob
Re: Understand Schema after a Join
Posted by Thejas Nair <th...@hortonworks.com>.
On 7/29/11 9:47 AM, rob parker wrote:
> grunt> describe AjD;
> AjD: {ACCT::year: int,ACCT::month: int,ACCT::day: int,ACCT::account:
> chararray,ACCT::metric1: double,ACCT::metric2: double,DIM::year:
> int,DIM::month: int,DIM::days: int}
>
> grunt> FINAL = FOREACH AjD
>>> GENERATE ACCT.year, ACCT.month, ACCT.account, (ACCT.metric2 / DIM.days);
> grunt> dump FINAL;
> ...
> ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator
> for alias FINAL. Backend error : Scalar has more than one row in the output.
> 1st : (2011,7,26,key1,23.25,2470.0), 2nd
> :(2011,7,26,key2,10.416666666666668,232274.08333333334)
The column in AjD schema is ACCT::year, not ACCT.year (note the "::" vs
".").
So you need to change it to -
grunt> FINAL = FOREACH AjD
GENERATE ACCT::year, ACCT::month, ACCT::account, (ACCT::metric2 /
DIM::days);
You are accidentally using the relation-as-scalar feature here.
http://pig.apache.org/docs/r0.8.1/piglatin_ref2.html#Casting+Relations+to+Scalars
There is a jira open to deprecate the syntax for this feature so that
users don't accidentally end up using it -
https://issues.apache.org/jira/browse/PIG-1967
-Thejas
Re: Understand Schema after a Join
Posted by rob parker <ro...@gmail.com>.
Thanks!
Rob
On Fri, Jul 29, 2011 at 11:07 AM, Raghu Angadi <an...@gmail.com> wrote:
> Is implicit scalar conversion going to stay in PIG? My preference would to
> make it explicit like SCALAR(ACCT.year)..
>
> On Fri, Jul 29, 2011 at 11:00 AM, Raghu Angadi <an...@gmail.com> wrote:
>
> > > GENERATE ACCT.year, ACCT.month, ACCT.account, (ACCT.metric2 /
> DIM.days);
> >
> > should be GENERATE ACCT::year, ACCT::month ... etc.
> >
> > this is a common mistake to use '.' instead of '::'.. I wish the error
> > message is more user friendly..
> > PIG supports 'scalars' and assumes your ACCT would be a single row table
> at
> > runtime when access fields like ACCT.year.
> >
> >
> > On Fri, Jul 29, 2011 at 9:47 AM, rob parker <rob.s.parker@gmail.com
> >wrote:
> >
> >> Trying to join two sets and generate a set from the join and I am
> getting
> >> a
> >>
> >>
> >> $ hadoop fs -cat DIM/\*
> >> 2011,01,31
> >> 2011,02,28
> >> 2011,03,31
> >> 2011,04,30
> >> 2011,05,31
> >> 2011,06,30
> >> 2011,07,31
> >> 2011,08,31
> >> 2011,09,30
> >> 2011,10,31
> >> 2011,11,30
> >> 2011,12,31
> >>
> >>
> >> $ hadoop fs -cat ACCT/\*
> >> 2011,7,26,key1,23.25,2470.0
> >> 2011,7,26,key2,10.416666666666668,232274.08333333334
> >> 2011,7,26,key3,82.83333333333333,541377.25
> >> 2011,7,26,key4,78.5,492823.33333333326
> >> 2011,7,26,key5,110.83333333333334,729811.9166666667
> >> 2011,7,26,key6,102.16666666666666,675941.25
> >> 2011,7,26,key7,118.91666666666666,770896.75
> >>
> >>
> >> grunt> DIM = LOAD 'DIM' USING PigStorage(',') AS (year:int, month:int,
> >> days:int);
> >> grunt> ACCT = LOAD 'ACCT' USING PigStorage(',') AS (year:int, month:int,
> >> day: int, account:chararray, metric1:double, metric2:double);
> >> grunt> AjD = JOIN ACCT BY (year,month), DIM BY (year,month) USING
> >> 'replicated';
> >> grunt> dump AjD;
> >> ...
> >> (2011,7,26,key1,23.25,2470.0,2011,7,31)
> >> (2011,7,26,key2,10.416666666666668,232274.08333333334,2011,7,31)
> >> (2011,7,26,key3,82.83333333333333,541377.25,2011,7,31)
> >> (2011,7,26,key4,78.5,492823.33333333326,2011,7,31)
> >> (2011,7,26,key5,110.83333333333334,729811.9166666667,2011,7,31)
> >> (2011,7,26,key6,102.16666666666666,675941.25,2011,7,31)
> >> (2011,7,26,key7,118.91666666666666,770896.75,2011,7,31)
> >> grunt> describe AjD;
> >> AjD: {ACCT::year: int,ACCT::month: int,ACCT::day: int,ACCT::account:
> >> chararray,ACCT::metric1: double,ACCT::metric2: double,DIM::year:
> >> int,DIM::month: int,DIM::days: int}
> >>
> >> grunt> FINAL = FOREACH AjD
> >> >> GENERATE ACCT.year, ACCT.month, ACCT.account, (ACCT.metric2 /
> >> DIM.days);
> >> grunt> dump FINAL;
> >> ...
> >> ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open
> >> iterator
> >> for alias FINAL. Backend error : Scalar has more than one row in the
> >> output.
> >> 1st : (2011,7,26,key1,23.25,2470.0), 2nd
> >> :(2011,7,26,key2,10.416666666666668,232274.08333333334)
> >>
> >> However if I store it and reload it to shed the "join" schema:
> >>
> >> grunt> STORE AjD INTO 'AjD' using PigStorage(',');
> >> grunt> AjD2 = LOAD 'AjD' USING PigStorage(',') AS (year:int, month:int,
> >> day:int, account:chararray, metric1:double, metric2:double, year2:int,
> >> month2:int, days:int);
> >>
> >> grunt> FINAL = FOREACH AjD2
> >>
> >> >> GENERATE year, month, account, (metric2 /days);
> >>
> >> grunt> dump FINAL;
> >> ...
> >> (2011,7,key1,79.6774193548387)
> >> (2011,7,key2,7492.712365591398)
> >> (2011,7,key3,17463.782258064515)
> >> (2011,7,key4,15897.526881720427)
> >> (2011,7,key5,23542.319892473122)
> >> (2011,7,key6,21804.5564516129)
> >> (2011,7,key7,24867.637096774193)
> >>
> >> What am I missing to make this work without storing and reloading?
> >>
> >> Thanks,
> >> Rob
> >>
> >
> >
>
Re: Understand Schema after a Join
Posted by Thejas Nair <th...@hortonworks.com>.
On 7/29/11 11:07 AM, Raghu Angadi wrote:
> Is implicit scalar conversion going to stay in PIG? My preference would to
> make it explicit like SCALAR(ACCT.year)..
>
That's my preference as well , you can vote/ submit a patch ! -
https://issues.apache.org/jira/browse/PIG-1967
-Thejas
Re: Understand Schema after a Join
Posted by Raghu Angadi <an...@gmail.com>.
Is implicit scalar conversion going to stay in PIG? My preference would to
make it explicit like SCALAR(ACCT.year)..
On Fri, Jul 29, 2011 at 11:00 AM, Raghu Angadi <an...@gmail.com> wrote:
> > GENERATE ACCT.year, ACCT.month, ACCT.account, (ACCT.metric2 / DIM.days);
>
> should be GENERATE ACCT::year, ACCT::month ... etc.
>
> this is a common mistake to use '.' instead of '::'.. I wish the error
> message is more user friendly..
> PIG supports 'scalars' and assumes your ACCT would be a single row table at
> runtime when access fields like ACCT.year.
>
>
> On Fri, Jul 29, 2011 at 9:47 AM, rob parker <ro...@gmail.com>wrote:
>
>> Trying to join two sets and generate a set from the join and I am getting
>> a
>>
>>
>> $ hadoop fs -cat DIM/\*
>> 2011,01,31
>> 2011,02,28
>> 2011,03,31
>> 2011,04,30
>> 2011,05,31
>> 2011,06,30
>> 2011,07,31
>> 2011,08,31
>> 2011,09,30
>> 2011,10,31
>> 2011,11,30
>> 2011,12,31
>>
>>
>> $ hadoop fs -cat ACCT/\*
>> 2011,7,26,key1,23.25,2470.0
>> 2011,7,26,key2,10.416666666666668,232274.08333333334
>> 2011,7,26,key3,82.83333333333333,541377.25
>> 2011,7,26,key4,78.5,492823.33333333326
>> 2011,7,26,key5,110.83333333333334,729811.9166666667
>> 2011,7,26,key6,102.16666666666666,675941.25
>> 2011,7,26,key7,118.91666666666666,770896.75
>>
>>
>> grunt> DIM = LOAD 'DIM' USING PigStorage(',') AS (year:int, month:int,
>> days:int);
>> grunt> ACCT = LOAD 'ACCT' USING PigStorage(',') AS (year:int, month:int,
>> day: int, account:chararray, metric1:double, metric2:double);
>> grunt> AjD = JOIN ACCT BY (year,month), DIM BY (year,month) USING
>> 'replicated';
>> grunt> dump AjD;
>> ...
>> (2011,7,26,key1,23.25,2470.0,2011,7,31)
>> (2011,7,26,key2,10.416666666666668,232274.08333333334,2011,7,31)
>> (2011,7,26,key3,82.83333333333333,541377.25,2011,7,31)
>> (2011,7,26,key4,78.5,492823.33333333326,2011,7,31)
>> (2011,7,26,key5,110.83333333333334,729811.9166666667,2011,7,31)
>> (2011,7,26,key6,102.16666666666666,675941.25,2011,7,31)
>> (2011,7,26,key7,118.91666666666666,770896.75,2011,7,31)
>> grunt> describe AjD;
>> AjD: {ACCT::year: int,ACCT::month: int,ACCT::day: int,ACCT::account:
>> chararray,ACCT::metric1: double,ACCT::metric2: double,DIM::year:
>> int,DIM::month: int,DIM::days: int}
>>
>> grunt> FINAL = FOREACH AjD
>> >> GENERATE ACCT.year, ACCT.month, ACCT.account, (ACCT.metric2 /
>> DIM.days);
>> grunt> dump FINAL;
>> ...
>> ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open
>> iterator
>> for alias FINAL. Backend error : Scalar has more than one row in the
>> output.
>> 1st : (2011,7,26,key1,23.25,2470.0), 2nd
>> :(2011,7,26,key2,10.416666666666668,232274.08333333334)
>>
>> However if I store it and reload it to shed the "join" schema:
>>
>> grunt> STORE AjD INTO 'AjD' using PigStorage(',');
>> grunt> AjD2 = LOAD 'AjD' USING PigStorage(',') AS (year:int, month:int,
>> day:int, account:chararray, metric1:double, metric2:double, year2:int,
>> month2:int, days:int);
>>
>> grunt> FINAL = FOREACH AjD2
>>
>> >> GENERATE year, month, account, (metric2 /days);
>>
>> grunt> dump FINAL;
>> ...
>> (2011,7,key1,79.6774193548387)
>> (2011,7,key2,7492.712365591398)
>> (2011,7,key3,17463.782258064515)
>> (2011,7,key4,15897.526881720427)
>> (2011,7,key5,23542.319892473122)
>> (2011,7,key6,21804.5564516129)
>> (2011,7,key7,24867.637096774193)
>>
>> What am I missing to make this work without storing and reloading?
>>
>> Thanks,
>> Rob
>>
>
>
Re: Understand Schema after a Join
Posted by Raghu Angadi <an...@gmail.com>.
> GENERATE ACCT.year, ACCT.month, ACCT.account, (ACCT.metric2 / DIM.days);
should be GENERATE ACCT::year, ACCT::month ... etc.
this is a common mistake to use '.' instead of '::'.. I wish the error
message is more user friendly..
PIG supports 'scalars' and assumes your ACCT would be a single row table at
runtime when access fields like ACCT.year.
On Fri, Jul 29, 2011 at 9:47 AM, rob parker <ro...@gmail.com> wrote:
> Trying to join two sets and generate a set from the join and I am getting a
>
>
> $ hadoop fs -cat DIM/\*
> 2011,01,31
> 2011,02,28
> 2011,03,31
> 2011,04,30
> 2011,05,31
> 2011,06,30
> 2011,07,31
> 2011,08,31
> 2011,09,30
> 2011,10,31
> 2011,11,30
> 2011,12,31
>
>
> $ hadoop fs -cat ACCT/\*
> 2011,7,26,key1,23.25,2470.0
> 2011,7,26,key2,10.416666666666668,232274.08333333334
> 2011,7,26,key3,82.83333333333333,541377.25
> 2011,7,26,key4,78.5,492823.33333333326
> 2011,7,26,key5,110.83333333333334,729811.9166666667
> 2011,7,26,key6,102.16666666666666,675941.25
> 2011,7,26,key7,118.91666666666666,770896.75
>
>
> grunt> DIM = LOAD 'DIM' USING PigStorage(',') AS (year:int, month:int,
> days:int);
> grunt> ACCT = LOAD 'ACCT' USING PigStorage(',') AS (year:int, month:int,
> day: int, account:chararray, metric1:double, metric2:double);
> grunt> AjD = JOIN ACCT BY (year,month), DIM BY (year,month) USING
> 'replicated';
> grunt> dump AjD;
> ...
> (2011,7,26,key1,23.25,2470.0,2011,7,31)
> (2011,7,26,key2,10.416666666666668,232274.08333333334,2011,7,31)
> (2011,7,26,key3,82.83333333333333,541377.25,2011,7,31)
> (2011,7,26,key4,78.5,492823.33333333326,2011,7,31)
> (2011,7,26,key5,110.83333333333334,729811.9166666667,2011,7,31)
> (2011,7,26,key6,102.16666666666666,675941.25,2011,7,31)
> (2011,7,26,key7,118.91666666666666,770896.75,2011,7,31)
> grunt> describe AjD;
> AjD: {ACCT::year: int,ACCT::month: int,ACCT::day: int,ACCT::account:
> chararray,ACCT::metric1: double,ACCT::metric2: double,DIM::year:
> int,DIM::month: int,DIM::days: int}
>
> grunt> FINAL = FOREACH AjD
> >> GENERATE ACCT.year, ACCT.month, ACCT.account, (ACCT.metric2 / DIM.days);
> grunt> dump FINAL;
> ...
> ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open
> iterator
> for alias FINAL. Backend error : Scalar has more than one row in the
> output.
> 1st : (2011,7,26,key1,23.25,2470.0), 2nd
> :(2011,7,26,key2,10.416666666666668,232274.08333333334)
>
> However if I store it and reload it to shed the "join" schema:
>
> grunt> STORE AjD INTO 'AjD' using PigStorage(',');
> grunt> AjD2 = LOAD 'AjD' USING PigStorage(',') AS (year:int, month:int,
> day:int, account:chararray, metric1:double, metric2:double, year2:int,
> month2:int, days:int);
>
> grunt> FINAL = FOREACH AjD2
>
> >> GENERATE year, month, account, (metric2 /days);
>
> grunt> dump FINAL;
> ...
> (2011,7,key1,79.6774193548387)
> (2011,7,key2,7492.712365591398)
> (2011,7,key3,17463.782258064515)
> (2011,7,key4,15897.526881720427)
> (2011,7,key5,23542.319892473122)
> (2011,7,key6,21804.5564516129)
> (2011,7,key7,24867.637096774193)
>
> What am I missing to make this work without storing and reloading?
>
> Thanks,
> Rob
>
Re: Understand Schema after a Join
Posted by Norbert Burger <no...@gmail.com>.
Isn't the query parser getting confused because you've chosen the same
column name mappings for the LHS and RHS of your join?
What happens if you change this:
AjD = JOIN ACCT BY (year,month), DIM BY (year,month) USING 'replicated';
to:
AjD = JOIN ACCT BY ($0,$1), DIM BY ($0,$1) USING 'replicated';
Norbert
On Fri, Jul 29, 2011 at 12:47 PM, rob parker <ro...@gmail.com> wrote:
> Trying to join two sets and generate a set from the join and I am getting a
>
>
> $ hadoop fs -cat DIM/\*
> 2011,01,31
> 2011,02,28
> 2011,03,31
> 2011,04,30
> 2011,05,31
> 2011,06,30
> 2011,07,31
> 2011,08,31
> 2011,09,30
> 2011,10,31
> 2011,11,30
> 2011,12,31
>
>
> $ hadoop fs -cat ACCT/\*
> 2011,7,26,key1,23.25,2470.0
> 2011,7,26,key2,10.416666666666668,232274.08333333334
> 2011,7,26,key3,82.83333333333333,541377.25
> 2011,7,26,key4,78.5,492823.33333333326
> 2011,7,26,key5,110.83333333333334,729811.9166666667
> 2011,7,26,key6,102.16666666666666,675941.25
> 2011,7,26,key7,118.91666666666666,770896.75
>
>
> grunt> DIM = LOAD 'DIM' USING PigStorage(',') AS (year:int, month:int,
> days:int);
> grunt> ACCT = LOAD 'ACCT' USING PigStorage(',') AS (year:int, month:int,
> day: int, account:chararray, metric1:double, metric2:double);
> grunt> AjD = JOIN ACCT BY (year,month), DIM BY (year,month) USING
> 'replicated';
> grunt> dump AjD;
> ...
> (2011,7,26,key1,23.25,2470.0,2011,7,31)
> (2011,7,26,key2,10.416666666666668,232274.08333333334,2011,7,31)
> (2011,7,26,key3,82.83333333333333,541377.25,2011,7,31)
> (2011,7,26,key4,78.5,492823.33333333326,2011,7,31)
> (2011,7,26,key5,110.83333333333334,729811.9166666667,2011,7,31)
> (2011,7,26,key6,102.16666666666666,675941.25,2011,7,31)
> (2011,7,26,key7,118.91666666666666,770896.75,2011,7,31)
> grunt> describe AjD;
> AjD: {ACCT::year: int,ACCT::month: int,ACCT::day: int,ACCT::account:
> chararray,ACCT::metric1: double,ACCT::metric2: double,DIM::year:
> int,DIM::month: int,DIM::days: int}
>
> grunt> FINAL = FOREACH AjD
> >> GENERATE ACCT.year, ACCT.month, ACCT.account, (ACCT.metric2 / DIM.days);
> grunt> dump FINAL;
> ...
> ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open
> iterator
> for alias FINAL. Backend error : Scalar has more than one row in the
> output.
> 1st : (2011,7,26,key1,23.25,2470.0), 2nd
> :(2011,7,26,key2,10.416666666666668,232274.08333333334)
>
> However if I store it and reload it to shed the "join" schema:
>
> grunt> STORE AjD INTO 'AjD' using PigStorage(',');
> grunt> AjD2 = LOAD 'AjD' USING PigStorage(',') AS (year:int, month:int,
> day:int, account:chararray, metric1:double, metric2:double, year2:int,
> month2:int, days:int);
>
> grunt> FINAL = FOREACH AjD2
>
> >> GENERATE year, month, account, (metric2 /days);
>
> grunt> dump FINAL;
> ...
> (2011,7,key1,79.6774193548387)
> (2011,7,key2,7492.712365591398)
> (2011,7,key3,17463.782258064515)
> (2011,7,key4,15897.526881720427)
> (2011,7,key5,23542.319892473122)
> (2011,7,key6,21804.5564516129)
> (2011,7,key7,24867.637096774193)
>
> What am I missing to make this work without storing and reloading?
>
> Thanks,
> Rob
>