You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@pig.apache.org by Russell Jurney <rj...@ning.com> on 2009/10/14 01:35:30 UTC
Query Question
Sorry if this double sends, just subscribed from this account.
I am running the following query, and am having trouble getting what I want.
I have two data sources, and I am JOINING them, then grouping by a value. I
then want to select the group name, and the row in the group that contains
the highest value of one of its metrics.
-- a list keyed by things with some other stuff
A = LOAD '/my/list/of/things/' USING PigStorage(',') AS
(group_name:chararray, thing1:chararray, thing2:chararray);
-- Load other source
B = LOAD '/my/time/series/metrics' USING PigStorage(,¹)
AS(thing1:chararray, metric1:chararray, metric2:chararray);
-- I have verified that the JOIN here is good data.
C = JOIN A BY QuotePull(thing1), B BY QuotePull(thing1);
-- This groups a series of metric by group_name. Data looks good here too.
D = GROUP C BY group_name;
-- What I want from the next step:
The group_name, and all values for the line in its grouping that has the
highest value for one of its metrics.
I tried to do it in a FOREACH, using ORDER and LIMIT on elements of D, but I
do not how to access elements of the groups when iterating D.
Something like this:
HCOUNT = FOREACH D {
P = ORDER <do not know what to put here> BY $2;
K = LIMIT P 1;
GENERATE group, K.$1, K.$2;
};
But you can¹t apparently do that. There is no example in the language
manual of referring to the items in the thing you are looping, in a FOREACH.
So next, I selected the highest value using MAX, and then I try to JOIN it
back in... The problem is, I have no idea how to JOIN datasets by two values
in Pig:
RESULT = JOIN C BY group_name, metric1, HCOUNT BY $0, $1;
RESULT = JOIN C BY group_name, metric1, HCOUNT BY $0, $1;
The first has a syntax error. The second gives: ERROR 1094: Attempt to
insert between two nodes that were not connected.
I¹m stumped. I think this must be doable without writing a UDF. How?
--
Russell Jurney, Viz
rjurney@ning.com
404-317-3620
Re: Query Question
Posted by Russell Jurney <rj...@ning.com>.
Thanks, I've gotten the following to work and it generates my desired
result:
HCOUNT = FOREACH D { K = ORDER C BY my_metric; L = LIMIT K 1; GENERATE
L.(name1, name2); };
However, I cannot make it generate L.name1, L.name2, or anything with two
values. Can you only generate one thing when referring to a new set inside a
FOREACH? What is the rule here?
On 10/13/09 4:54 PM, "Dmitriy Ryaboy" <dv...@gmail.com> wrote:
> <do not know what to put here> is the relation name that you grouped,
> in this case, "C".
>
> On Tue, Oct 13, 2009 at 7:35 PM, Russell Jurney <rj...@ning.com> wrote:
>> Sorry if this double sends, just subscribed from this account.
>>
>> I am running the following query, and am having trouble getting what I want.
>> I have two data sources, and I am JOINING them, then grouping by a value. I
>> then want to select the group name, and the row in the group that contains
>> the highest value of one of its metrics.
>>
>> -- a list keyed by things with some other stuff
>> A = LOAD '/my/list/of/things/' USING PigStorage(',') AS
>> (group_name:chararray, thing1:chararray, thing2:chararray);
>>
>> -- Load other source
>> B = LOAD '/my/time/series/metrics' USING PigStorage(Œ,¹)
>> AS(thing1:chararray, metric1:chararray, metric2:chararray);
>>
>> -- I have verified that the JOIN here is good data.
>> C = JOIN A BY QuotePull(thing1), B BY QuotePull(thing1);
>>
>> -- This groups a series of metric by group_name. Data looks good here too.
>> D = GROUP C BY group_name;
>>
>> -- What I want from the next step:
>> The group_name, and all values for the line in its grouping that has the
>> highest value for one of its metrics.
>>
>> I tried to do it in a FOREACH, using ORDER and LIMIT on elements of D, but I
>> do not how to access elements of the groups when iterating D.
>>
>> Something like this:
>>
>> HCOUNT = FOREACH D {
>> P = ORDER <do not know what to put here> BY $2;
>> K = LIMIT P 1;
>> GENERATE group, K.$1, K.$2;
>> };
>>
>> But you can¹t apparently do that. There is no example in the language
>> manual of referring to the items in the thing you are looping, in a FOREACH.
>>
>> So next, I selected the highest value using MAX, and then I try to JOIN it
>> back in... The problem is, I have no idea how to JOIN datasets by two values
>> in Pig:
>>
>> RESULT = JOIN C BY group_name, metric1, HCOUNT BY $0, $1;
>> RESULT = JOIN C BY group_name, metric1, HCOUNT BY $0, $1;
>>
>> The first has a syntax error. The second gives: ERROR 1094: Attempt to
>> insert between two nodes that were not connected.
>>
>> I¹m stumped. I think this must be doable without writing a UDF. How?
>>
>> --
>> Russell Jurney, Viz
>> rjurney@ning.com
>> 404-317-3620
>>
>>
--
Russell Jurney, Viz
rjurney@ning.com
Re: Query Question
Posted by Dmitriy Ryaboy <dv...@gmail.com>.
<do not know what to put here> is the relation name that you grouped,
in this case, "C".
On Tue, Oct 13, 2009 at 7:35 PM, Russell Jurney <rj...@ning.com> wrote:
> Sorry if this double sends, just subscribed from this account.
>
> I am running the following query, and am having trouble getting what I want.
> I have two data sources, and I am JOINING them, then grouping by a value. I
> then want to select the group name, and the row in the group that contains
> the highest value of one of its metrics.
>
> -- a list keyed by things with some other stuff
> A = LOAD '/my/list/of/things/' USING PigStorage(',') AS
> (group_name:chararray, thing1:chararray, thing2:chararray);
>
> -- Load other source
> B = LOAD '/my/time/series/metrics' USING PigStorage(Œ,¹)
> AS(thing1:chararray, metric1:chararray, metric2:chararray);
>
> -- I have verified that the JOIN here is good data.
> C = JOIN A BY QuotePull(thing1), B BY QuotePull(thing1);
>
> -- This groups a series of metric by group_name. Data looks good here too.
> D = GROUP C BY group_name;
>
> -- What I want from the next step:
> The group_name, and all values for the line in its grouping that has the
> highest value for one of its metrics.
>
> I tried to do it in a FOREACH, using ORDER and LIMIT on elements of D, but I
> do not how to access elements of the groups when iterating D.
>
> Something like this:
>
> HCOUNT = FOREACH D {
> P = ORDER <do not know what to put here> BY $2;
> K = LIMIT P 1;
> GENERATE group, K.$1, K.$2;
> };
>
> But you can¹t apparently do that. There is no example in the language
> manual of referring to the items in the thing you are looping, in a FOREACH.
>
> So next, I selected the highest value using MAX, and then I try to JOIN it
> back in... The problem is, I have no idea how to JOIN datasets by two values
> in Pig:
>
> RESULT = JOIN C BY group_name, metric1, HCOUNT BY $0, $1;
> RESULT = JOIN C BY group_name, metric1, HCOUNT BY $0, $1;
>
> The first has a syntax error. The second gives: ERROR 1094: Attempt to
> insert between two nodes that were not connected.
>
> I¹m stumped. I think this must be doable without writing a UDF. How?
>
> --
> Russell Jurney, Viz
> rjurney@ning.com
> 404-317-3620
>
>
Re: Query Question
Posted by Russell Jurney <rj...@ning.com>.
Thanks!
On 10/13/09 4:52 PM, "Ashutosh Chauhan" <as...@gmail.com> wrote:
>> RESULT = JOIN C BY group_name, metric1, HCOUNT BY $0, $1;
>> RESULT = JOIN C BY group_name, metric1, HCOUNT BY $0, $1;
>>
>> The first has a syntax error. The second gives: ERROR 1094: Attempt to
>> insert between two nodes that were not connected.
>>
>>
> This indeed is incorrect syntax. If you want to do join on expressions, you
> need to put parenthesis around them. For ex:
>
> RESULT = JOIN C BY (group_name, metric1), HCOUNT BY ($0, $1);
>
> Hope it helps,
> Ashutosh
--
Russell Jurney, Viz
rjurney@ning.com
404-317-3620
Re: Query Question
Posted by Ashutosh Chauhan <as...@gmail.com>.
> RESULT = JOIN C BY group_name, metric1, HCOUNT BY $0, $1;
> RESULT = JOIN C BY group_name, metric1, HCOUNT BY $0, $1;
>
> The first has a syntax error. The second gives: ERROR 1094: Attempt to
> insert between two nodes that were not connected.
>
>
This indeed is incorrect syntax. If you want to do join on expressions, you
need to put parenthesis around them. For ex:
RESULT = JOIN C BY (group_name, metric1), HCOUNT BY ($0, $1);
Hope it helps,
Ashutosh