You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@pig.apache.org by Russell Jurney <rj...@ning.com> on 2009/10/14 01:35:30 UTC

Query Question

Sorry if this double sends, just subscribed from this account.

I am running the following query, and am having trouble getting what I want.
I have two data sources, and I am JOINING them, then grouping by a value.  I
then want to select the group name, and the row in the group that contains
the highest value of one of its metrics.

-- a list keyed by things with some other stuff
A = LOAD '/my/list/of/things/' USING PigStorage(',') AS
(group_name:chararray, thing1:chararray, thing2:chararray);

-- Load other source
B = LOAD '/my/time/series/metrics' USING PigStorage(,¹)
AS(thing1:chararray, metric1:chararray, metric2:chararray);

-- I have verified that the JOIN here is good data.
C = JOIN A BY QuotePull(thing1), B BY QuotePull(thing1);

-- This groups a series of metric by group_name.  Data looks good here too.
D = GROUP C BY group_name;

-- What I want from the next step:
The group_name, and all values for the line in its grouping that has the
highest value for one of its metrics.

I tried to do it in a FOREACH, using ORDER and LIMIT on elements of D, but I
do not how to access elements of the groups when iterating D.

Something like this:

HCOUNT = FOREACH D {
        P = ORDER <do not know what to put here> BY $2;
        K = LIMIT P 1;
        GENERATE group, K.$1, K.$2;
    };

But you can¹t apparently do that.  There is no example in the language
manual of referring to the items in the thing you are looping, in a FOREACH.

So next, I selected the highest value using MAX, and then I try to JOIN it
back in... The problem is, I have no idea how to JOIN datasets by two values
in Pig:

RESULT = JOIN C BY group_name, metric1, HCOUNT BY $0,  $1;
RESULT = JOIN C BY group_name, metric1, HCOUNT BY $0, $1;

The first has a syntax error.  The second gives: ERROR 1094: Attempt to
insert between two nodes that were not connected.

I¹m stumped.  I think this must be doable without writing a UDF.  How?

-- 
Russell Jurney, Viz
rjurney@ning.com
404-317-3620

Re: Query Question

Posted by Russell Jurney <rj...@ning.com>.

Thanks, I've gotten the following to work and it generates my desired
result:

HCOUNT = FOREACH D { K = ORDER C BY my_metric; L = LIMIT K 1; GENERATE
L.(name1, name2); };

However, I cannot make it generate L.name1, L.name2, or anything with two
values. Can you only generate one thing when referring to a new set inside a
FOREACH?  What is the rule here?

On 10/13/09 4:54 PM, "Dmitriy Ryaboy" <dv...@gmail.com> wrote:

> <do not know what to put here> is the relation name that you grouped,
> in this case, "C".
> 
> On Tue, Oct 13, 2009 at 7:35 PM, Russell Jurney <rj...@ning.com> wrote:
>> Sorry if this double sends, just subscribed from this account.
>> 
>> I am running the following query, and am having trouble getting what I want.
>> I have two data sources, and I am JOINING them, then grouping by a value.  I
>> then want to select the group name, and the row in the group that contains
>> the highest value of one of its metrics.
>> 
>> -- a list keyed by things with some other stuff
>> A = LOAD '/my/list/of/things/' USING PigStorage(',') AS
>> (group_name:chararray, thing1:chararray, thing2:chararray);
>> 
>> -- Load other source
>> B = LOAD '/my/time/series/metrics' USING PigStorage(Œ,¹)
>> AS(thing1:chararray, metric1:chararray, metric2:chararray);
>> 
>> -- I have verified that the JOIN here is good data.
>> C = JOIN A BY QuotePull(thing1), B BY QuotePull(thing1);
>> 
>> -- This groups a series of metric by group_name.  Data looks good here too.
>> D = GROUP C BY group_name;
>> 
>> -- What I want from the next step:
>> The group_name, and all values for the line in its grouping that has the
>> highest value for one of its metrics.
>> 
>> I tried to do it in a FOREACH, using ORDER and LIMIT on elements of D, but I
>> do not how to access elements of the groups when iterating D.
>> 
>> Something like this:
>> 
>> HCOUNT = FOREACH D {
>>        P = ORDER <do not know what to put here> BY $2;
>>        K = LIMIT P 1;
>>        GENERATE group, K.$1, K.$2;
>>    };
>> 
>> But you can¹t apparently do that.  There is no example in the language
>> manual of referring to the items in the thing you are looping, in a FOREACH.
>> 
>> So next, I selected the highest value using MAX, and then I try to JOIN it
>> back in... The problem is, I have no idea how to JOIN datasets by two values
>> in Pig:
>> 
>> RESULT = JOIN C BY group_name, metric1, HCOUNT BY $0,  $1;
>> RESULT = JOIN C BY group_name, metric1, HCOUNT BY $0, $1;
>> 
>> The first has a syntax error.  The second gives: ERROR 1094: Attempt to
>> insert between two nodes that were not connected.
>> 
>> I¹m stumped.  I think this must be doable without writing a UDF.  How?
>> 
>> --
>> Russell Jurney, Viz
>> rjurney@ning.com
>> 404-317-3620
>> 
>> 

-- 
Russell Jurney, Viz
rjurney@ning.com

Re: Query Question

Posted by Dmitriy Ryaboy <dv...@gmail.com>.

<do not know what to put here> is the relation name that you grouped,
in this case, "C".

On Tue, Oct 13, 2009 at 7:35 PM, Russell Jurney <rj...@ning.com> wrote:
> Sorry if this double sends, just subscribed from this account.
>
> I am running the following query, and am having trouble getting what I want.
> I have two data sources, and I am JOINING them, then grouping by a value.  I
> then want to select the group name, and the row in the group that contains
> the highest value of one of its metrics.
>
> -- a list keyed by things with some other stuff
> A = LOAD '/my/list/of/things/' USING PigStorage(',') AS
> (group_name:chararray, thing1:chararray, thing2:chararray);
>
> -- Load other source
> B = LOAD '/my/time/series/metrics' USING PigStorage(Œ,¹)
> AS(thing1:chararray, metric1:chararray, metric2:chararray);
>
> -- I have verified that the JOIN here is good data.
> C = JOIN A BY QuotePull(thing1), B BY QuotePull(thing1);
>
> -- This groups a series of metric by group_name.  Data looks good here too.
> D = GROUP C BY group_name;
>
> -- What I want from the next step:
> The group_name, and all values for the line in its grouping that has the
> highest value for one of its metrics.
>
> I tried to do it in a FOREACH, using ORDER and LIMIT on elements of D, but I
> do not how to access elements of the groups when iterating D.
>
> Something like this:
>
> HCOUNT = FOREACH D {
>        P = ORDER <do not know what to put here> BY $2;
>        K = LIMIT P 1;
>        GENERATE group, K.$1, K.$2;
>    };
>
> But you can¹t apparently do that.  There is no example in the language
> manual of referring to the items in the thing you are looping, in a FOREACH.
>
> So next, I selected the highest value using MAX, and then I try to JOIN it
> back in... The problem is, I have no idea how to JOIN datasets by two values
> in Pig:
>
> RESULT = JOIN C BY group_name, metric1, HCOUNT BY $0,  $1;
> RESULT = JOIN C BY group_name, metric1, HCOUNT BY $0, $1;
>
> The first has a syntax error.  The second gives: ERROR 1094: Attempt to
> insert between two nodes that were not connected.
>
> I¹m stumped.  I think this must be doable without writing a UDF.  How?
>
> --
> Russell Jurney, Viz
> rjurney@ning.com
> 404-317-3620
>
>

Re: Query Question

Posted by Russell Jurney <rj...@ning.com>.

Thanks!

On 10/13/09 4:52 PM, "Ashutosh Chauhan" <as...@gmail.com> wrote:

>> RESULT = JOIN C BY group_name, metric1, HCOUNT BY $0,  $1;
>> RESULT = JOIN C BY group_name, metric1, HCOUNT BY $0, $1;
>> 
>> The first has a syntax error.  The second gives: ERROR 1094: Attempt to
>> insert between two nodes that were not connected.
>> 
>> 
> This indeed is incorrect syntax. If you want to do join on expressions, you
> need to put parenthesis around them. For ex:
> 
> RESULT = JOIN C BY (group_name, metric1), HCOUNT BY ($0,  $1);
> 
> Hope it helps,
> Ashutosh

-- 
Russell Jurney, Viz
rjurney@ning.com
404-317-3620

Re: Query Question

Posted by Ashutosh Chauhan <as...@gmail.com>.

> RESULT = JOIN C BY group_name, metric1, HCOUNT BY $0,  $1;
> RESULT = JOIN C BY group_name, metric1, HCOUNT BY $0, $1;
>
> The first has a syntax error.  The second gives: ERROR 1094: Attempt to
> insert between two nodes that were not connected.
>
>
This indeed is incorrect syntax. If you want to do join on expressions, you
need to put parenthesis around them. For ex:

RESULT = JOIN C BY (group_name, metric1), HCOUNT BY ($0,  $1);

Hope it helps,
Ashutosh