Posted to user@pig.apache.org by Greg Langmead <gl...@languageweaver.com> on 2010/05/05 23:06:37 UTC

Help identifying missing value

At an intermediate point in my processing, I have these tuples:

DUMP X;
(A,1L,1L)
(A,2L,2L)
(A,3L,6L)
(A,5L,1L)

The middle element of these tuples can be any integer from 1 to 5, and the third element can be any positive integer. (These data points mean, for example for the third tuple, "I saw 6 distinct words that started with the letter A that occurred 3 times each.") The math I need to do next requires knowing that there were 0 words that occurred 4 times, so I need to group these four tuples into one record that lets me ask "what is the value that goes with 1, ... what is the value that goes with 5".

I could stream these through a script and do what I want, but I'm new to Pig and I'd like to explore what can be done strictly within Pig.

Maybe I could gather these into a tuple, but with a 0 at the position for 4:

($-NT,1L,2L,6L,0L,1L)

or else somehow generate a map from this:

($NT, 1L#1L, 2L#2L, 3L#6L, 5L#1L)

which would also alert me to the absence of 4L. Can I do either of these things?

Thanks,
Greg Langmead
Research Scientist
Language Weaver, Inc.

RE: Help identifying missing value

Posted by Richard Ding <rd...@yahoo-inc.com>.
Right now there is no UDF that converts a bag of tuples into a map. But
you can always write one :-) 
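
A minimal sketch of the logic such a UDF would need, written as plain Python
rather than against Pig's UDF API. The function name bag_to_map and the choice
to sum values for a repeated key are assumptions made for illustration; the
sample bag is the one from the question.

```python
def bag_to_map(bag):
    """Turn a bag of (key, value) pairs into a dict keyed by the first field."""
    result = {}
    for key, value in bag:
        # Assumption: if a key repeats, add the values instead of overwriting.
        result[key] = result.get(key, 0) + value
    return result


# The bag for group A, with Pig longs written as Python ints.
bag = [(1, 1), (2, 2), (3, 6), (5, 1)]
counts = bag_to_map(bag)

# Any key in 1..5 that is absent from the map is a missing value.
missing = [k for k in range(1, 6) if k not in counts]

print(counts)   # {1: 1, 2: 2, 3: 6, 5: 1}
print(missing)  # [4]
```

Wiring this into Pig as an actual UDF is not shown here.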

Thanks,
-Richard

Re: Help identifying missing value

Posted by Gianmarco <gi...@gmail.com>.
Is it possible to generate a map inside a foreach?


Something like:

 a = load 'input' USING PigStorage(',') AS (l:chararray,n1:long,n2:long);
 b = group a BY l;
 c = foreach b { ones = filter a BY n1 == 1; GENERATE FLATTEN([1#ones]) ;};

(Of course this does not compile, but I didn't manage to generate even a
simple map like [1#2] programmatically, so there must be something
wrong with my approach.)


Gianmarco





RE: Help identifying missing value

Posted by Richard Ding <rd...@yahoo-inc.com>.
Using group by and foreach you can get tuples like this:

(A, {(1L,1L),(2L,2L),(3L,6L),(5L,1L)})

By counting the tuples in the bag, you can tell whether any value is
missing (fewer than five tuples means at least one is), and the sorted
keys show which.

Here is the script:

L = load 'X' using PigStorage(',') as (a:chararray, b:long, c:long);
G = group L by a;
F = foreach G { O = order L by b; generate group, O.(b, c); }
dump F;
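
One way to get from the grouped bag above to the zero-filled tuple asked for
in the original question, sketched in plain Python outside of Pig. The name
fill_missing and its lo, hi and default parameters are hypothetical; the 1 to 5
range comes from the question.

```python
def fill_missing(group, bag, lo=1, hi=5, default=0):
    """Return (group, v_lo, ..., v_hi), using `default` for absent keys."""
    by_key = dict(bag)  # (b, c) pairs become a b -> c lookup
    values = tuple(by_key.get(k, default) for k in range(lo, hi + 1))
    return (group,) + values


# One record of F: the group key and its bag of (b, c) tuples.
record = ("A", [(1, 1), (2, 2), (3, 6), (5, 1)])

print(fill_missing(*record))  # ('A', 1, 2, 6, 0, 1)
```

The position of each value in the result encodes its key, so the 0 in the
fourth slot is what signals that no words occurred 4 times.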

Thanks
-Richard



Re: Help identifying missing value

Posted by Greg Langmead <gl...@languageweaver.com>.
My example of a combined tuple should have A and not $-NT or $NT, and same for the map:

(A, 1L, 2L, 6L, 0L, 1L)

(A, 1L#1L, 2L#2L, 3L#6L, 5L#1L)
