You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@pig.apache.org by Christian <en...@gmail.com> on 2011/05/06 23:14:26 UTC

Working with an unknown number of values

I am sorry if this has been asked in the past. I can't seem to find
information on it.

I have two questions, but they are somewhat related.

#1) Let's say you are tracking messages and extracting the hash tags from
the message and storing them as one field (#hash1#hash2#hash3). This means
you might have a line that looks something like the following:
      2343    2011-05-06T03:04:00.000Z    username
some+message+goes+here#with+#hash+#tags    #with#hash#tags   some    other
 info

How can I get the # of tweets per hash tag? Also, how can I get the # of
tweets per user per hash tag?
I know I can use the STRSPLIT function to split on '#'. That will give me a
bag of hash tags. How can I then group by these such that each hash tag has
a set of tweets?


#2) Let's say you have a field that has a fairly small, but still unknown
number of unique values (say between 20-5). I know I can group by these
fields to get a count by doing something like so:

A = LOAD '/some/dir' Using PigStorage (date, directive);

B = GROUP A by (date, directive);

C = FOREACH B GENERATE FLATTEN(group), COUNT(A.date);

    But now I want to end up something like the following:

2011-05-01    DIRECTIVE1    32423    DIRECTIVE2    3433    DIRECTIVE3
 1983

If I knew the directives ahead of time, I know I can do something like the
following:

D = GROUP C BY date;

E = FOREACH D {
     DIRECTIVE1 = FILTER type_count by directive == 'DIRECTIVE1';
     DIRECTIVE2 = FILTER type_count by directive == 'DIRECTIVE2';
     DIRECTIVE3 = FILTER type_count by directive == 'DIRECTIVE3';
        GENERATE group, 'DIRECTIVE1', COUNT(DIRECTIVE1.date), 'DIRECTIVE2',
COUNT(DIRECTIVE2.date), 'DIRECTIVE3', COUNT(DIRECTIVE3.date);
}

But how do I do this w/o having to hardcode the filters? Am I thinking about
this all wrong?

Thanks very much for you help,
Christian

Re: Working with an unknown number of values

Posted by jacob <ja...@gmail.com>.

On Fri, 2011-05-06 at 16:06 -0600, Christian wrote:
> Thank you for taking the time to explain this to me Jacob!
> 
> Am I stuck with hard-coding for my other question?
> 
> Instead of:
> 2011-05-01    DIRECTIVE1    32423    DIRECTIVE2    3433    DIRECTIVE3
>  1983
> --
> 2011-05-01    32423    3433    1983
> 
> would also do as long as I could count on the column order.
> 
The best you can do here is to do a GROUP BY on the date (like Xiaomeng
Wan indicated) after you get the counts. Then, I think theres a UDF in
contrib that allows you to go from a bag to a tuple.

Best of luck,

--jacob
@thedatachef

> 
> Thanks,
> Christian
> On Fri, May 6, 2011 at 3:59 PM, jacob <ja...@gmail.com> wrote:
> 
> > On Fri, 2011-05-06 at 15:38 -0600, Christian wrote:
> > > >
> > > > > #1) Let's say you are tracking messages and extracting the hash tags
> > from
> > > > > the message and storing them as one field (#hash1#hash2#hash3). This
> > > > means
> > > > > you might have a line that looks something like the following:
> > > > >       2343    2011-05-06T03:04:00.000Z    username
> > > > > some+message+goes+here#with+#hash+#tags    #with#hash#tags   some
> > > >  other
> > > > >  info
> > > > >
> > > > > How can I get the # of tweets per hash tag? Also, how can I get the #
> > of
> > > > > tweets per user per hash tag?
> > > > > I know I can use the STRSPLIT function to split on '#'. That will
> > give me
> > > > a
> > > > > bag of hash tags. How can I then group by these such that each hash
> > tag
> > > > has
> > > > > a set of tweets?
> > > > You will need to 'FLATTEN' the bag of hashtags then do a 'GROUP BY' on
> > > > the hashtag itself.
> > > >
> > >
> > > If each message has an unknown number of hashtags, will a 'FLATTEN' given
> > me
> > > an unknown # of fields? If so, how do I know which field to group by? I
> > > don't want to group by messages that have the exact hash tags. I want all
> > > messages that have one of the hash tags.
> >
> > Oh, that's right, STRSPLIT (rather uselessly) yields a nested tuple and
> > NOT a bag. If you could get a bag then you could do the following (I'm
> > throwing out some fields for now):
> >
> > A = LOAD 'tweets_and_meta' AS (text:chararray, hashtags:chararray);
> > B = FOREACH A GENERATE text, FLATTEN(MySplittingUDF(hashtags)) AS
> > hashtag;
> > C = GROUP B BY hastag;
> >
> > Then C will contain a key (the hashtag) and a bag containing all the
> > tweets with that hashtag. You'll have to write 'MySplittingUDF' yourself
> > to do the same as STRSPLIT but that returns a bag instead.
> >
> > ie.
> >
> > #foobar tweet text,#foobar
> > this tweet has #two #hashtags,#two#hashtags
> > another #foobar tweet,#foobar
> >
> > will yield:
> >
> > #foobar,   {(#foobar tweet text, #foobar),(another #foobar tweet,
> > #foobar)}
> > #two,      {(this tweet has #two #hashtags, #two)}
> > #hashtags, {(this tweet has #two #hashtags, #hashtags)}
> >
> >
> > >
> > >
> > > > >     But now I want to end up something like the following:
> > > >
> > > >
> > > > > 2011-05-01    DIRECTIVE1    32423    DIRECTIVE2    3433    DIRECTIVE3
> > > > >  1983
> > > > >
> > > > > If I knew the directives ahead of time, I know I can do something
> > like
> > > > the
> > > > > following:
> > > > >
> > > > > D = GROUP C BY date;
> > > > >
> > > > > E = FOREACH D {
> > > > >      DIRECTIVE1 = FILTER type_count by directive == 'DIRECTIVE1';
> > > > >      DIRECTIVE2 = FILTER type_count by directive == 'DIRECTIVE2';
> > > > >      DIRECTIVE3 = FILTER type_count by directive == 'DIRECTIVE3';
> > > > >         GENERATE group, 'DIRECTIVE1', COUNT(DIRECTIVE1.date),
> > > > 'DIRECTIVE2',
> > > > > COUNT(DIRECTIVE2.date), 'DIRECTIVE3', COUNT(DIRECTIVE3.date);
> > > > > }
> > > > >
> > > > > But how do I do this w/o having to hardcode the filters? Am I
> > thinking
> > > > about
> > > > > this all wrong?
> > > > >
> > > > It's really a matter of how you structure your data ahead of time.
> > > > Imagine the data looking like this instead (call it X):
> > > >
> > > > 201101,directive1
> > > > 201101,directive1
> > > > 201101,directive2
> > > > 201101,directive2
> > > > 201101,directive2
> > > > 201101,directive3
> > > > 201102,directive2
> > > > 201102,directive4
> > > > 201103,directive1
> > > >
> > > > This is how my data looks (row and column wise)
> > >
> > > >
> > > > then, a simple:
> > > >
> > > > Y = GROUP X BY (date,directive);
> > > > Z = FOREACH Y GENERATE FLATTEN(group) AS (date,directive), COUNT(X) AS
> > > > num_occurrences;
> > > >
> > > > would result in:
> > > >
> > > > 201101,directive1,2
> > > > 201101,directive2,3
> > > > 201101,directive3,1
> > > > 201102,directive2,1
> > > > 201102,directive4,1
> > > > 201103,directive1,1
> > > >
> > > > At least, that's what it _seems_ like you're asking for.
> > > >
> > > > I've gotten that far. I'm actually asking for the being able to put
> > those
> > > into columns and not rows.
> > >
> > > >
> > > > --jacob
> > > > @thedatachef
> > > >
> > > > Thanks Jacob!
> > >
> > > -Christian
> > >
> > > >
> > > > > Thanks very much for you help,
> > > > > Christian
> > > >
> > > >
> > > >
> >
> >
> >

Re: Working with an unknown number of values

Posted by Christian <en...@gmail.com>.

Thank you for taking the time to explain this to me Jacob!

Am I stuck with hard-coding for my other question?

Instead of:
2011-05-01    DIRECTIVE1    32423    DIRECTIVE2    3433    DIRECTIVE3
 1983
--
2011-05-01    32423    3433    1983

would also do as long as I could count on the column order.


Thanks,
Christian
On Fri, May 6, 2011 at 3:59 PM, jacob <ja...@gmail.com> wrote:

> On Fri, 2011-05-06 at 15:38 -0600, Christian wrote:
> > >
> > > > #1) Let's say you are tracking messages and extracting the hash tags
> from
> > > > the message and storing them as one field (#hash1#hash2#hash3). This
> > > means
> > > > you might have a line that looks something like the following:
> > > >       2343    2011-05-06T03:04:00.000Z    username
> > > > some+message+goes+here#with+#hash+#tags    #with#hash#tags   some
> > >  other
> > > >  info
> > > >
> > > > How can I get the # of tweets per hash tag? Also, how can I get the #
> of
> > > > tweets per user per hash tag?
> > > > I know I can use the STRSPLIT function to split on '#'. That will
> give me
> > > a
> > > > bag of hash tags. How can I then group by these such that each hash
> tag
> > > has
> > > > a set of tweets?
> > > You will need to 'FLATTEN' the bag of hashtags then do a 'GROUP BY' on
> > > the hashtag itself.
> > >
> >
> > If each message has an unknown number of hashtags, will a 'FLATTEN' given
> me
> > an unknown # of fields? If so, how do I know which field to group by? I
> > don't want to group by messages that have the exact hash tags. I want all
> > messages that have one of the hash tags.
>
> Oh, that's right, STRSPLIT (rather uselessly) yields a nested tuple and
> NOT a bag. If you could get a bag then you could do the following (I'm
> throwing out some fields for now):
>
> A = LOAD 'tweets_and_meta' AS (text:chararray, hashtags:chararray);
> B = FOREACH A GENERATE text, FLATTEN(MySplittingUDF(hashtags)) AS
> hashtag;
> C = GROUP B BY hastag;
>
> Then C will contain a key (the hashtag) and a bag containing all the
> tweets with that hashtag. You'll have to write 'MySplittingUDF' yourself
> to do the same as STRSPLIT but that returns a bag instead.
>
> ie.
>
> #foobar tweet text,#foobar
> this tweet has #two #hashtags,#two#hashtags
> another #foobar tweet,#foobar
>
> will yield:
>
> #foobar,   {(#foobar tweet text, #foobar),(another #foobar tweet,
> #foobar)}
> #two,      {(this tweet has #two #hashtags, #two)}
> #hashtags, {(this tweet has #two #hashtags, #hashtags)}
>
>
> >
> >
> > > >     But now I want to end up something like the following:
> > >
> > >
> > > > 2011-05-01    DIRECTIVE1    32423    DIRECTIVE2    3433    DIRECTIVE3
> > > >  1983
> > > >
> > > > If I knew the directives ahead of time, I know I can do something
> like
> > > the
> > > > following:
> > > >
> > > > D = GROUP C BY date;
> > > >
> > > > E = FOREACH D {
> > > >      DIRECTIVE1 = FILTER type_count by directive == 'DIRECTIVE1';
> > > >      DIRECTIVE2 = FILTER type_count by directive == 'DIRECTIVE2';
> > > >      DIRECTIVE3 = FILTER type_count by directive == 'DIRECTIVE3';
> > > >         GENERATE group, 'DIRECTIVE1', COUNT(DIRECTIVE1.date),
> > > 'DIRECTIVE2',
> > > > COUNT(DIRECTIVE2.date), 'DIRECTIVE3', COUNT(DIRECTIVE3.date);
> > > > }
> > > >
> > > > But how do I do this w/o having to hardcode the filters? Am I
> thinking
> > > about
> > > > this all wrong?
> > > >
> > > It's really a matter of how you structure your data ahead of time.
> > > Imagine the data looking like this instead (call it X):
> > >
> > > 201101,directive1
> > > 201101,directive1
> > > 201101,directive2
> > > 201101,directive2
> > > 201101,directive2
> > > 201101,directive3
> > > 201102,directive2
> > > 201102,directive4
> > > 201103,directive1
> > >
> > > This is how my data looks (row and column wise)
> >
> > >
> > > then, a simple:
> > >
> > > Y = GROUP X BY (date,directive);
> > > Z = FOREACH Y GENERATE FLATTEN(group) AS (date,directive), COUNT(X) AS
> > > num_occurrences;
> > >
> > > would result in:
> > >
> > > 201101,directive1,2
> > > 201101,directive2,3
> > > 201101,directive3,1
> > > 201102,directive2,1
> > > 201102,directive4,1
> > > 201103,directive1,1
> > >
> > > At least, that's what it _seems_ like you're asking for.
> > >
> > > I've gotten that far. I'm actually asking for the being able to put
> those
> > into columns and not rows.
> >
> > >
> > > --jacob
> > > @thedatachef
> > >
> > > Thanks Jacob!
> >
> > -Christian
> >
> > >
> > > > Thanks very much for you help,
> > > > Christian
> > >
> > >
> > >
>
>
>

Re: Working with an unknown number of values

Posted by Alan Gates <ga...@yahoo-inc.com>.

TOKENIZE takes a string and returns a bag.  It's issue is right now it  
only allows you to split on whitespace.  It would make sense to  
generalize this to take a delimiter.

Alan.

On May 7, 2011, at 7:55 PM, Jacob Perkins wrote:

> Dmitriy,
>
>   I see your point. It would definitely be nice to have a builtin for
> returning a bag though. I'd actually be happy if
> TOBAG(FLATTEN(STRSPLIT(X,','))) worked.
>
> --jacob
> @thedatachef
>
> On Sat, 2011-05-07 at 18:41 -0700, Dmitriy Ryaboy wrote:
>> FWIW -- the reason STRSPLIT returns a Tuple is that the more common
>> case is thought to be splitting a string of a known format and trying
>> to get some part of it.
>>
>> so, "foreach address_book generate STRSPLIT(phone_number, '-') as
>> (area_code, top_3, bottom_4);"
>>
>> RegexExtractAll (whatever it's called these days) should return a  
>> bag, iirc.
>>
>> D
>>
>> On Fri, May 6, 2011 at 2:59 PM, jacob <ja...@gmail.com>  
>> wrote:
>>> On Fri, 2011-05-06 at 15:38 -0600, Christian wrote:
>>>>>
>>>>>> #1) Let's say you are tracking messages and extracting the hash  
>>>>>> tags from
>>>>>> the message and storing them as one field (#hash1#hash2#hash3).  
>>>>>> This
>>>>> means
>>>>>> you might have a line that looks something like the following:
>>>>>>      2343    2011-05-06T03:04:00.000Z    username
>>>>>> some+message+goes+here#with+#hash+#tags    #with#hash#tags   some
>>>>> other
>>>>>> info
>>>>>>
>>>>>> How can I get the # of tweets per hash tag? Also, how can I get  
>>>>>> the # of
>>>>>> tweets per user per hash tag?
>>>>>> I know I can use the STRSPLIT function to split on '#'. That  
>>>>>> will give me
>>>>> a
>>>>>> bag of hash tags. How can I then group by these such that each  
>>>>>> hash tag
>>>>> has
>>>>>> a set of tweets?
>>>>> You will need to 'FLATTEN' the bag of hashtags then do a 'GROUP  
>>>>> BY' on
>>>>> the hashtag itself.
>>>>>
>>>>
>>>> If each message has an unknown number of hashtags, will a  
>>>> 'FLATTEN' given me
>>>> an unknown # of fields? If so, how do I know which field to group  
>>>> by? I
>>>> don't want to group by messages that have the exact hash tags. I  
>>>> want all
>>>> messages that have one of the hash tags.
>>>
>>> Oh, that's right, STRSPLIT (rather uselessly) yields a nested  
>>> tuple and
>>> NOT a bag. If you could get a bag then you could do the following  
>>> (I'm
>>> throwing out some fields for now):
>>>
>>> A = LOAD 'tweets_and_meta' AS (text:chararray, hashtags:chararray);
>>> B = FOREACH A GENERATE text, FLATTEN(MySplittingUDF(hashtags)) AS
>>> hashtag;
>>> C = GROUP B BY hastag;
>>>
>>> Then C will contain a key (the hashtag) and a bag containing all the
>>> tweets with that hashtag. You'll have to write 'MySplittingUDF'  
>>> yourself
>>> to do the same as STRSPLIT but that returns a bag instead.
>>>
>>> ie.
>>>
>>> #foobar tweet text,#foobar
>>> this tweet has #two #hashtags,#two#hashtags
>>> another #foobar tweet,#foobar
>>>
>>> will yield:
>>>
>>> #foobar,   {(#foobar tweet text, #foobar),(another #foobar tweet,
>>> #foobar)}
>>> #two,      {(this tweet has #two #hashtags, #two)}
>>> #hashtags, {(this tweet has #two #hashtags, #hashtags)}
>>>
>>>
>>>>
>>>>
>>>>>>    But now I want to end up something like the following:
>>>>>
>>>>>
>>>>>> 2011-05-01    DIRECTIVE1    32423    DIRECTIVE2    3433     
>>>>>> DIRECTIVE3
>>>>>> 1983
>>>>>>
>>>>>> If I knew the directives ahead of time, I know I can do  
>>>>>> something like
>>>>> the
>>>>>> following:
>>>>>>
>>>>>> D = GROUP C BY date;
>>>>>>
>>>>>> E = FOREACH D {
>>>>>>     DIRECTIVE1 = FILTER type_count by directive == 'DIRECTIVE1';
>>>>>>     DIRECTIVE2 = FILTER type_count by directive == 'DIRECTIVE2';
>>>>>>     DIRECTIVE3 = FILTER type_count by directive == 'DIRECTIVE3';
>>>>>>        GENERATE group, 'DIRECTIVE1', COUNT(DIRECTIVE1.date),
>>>>> 'DIRECTIVE2',
>>>>>> COUNT(DIRECTIVE2.date), 'DIRECTIVE3', COUNT(DIRECTIVE3.date);
>>>>>> }
>>>>>>
>>>>>> But how do I do this w/o having to hardcode the filters? Am I  
>>>>>> thinking
>>>>> about
>>>>>> this all wrong?
>>>>>>
>>>>> It's really a matter of how you structure your data ahead of time.
>>>>> Imagine the data looking like this instead (call it X):
>>>>>
>>>>> 201101,directive1
>>>>> 201101,directive1
>>>>> 201101,directive2
>>>>> 201101,directive2
>>>>> 201101,directive2
>>>>> 201101,directive3
>>>>> 201102,directive2
>>>>> 201102,directive4
>>>>> 201103,directive1
>>>>>
>>>>> This is how my data looks (row and column wise)
>>>>
>>>>>
>>>>> then, a simple:
>>>>>
>>>>> Y = GROUP X BY (date,directive);
>>>>> Z = FOREACH Y GENERATE FLATTEN(group) AS (date,directive),  
>>>>> COUNT(X) AS
>>>>> num_occurrences;
>>>>>
>>>>> would result in:
>>>>>
>>>>> 201101,directive1,2
>>>>> 201101,directive2,3
>>>>> 201101,directive3,1
>>>>> 201102,directive2,1
>>>>> 201102,directive4,1
>>>>> 201103,directive1,1
>>>>>
>>>>> At least, that's what it _seems_ like you're asking for.
>>>>>
>>>>> I've gotten that far. I'm actually asking for the being able to  
>>>>> put those
>>>> into columns and not rows.
>>>>
>>>>>
>>>>> --jacob
>>>>> @thedatachef
>>>>>
>>>>> Thanks Jacob!
>>>>
>>>> -Christian
>>>>
>>>>>
>>>>>> Thanks very much for you help,
>>>>>> Christian
>>>>>
>>>>>
>>>>>
>>>
>>>
>>>
>
>

Re: Working with an unknown number of values

Posted by Jacob Perkins <ja...@gmail.com>.

Dmitriy,
 
   I see your point. It would definitely be nice to have a builtin for
returning a bag though. I'd actually be happy if
TOBAG(FLATTEN(STRSPLIT(X,','))) worked.

--jacob
@thedatachef

On Sat, 2011-05-07 at 18:41 -0700, Dmitriy Ryaboy wrote:
> FWIW -- the reason STRSPLIT returns a Tuple is that the more common
> case is thought to be splitting a string of a known format and trying
> to get some part of it.
> 
> so, "foreach address_book generate STRSPLIT(phone_number, '-') as
> (area_code, top_3, bottom_4);"
> 
> RegexExtractAll (whatever it's called these days) should return a bag, iirc.
> 
> D
> 
> On Fri, May 6, 2011 at 2:59 PM, jacob <ja...@gmail.com> wrote:
> > On Fri, 2011-05-06 at 15:38 -0600, Christian wrote:
> >> >
> >> > > #1) Let's say you are tracking messages and extracting the hash tags from
> >> > > the message and storing them as one field (#hash1#hash2#hash3). This
> >> > means
> >> > > you might have a line that looks something like the following:
> >> > >       2343    2011-05-06T03:04:00.000Z    username
> >> > > some+message+goes+here#with+#hash+#tags    #with#hash#tags   some
> >> >  other
> >> > >  info
> >> > >
> >> > > How can I get the # of tweets per hash tag? Also, how can I get the # of
> >> > > tweets per user per hash tag?
> >> > > I know I can use the STRSPLIT function to split on '#'. That will give me
> >> > a
> >> > > bag of hash tags. How can I then group by these such that each hash tag
> >> > has
> >> > > a set of tweets?
> >> > You will need to 'FLATTEN' the bag of hashtags then do a 'GROUP BY' on
> >> > the hashtag itself.
> >> >
> >>
> >> If each message has an unknown number of hashtags, will a 'FLATTEN' given me
> >> an unknown # of fields? If so, how do I know which field to group by? I
> >> don't want to group by messages that have the exact hash tags. I want all
> >> messages that have one of the hash tags.
> >
> > Oh, that's right, STRSPLIT (rather uselessly) yields a nested tuple and
> > NOT a bag. If you could get a bag then you could do the following (I'm
> > throwing out some fields for now):
> >
> > A = LOAD 'tweets_and_meta' AS (text:chararray, hashtags:chararray);
> > B = FOREACH A GENERATE text, FLATTEN(MySplittingUDF(hashtags)) AS
> > hashtag;
> > C = GROUP B BY hastag;
> >
> > Then C will contain a key (the hashtag) and a bag containing all the
> > tweets with that hashtag. You'll have to write 'MySplittingUDF' yourself
> > to do the same as STRSPLIT but that returns a bag instead.
> >
> > ie.
> >
> > #foobar tweet text,#foobar
> > this tweet has #two #hashtags,#two#hashtags
> > another #foobar tweet,#foobar
> >
> > will yield:
> >
> > #foobar,   {(#foobar tweet text, #foobar),(another #foobar tweet,
> > #foobar)}
> > #two,      {(this tweet has #two #hashtags, #two)}
> > #hashtags, {(this tweet has #two #hashtags, #hashtags)}
> >
> >
> >>
> >>
> >> > >     But now I want to end up something like the following:
> >> >
> >> >
> >> > > 2011-05-01    DIRECTIVE1    32423    DIRECTIVE2    3433    DIRECTIVE3
> >> > >  1983
> >> > >
> >> > > If I knew the directives ahead of time, I know I can do something like
> >> > the
> >> > > following:
> >> > >
> >> > > D = GROUP C BY date;
> >> > >
> >> > > E = FOREACH D {
> >> > >      DIRECTIVE1 = FILTER type_count by directive == 'DIRECTIVE1';
> >> > >      DIRECTIVE2 = FILTER type_count by directive == 'DIRECTIVE2';
> >> > >      DIRECTIVE3 = FILTER type_count by directive == 'DIRECTIVE3';
> >> > >         GENERATE group, 'DIRECTIVE1', COUNT(DIRECTIVE1.date),
> >> > 'DIRECTIVE2',
> >> > > COUNT(DIRECTIVE2.date), 'DIRECTIVE3', COUNT(DIRECTIVE3.date);
> >> > > }
> >> > >
> >> > > But how do I do this w/o having to hardcode the filters? Am I thinking
> >> > about
> >> > > this all wrong?
> >> > >
> >> > It's really a matter of how you structure your data ahead of time.
> >> > Imagine the data looking like this instead (call it X):
> >> >
> >> > 201101,directive1
> >> > 201101,directive1
> >> > 201101,directive2
> >> > 201101,directive2
> >> > 201101,directive2
> >> > 201101,directive3
> >> > 201102,directive2
> >> > 201102,directive4
> >> > 201103,directive1
> >> >
> >> > This is how my data looks (row and column wise)
> >>
> >> >
> >> > then, a simple:
> >> >
> >> > Y = GROUP X BY (date,directive);
> >> > Z = FOREACH Y GENERATE FLATTEN(group) AS (date,directive), COUNT(X) AS
> >> > num_occurrences;
> >> >
> >> > would result in:
> >> >
> >> > 201101,directive1,2
> >> > 201101,directive2,3
> >> > 201101,directive3,1
> >> > 201102,directive2,1
> >> > 201102,directive4,1
> >> > 201103,directive1,1
> >> >
> >> > At least, that's what it _seems_ like you're asking for.
> >> >
> >> > I've gotten that far. I'm actually asking for the being able to put those
> >> into columns and not rows.
> >>
> >> >
> >> > --jacob
> >> > @thedatachef
> >> >
> >> > Thanks Jacob!
> >>
> >> -Christian
> >>
> >> >
> >> > > Thanks very much for you help,
> >> > > Christian
> >> >
> >> >
> >> >
> >
> >
> >

Re: Working with an unknown number of values

Posted by Dmitriy Ryaboy <dv...@gmail.com>.

FWIW -- the reason STRSPLIT returns a Tuple is that the more common
case is thought to be splitting a string of a known format and trying
to get some part of it.

so, "foreach address_book generate STRSPLIT(phone_number, '-') as
(area_code, top_3, bottom_4);"

RegexExtractAll (whatever it's called these days) should return a bag, iirc.

D

On Fri, May 6, 2011 at 2:59 PM, jacob <ja...@gmail.com> wrote:
> On Fri, 2011-05-06 at 15:38 -0600, Christian wrote:
>> >
>> > > #1) Let's say you are tracking messages and extracting the hash tags from
>> > > the message and storing them as one field (#hash1#hash2#hash3). This
>> > means
>> > > you might have a line that looks something like the following:
>> > >       2343    2011-05-06T03:04:00.000Z    username
>> > > some+message+goes+here#with+#hash+#tags    #with#hash#tags   some
>> >  other
>> > >  info
>> > >
>> > > How can I get the # of tweets per hash tag? Also, how can I get the # of
>> > > tweets per user per hash tag?
>> > > I know I can use the STRSPLIT function to split on '#'. That will give me
>> > a
>> > > bag of hash tags. How can I then group by these such that each hash tag
>> > has
>> > > a set of tweets?
>> > You will need to 'FLATTEN' the bag of hashtags then do a 'GROUP BY' on
>> > the hashtag itself.
>> >
>>
>> If each message has an unknown number of hashtags, will a 'FLATTEN' given me
>> an unknown # of fields? If so, how do I know which field to group by? I
>> don't want to group by messages that have the exact hash tags. I want all
>> messages that have one of the hash tags.
>
> Oh, that's right, STRSPLIT (rather uselessly) yields a nested tuple and
> NOT a bag. If you could get a bag then you could do the following (I'm
> throwing out some fields for now):
>
> A = LOAD 'tweets_and_meta' AS (text:chararray, hashtags:chararray);
> B = FOREACH A GENERATE text, FLATTEN(MySplittingUDF(hashtags)) AS
> hashtag;
> C = GROUP B BY hastag;
>
> Then C will contain a key (the hashtag) and a bag containing all the
> tweets with that hashtag. You'll have to write 'MySplittingUDF' yourself
> to do the same as STRSPLIT but that returns a bag instead.
>
> ie.
>
> #foobar tweet text,#foobar
> this tweet has #two #hashtags,#two#hashtags
> another #foobar tweet,#foobar
>
> will yield:
>
> #foobar,   {(#foobar tweet text, #foobar),(another #foobar tweet,
> #foobar)}
> #two,      {(this tweet has #two #hashtags, #two)}
> #hashtags, {(this tweet has #two #hashtags, #hashtags)}
>
>
>>
>>
>> > >     But now I want to end up something like the following:
>> >
>> >
>> > > 2011-05-01    DIRECTIVE1    32423    DIRECTIVE2    3433    DIRECTIVE3
>> > >  1983
>> > >
>> > > If I knew the directives ahead of time, I know I can do something like
>> > the
>> > > following:
>> > >
>> > > D = GROUP C BY date;
>> > >
>> > > E = FOREACH D {
>> > >      DIRECTIVE1 = FILTER type_count by directive == 'DIRECTIVE1';
>> > >      DIRECTIVE2 = FILTER type_count by directive == 'DIRECTIVE2';
>> > >      DIRECTIVE3 = FILTER type_count by directive == 'DIRECTIVE3';
>> > >         GENERATE group, 'DIRECTIVE1', COUNT(DIRECTIVE1.date),
>> > 'DIRECTIVE2',
>> > > COUNT(DIRECTIVE2.date), 'DIRECTIVE3', COUNT(DIRECTIVE3.date);
>> > > }
>> > >
>> > > But how do I do this w/o having to hardcode the filters? Am I thinking
>> > about
>> > > this all wrong?
>> > >
>> > It's really a matter of how you structure your data ahead of time.
>> > Imagine the data looking like this instead (call it X):
>> >
>> > 201101,directive1
>> > 201101,directive1
>> > 201101,directive2
>> > 201101,directive2
>> > 201101,directive2
>> > 201101,directive3
>> > 201102,directive2
>> > 201102,directive4
>> > 201103,directive1
>> >
>> > This is how my data looks (row and column wise)
>>
>> >
>> > then, a simple:
>> >
>> > Y = GROUP X BY (date,directive);
>> > Z = FOREACH Y GENERATE FLATTEN(group) AS (date,directive), COUNT(X) AS
>> > num_occurrences;
>> >
>> > would result in:
>> >
>> > 201101,directive1,2
>> > 201101,directive2,3
>> > 201101,directive3,1
>> > 201102,directive2,1
>> > 201102,directive4,1
>> > 201103,directive1,1
>> >
>> > At least, that's what it _seems_ like you're asking for.
>> >
>> > I've gotten that far. I'm actually asking for the being able to put those
>> into columns and not rows.
>>
>> >
>> > --jacob
>> > @thedatachef
>> >
>> > Thanks Jacob!
>>
>> -Christian
>>
>> >
>> > > Thanks very much for you help,
>> > > Christian
>> >
>> >
>> >
>
>
>

Re: Working with an unknown number of values

Posted by jacob <ja...@gmail.com>.

On Fri, 2011-05-06 at 15:38 -0600, Christian wrote:
> >
> > > #1) Let's say you are tracking messages and extracting the hash tags from
> > > the message and storing them as one field (#hash1#hash2#hash3). This
> > means
> > > you might have a line that looks something like the following:
> > >       2343    2011-05-06T03:04:00.000Z    username
> > > some+message+goes+here#with+#hash+#tags    #with#hash#tags   some
> >  other
> > >  info
> > >
> > > How can I get the # of tweets per hash tag? Also, how can I get the # of
> > > tweets per user per hash tag?
> > > I know I can use the STRSPLIT function to split on '#'. That will give me
> > a
> > > bag of hash tags. How can I then group by these such that each hash tag
> > has
> > > a set of tweets?
> > You will need to 'FLATTEN' the bag of hashtags then do a 'GROUP BY' on
> > the hashtag itself.
> >
> 
> If each message has an unknown number of hashtags, will a 'FLATTEN' given me
> an unknown # of fields? If so, how do I know which field to group by? I
> don't want to group by messages that have the exact hash tags. I want all
> messages that have one of the hash tags.

Oh, that's right, STRSPLIT (rather uselessly) yields a nested tuple and
NOT a bag. If you could get a bag then you could do the following (I'm
throwing out some fields for now):

A = LOAD 'tweets_and_meta' AS (text:chararray, hashtags:chararray);
B = FOREACH A GENERATE text, FLATTEN(MySplittingUDF(hashtags)) AS
hashtag;
C = GROUP B BY hastag;

Then C will contain a key (the hashtag) and a bag containing all the
tweets with that hashtag. You'll have to write 'MySplittingUDF' yourself
to do the same as STRSPLIT but that returns a bag instead.

ie.

#foobar tweet text,#foobar
this tweet has #two #hashtags,#two#hashtags
another #foobar tweet,#foobar

will yield:

#foobar,   {(#foobar tweet text, #foobar),(another #foobar tweet,
#foobar)}
#two,      {(this tweet has #two #hashtags, #two)}
#hashtags, {(this tweet has #two #hashtags, #hashtags)}


> 
> 
> > >     But now I want to end up something like the following:
> >
> >
> > > 2011-05-01    DIRECTIVE1    32423    DIRECTIVE2    3433    DIRECTIVE3
> > >  1983
> > >
> > > If I knew the directives ahead of time, I know I can do something like
> > the
> > > following:
> > >
> > > D = GROUP C BY date;
> > >
> > > E = FOREACH D {
> > >      DIRECTIVE1 = FILTER type_count by directive == 'DIRECTIVE1';
> > >      DIRECTIVE2 = FILTER type_count by directive == 'DIRECTIVE2';
> > >      DIRECTIVE3 = FILTER type_count by directive == 'DIRECTIVE3';
> > >         GENERATE group, 'DIRECTIVE1', COUNT(DIRECTIVE1.date),
> > 'DIRECTIVE2',
> > > COUNT(DIRECTIVE2.date), 'DIRECTIVE3', COUNT(DIRECTIVE3.date);
> > > }
> > >
> > > But how do I do this w/o having to hardcode the filters? Am I thinking
> > about
> > > this all wrong?
> > >
> > It's really a matter of how you structure your data ahead of time.
> > Imagine the data looking like this instead (call it X):
> >
> > 201101,directive1
> > 201101,directive1
> > 201101,directive2
> > 201101,directive2
> > 201101,directive2
> > 201101,directive3
> > 201102,directive2
> > 201102,directive4
> > 201103,directive1
> >
> > This is how my data looks (row and column wise)
> 
> >
> > then, a simple:
> >
> > Y = GROUP X BY (date,directive);
> > Z = FOREACH Y GENERATE FLATTEN(group) AS (date,directive), COUNT(X) AS
> > num_occurrences;
> >
> > would result in:
> >
> > 201101,directive1,2
> > 201101,directive2,3
> > 201101,directive3,1
> > 201102,directive2,1
> > 201102,directive4,1
> > 201103,directive1,1
> >
> > At least, that's what it _seems_ like you're asking for.
> >
> > I've gotten that far. I'm actually asking for the being able to put those
> into columns and not rows.
> 
> >
> > --jacob
> > @thedatachef
> >
> > Thanks Jacob!
> 
> -Christian
> 
> >
> > > Thanks very much for you help,
> > > Christian
> >
> >
> >

Re: Working with an unknown number of values

Posted by Christian <en...@gmail.com>.

>
> > #1) Let's say you are tracking messages and extracting the hash tags from
> > the message and storing them as one field (#hash1#hash2#hash3). This
> means
> > you might have a line that looks something like the following:
> >       2343    2011-05-06T03:04:00.000Z    username
> > some+message+goes+here#with+#hash+#tags    #with#hash#tags   some
>  other
> >  info
> >
> > How can I get the # of tweets per hash tag? Also, how can I get the # of
> > tweets per user per hash tag?
> > I know I can use the STRSPLIT function to split on '#'. That will give me
> a
> > bag of hash tags. How can I then group by these such that each hash tag
> has
> > a set of tweets?
> You will need to 'FLATTEN' the bag of hashtags then do a 'GROUP BY' on
> the hashtag itself.
>

If each message has an unknown number of hashtags, will a 'FLATTEN' given me
an unknown # of fields? If so, how do I know which field to group by? I
don't want to group by messages that have the exact hash tags. I want all
messages that have one of the hash tags.


> >     But now I want to end up something like the following:
>
>
> > 2011-05-01    DIRECTIVE1    32423    DIRECTIVE2    3433    DIRECTIVE3
> >  1983
> >
> > If I knew the directives ahead of time, I know I can do something like
> the
> > following:
> >
> > D = GROUP C BY date;
> >
> > E = FOREACH D {
> >      DIRECTIVE1 = FILTER type_count by directive == 'DIRECTIVE1';
> >      DIRECTIVE2 = FILTER type_count by directive == 'DIRECTIVE2';
> >      DIRECTIVE3 = FILTER type_count by directive == 'DIRECTIVE3';
> >         GENERATE group, 'DIRECTIVE1', COUNT(DIRECTIVE1.date),
> 'DIRECTIVE2',
> > COUNT(DIRECTIVE2.date), 'DIRECTIVE3', COUNT(DIRECTIVE3.date);
> > }
> >
> > But how do I do this w/o having to hardcode the filters? Am I thinking
> about
> > this all wrong?
> >
> It's really a matter of how you structure your data ahead of time.
> Imagine the data looking like this instead (call it X):
>
> 201101,directive1
> 201101,directive1
> 201101,directive2
> 201101,directive2
> 201101,directive2
> 201101,directive3
> 201102,directive2
> 201102,directive4
> 201103,directive1
>
> This is how my data looks (row and column wise)

>
> then, a simple:
>
> Y = GROUP X BY (date,directive);
> Z = FOREACH Y GENERATE FLATTEN(group) AS (date,directive), COUNT(X) AS
> num_occurrences;
>
> would result in:
>
> 201101,directive1,2
> 201101,directive2,3
> 201101,directive3,1
> 201102,directive2,1
> 201102,directive4,1
> 201103,directive1,1
>
> At least, that's what it _seems_ like you're asking for.
>
> I've gotten that far. I'm actually asking for the being able to put those
into columns and not rows.

>
> --jacob
> @thedatachef
>
> Thanks Jacob!

-Christian

>
> > Thanks very much for you help,
> > Christian
>
>
>

Re: Working with an unknown number of values

Posted by jacob <ja...@gmail.com>.

Christian,

I've answered inline:

On Fri, 2011-05-06 at 15:14 -0600, Christian wrote:
> I am sorry if this has been asked in the past. I can't seem to find
> information on it.
> 
> I have two questions, but they are somewhat related.
> 
> #1) Let's say you are tracking messages and extracting the hash tags from
> the message and storing them as one field (#hash1#hash2#hash3). This means
> you might have a line that looks something like the following:
>       2343    2011-05-06T03:04:00.000Z    username
> some+message+goes+here#with+#hash+#tags    #with#hash#tags   some    other
>  info
> 
> How can I get the # of tweets per hash tag? Also, how can I get the # of
> tweets per user per hash tag?
> I know I can use the STRSPLIT function to split on '#'. That will give me a
> bag of hash tags. How can I then group by these such that each hash tag has
> a set of tweets?
You will need to 'FLATTEN' the bag of hashtags then do a 'GROUP BY' on
the hashtag itself.

> 
> 
> #2) Let's say you have a field that has a fairly small, but still unknown
> number of unique values (say between 20-5). I know I can group by these
> fields to get a count by doing something like so:
> 
> A = LOAD '/some/dir' Using PigStorage (date, directive);
> 
> B = GROUP A by (date, directive);
> 
> C = FOREACH B GENERATE FLATTEN(group), COUNT(A.date);
> 
>     But now I want to end up something like the following:
> 
> 2011-05-01    DIRECTIVE1    32423    DIRECTIVE2    3433    DIRECTIVE3
>  1983
> 
> If I knew the directives ahead of time, I know I can do something like the
> following:
> 
> D = GROUP C BY date;
> 
> E = FOREACH D {
>      DIRECTIVE1 = FILTER type_count by directive == 'DIRECTIVE1';
>      DIRECTIVE2 = FILTER type_count by directive == 'DIRECTIVE2';
>      DIRECTIVE3 = FILTER type_count by directive == 'DIRECTIVE3';
>         GENERATE group, 'DIRECTIVE1', COUNT(DIRECTIVE1.date), 'DIRECTIVE2',
> COUNT(DIRECTIVE2.date), 'DIRECTIVE3', COUNT(DIRECTIVE3.date);
> }
> 
> But how do I do this w/o having to hardcode the filters? Am I thinking about
> this all wrong?
> 
It's really a matter of how you structure your data ahead of time.
Imagine the data looking like this instead (call it X):

201101,directive1
201101,directive1
201101,directive2
201101,directive2
201101,directive2
201101,directive3
201102,directive2
201102,directive4
201103,directive1


then, a simple:

Y = GROUP X BY (date,directive);
Z = FOREACH Y GENERATE FLATTEN(group) AS (date,directive), COUNT(X) AS
num_occurrences;

would result in:

201101,directive1,2
201101,directive2,3
201101,directive3,1
201102,directive2,1
201102,directive4,1
201103,directive1,1

At least, that's what it _seems_ like you're asking for.


--jacob
@thedatachef


> Thanks very much for you help,
> Christian

Re: Working with an unknown number of values

Posted by Xiaomeng Wan <sh...@gmail.com>.

you can group on group, like this:

A = LOAD '/some/dir' Using PigStorage (date, directive);

B = GROUP A by (date, directive);

C = FOREACH B GENERATE FLATTEN(group) as (date, directive), COUNT(A) as cnt;

D = group c by date;

E = foreach D generate group as date, c.(directive,cnt) as cnts;

Shawn

On Fri, May 6, 2011 at 3:14 PM, Christian <en...@gmail.com> wrote:
> I am sorry if this has been asked in the past. I can't seem to find
> information on it.
>
> I have two questions, but they are somewhat related.
>
> #1) Let's say you are tracking messages and extracting the hash tags from
> the message and storing them as one field (#hash1#hash2#hash3). This means
> you might have a line that looks something like the following:
>      2343    2011-05-06T03:04:00.000Z    username
> some+message+goes+here#with+#hash+#tags    #with#hash#tags   some    other
>  info
>
> How can I get the # of tweets per hash tag? Also, how can I get the # of
> tweets per user per hash tag?
> I know I can use the STRSPLIT function to split on '#'. That will give me a
> bag of hash tags. How can I then group by these such that each hash tag has
> a set of tweets?
>
>
> #2) Let's say you have a field that has a fairly small, but still unknown
> number of unique values (say between 20-5). I know I can group by these
> fields to get a count by doing something like so:
>
> A = LOAD '/some/dir' Using PigStorage (date, directive);
>
> B = GROUP A by (date, directive);
>
> C = FOREACH B GENERATE FLATTEN(group), COUNT(A.date);
>
>    But now I want to end up something like the following:
>
> 2011-05-01    DIRECTIVE1    32423    DIRECTIVE2    3433    DIRECTIVE3
>  1983
>
> If I knew the directives ahead of time, I know I can do something like the
> following:
>
> D = GROUP C BY date;
>
> E = FOREACH D {
>     DIRECTIVE1 = FILTER type_count by directive == 'DIRECTIVE1';
>     DIRECTIVE2 = FILTER type_count by directive == 'DIRECTIVE2';
>     DIRECTIVE3 = FILTER type_count by directive == 'DIRECTIVE3';
>        GENERATE group, 'DIRECTIVE1', COUNT(DIRECTIVE1.date), 'DIRECTIVE2',
> COUNT(DIRECTIVE2.date), 'DIRECTIVE3', COUNT(DIRECTIVE3.date);
> }
>
> But how do I do this w/o having to hardcode the filters? Am I thinking about
> this all wrong?
>
> Thanks very much for you help,
> Christian
>