Posted to user@pig.apache.org by Grig Gheorghiu <gr...@gmail.com> on 2012/01/27 01:02:02 UTC

Non-standard grouping

Let's say I have this dataset:

1,undefined,text1
1,,text2
1,event1,text3
1,undefined,text4
1,event2,text5
1,event3,text6

I would like to group by 1st value, but not quite an ordinary
grouping. I would like all lines that contain either an empty value or
'undefined' on the 2nd position to be rolled up in the first line that
contains a proper value in the 2nd position. So basically I'd like to
obtain this relation:

(1,event1,3)
(1,event2,2)
(1,event3,1)

(where the 3rd value is the count of lines that were seen before a
proper 'event' line was seen).

Is this possible with Pig?

Thanks!

Grig

Re: Non-standard grouping

Posted by Dmitriy Ryaboy <dv...@gmail.com>.

Ah, yeah, if you can shrink data down that much, going outside of Pig (or
doing things in a UDF) is the way to go.

D

On Thu, Feb 2, 2012 at 3:45 PM, Grig Gheorghiu <gr...@gmail.com> wrote:

> Hey Dmitriy! Unfortunately that's the requirement. The solution I
> found so far is to do all the pre-filtering and grouping I can in Pig,
> and then run Python on the output file generated by Pig. That file is
> ~ 300 MB, so it's not a problem to just run through Python.
>
> Thanks for getting back to me.
>
> Grig

Re: Non-standard grouping

Posted by Grig Gheorghiu <gr...@gmail.com>.

Hey Dmitriy! Unfortunately that's the requirement. The solution I
found so far is to do all the pre-filtering and grouping I can in Pig,
and then run Python on the output file generated by Pig. That file is
~300 MB, so it's not a problem to just run it through Python.
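
For illustration, here is a minimal sketch of what such a Python pass over the Pig output might look like. It is not the script used in this thread. It assumes the file keeps the key,event,text layout of the sample data, that rows are still in their original order, and that an empty string or 'undefined' marks a placeholder. The names summarize and PLACEHOLDERS are made up for this example.

```python
import csv
import io

# Assumed placeholder values in the 2nd column (taken from the sample data).
PLACEHOLDERS = {"", "undefined"}


def summarize(lines, delimiter=","):
    """Read key,event,text rows in order; return (key, event, count) tuples.

    count is the number of rows seen for that key up to and including the
    proper event row, counted since the previous proper event.
    """
    counts = {}   # key -> rows seen since the last proper event for that key
    result = []
    for row in csv.reader(lines, delimiter=delimiter):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        key, event = row[0], row[1]
        counts[key] = counts.get(key, 0) + 1
        if event not in PLACEHOLDERS:
            result.append((key, event, counts.pop(key)))
    return result


# The sample data from the original question, as an in-memory file.
sample = io.StringIO(
    "1,undefined,text1\n"
    "1,,text2\n"
    "1,event1,text3\n"
    "1,undefined,text4\n"
    "1,event2,text5\n"
    "1,event3,text6\n"
)
print(summarize(sample))
# [('1', 'event1', 3), ('1', 'event2', 2), ('1', 'event3', 1)]
```

Because it keeps one counter per key, this version also copes with rows for different keys being interleaved. Trailing placeholder rows that are never followed by a proper event are silently dropped, which may or may not be what is wanted.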

Thanks for getting back to me.

Grig

On Thu, Feb 2, 2012 at 3:43 PM, Dmitriy Ryaboy <dv...@gmail.com> wrote:
> "records before" is kind of hard to define in an MR paradigm.
> I suppose you could group and then run the records through an accumulative
> UDF. But this is feeling very hacky. Is there a more scalable
> (order-independent) way you can do what you need?

Re: Non-standard grouping

Posted by Dmitriy Ryaboy <dv...@gmail.com>.

"Records before" is kind of hard to define in a MapReduce (MR) paradigm.
I suppose you could group and then run the records through an accumulative
UDF. But this is feeling very hacky. Is there a more scalable
(order-independent) way you can do what you need?
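
To make the ordering problem concrete, here is a rough sketch of the per-group logic such a UDF would need, written as a plain Python function rather than against Pig's actual UDF interfaces. Since the order of tuples inside a grouped bag is generally not something you can rely on, the sketch assumes each record carries an explicit ordering field (seq, e.g. a line number or timestamp). That field is hypothetical; the sample data in this thread does not have one. The function name count_until_event is also made up, and None is assumed to stand for a null (empty) field.

```python
def count_until_event(bag):
    """Roll placeholder records up into the next proper event.

    bag: iterable of (seq, event) tuples for ONE group key, in any order.
    seq is an assumed ordering field; it is not part of the thread's data.
    Returns a list of (event, count) tuples, where count includes the
    event record itself.
    """
    out = []
    run = 0  # records seen since the last proper event
    # Restore the intended order explicitly instead of trusting bag order.
    for _seq, event in sorted(bag, key=lambda t: t[0]):
        run += 1
        if event not in (None, "", "undefined"):
            out.append((event, run))
            run = 0
    return out


# Same records as the sample data, deliberately shuffled.
bag = [(5, "event2"), (1, "undefined"), (3, "event1"),
       (2, ""), (6, "event3"), (4, "undefined")]
print(count_until_event(bag))
# [('event1', 3), ('event2', 2), ('event3', 1)]
```

Without something like seq to sort on, the result depends on whatever order the records happen to arrive in, which is why this approach feels hacky.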

On Thu, Jan 26, 2012 at 4:32 PM, Grig Gheorghiu <gr...@gmail.com> wrote:

> Could you even do it with a UDF? In a regular programming language
> you can easily do it with a sentinel that you keep track of, but in
> Pig I can't figure it out....

Re: Non-standard grouping

Posted by Grig Gheorghiu <gr...@gmail.com>.

Could you even do it with a UDF? In a regular programming language
you can easily do it with a sentinel that you keep track of, but in
Pig I can't figure it out...
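
For reference, the sentinel approach mentioned above might look roughly like this in Python. This is only a sketch: it assumes rows arrive in their original order, that all rows for a key are contiguous, and that an empty string or 'undefined' marks a placeholder. The function name roll_up is made up for this example.

```python
def roll_up(rows):
    """Yield (key, event, count) for each proper event row.

    rows: iterable of (key, event, text) tuples in their original order.
    count includes the event row plus the placeholder rows before it.
    """
    current_key = None
    seen = 0  # the "sentinel": rows seen since the last proper event
    for key, event, _text in rows:
        if key != current_key:
            # New key: start counting from scratch.
            current_key, seen = key, 0
        seen += 1
        if event not in ("", "undefined"):
            yield (key, event, seen)
            seen = 0


rows = [
    ("1", "undefined", "text1"),
    ("1", "", "text2"),
    ("1", "event1", "text3"),
    ("1", "undefined", "text4"),
    ("1", "event2", "text5"),
    ("1", "event3", "text6"),
]
print(list(roll_up(rows)))
# [('1', 'event1', 3), ('1', 'event2', 2), ('1', 'event3', 1)]
```

The whole trick is the single counter that survives from one row to the next, which is exactly the kind of ordered, stateful iteration that is hard to express in plain Pig Latin.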

On Thu, Jan 26, 2012 at 4:24 PM, Prashant Kommireddi
<pr...@gmail.com> wrote:
> Grig, I am afraid there is nothing built into Pig to do this.

Re: Non-standard grouping

Posted by Prashant Kommireddi <pr...@gmail.com>.

Grig, I am afraid there is nothing built into Pig to do this.

On Thu, Jan 26, 2012 at 4:08 PM, Grig Gheorghiu <gr...@gmail.com> wrote:

> The count of lines seen up to and including a proper event value (3
> lines for event1, 2 for event2, 1 for event3).

Re: Non-standard grouping

Posted by Grig Gheorghiu <gr...@gmail.com>.

The count of lines seen up to and including a proper event value (3
lines for event1, 2 for event2, 1 for event3).

On Thu, Jan 26, 2012 at 4:06 PM, Prashant Kommireddi
<pr...@gmail.com> wrote:
> What is the last field in your output?
>
> (1,event1,3)
> (1,event2,2)
> (1,event3,1)

Re: Non-standard grouping

Posted by Prashant Kommireddi <pr...@gmail.com>.

What is the last field in your output?

(1,event1,3)
(1,event2,2)
(1,event3,1)

On Thu, Jan 26, 2012 at 4:02 PM, Grig Gheorghiu <gr...@gmail.com> wrote:

> Let's say I have this dataset:
>
> 1,undefined,text1
> 1,,text2
> 1,event1,text3
> 1,undefined,text4
> 1,event2,text5
> 1,event3,text6
>
> I would like to group by 1st value, but not quite an ordinary
> grouping. I would like all lines that contain either an empty value or
> 'undefined' on the 2nd position to be rolled up in the first line that
> contains a proper value in the 2nd position. So basically I'd like to
> obtain this relation:
>
> (1,event1,3)
> (1,event2,2)
> (1,event3,1)
>
> (where the 3rd value is the count of lines that were seen before a
> proper 'event' line was seen).
>
> Is this possible with Pig?
>
> Thanks!
>
> Grig
>