Posted to user@pig.apache.org by Dmitriy Ryaboy <dv...@gmail.com> on 2012/02/03 00:43:17 UTC

Re: Non-standard grouping

"records before" is kind of hard to define in an MR paradigm.
I suppose you could group and then run the records through an accumulative
UDF. But this is feeling very hacky. Is there a more scalable
(order-independent) way you can do what you need?
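
A minimal sketch of why the requested count is order-dependent, using the sample rows from the original question. The function name `rollup_counts` is hypothetical, and this is plain Python rather than a Pig UDF; it only illustrates that the same records in a different order give different counts, which is why the ordering would have to be fixed explicitly before any such pass.

```python
def rollup_counts(records):
    """Count records up to and including each proper event, in input order."""
    out, pending = [], 0
    for key, event, _text in records:
        pending += 1
        # An empty string or 'undefined' is not a proper event value.
        if event not in ("", "undefined"):
            out.append((key, event, pending))
            pending = 0
    return out


records = [
    ("1", "undefined", "text1"),
    ("1", "", "text2"),
    ("1", "event1", "text3"),
    ("1", "undefined", "text4"),
    ("1", "event2", "text5"),
    ("1", "event3", "text6"),
]

in_order = rollup_counts(records)
reordered = rollup_counts(list(reversed(records)))

print(in_order)   # counts for the original order
print(reordered)  # same records, different order, different counts
```
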

On Thu, Jan 26, 2012 at 4:32 PM, Grig Gheorghiu <gr...@gmail.com> wrote:

> Could you even do it with a UDF? In a regular programming language
> you can easily do it with a sentinel that you keep track of, but in
> Pig I can't figure it out....
>
> On Thu, Jan 26, 2012 at 4:24 PM, Prashant Kommireddi
> <pr...@gmail.com> wrote:
> > Grig, I am afraid there is nothing built into Pig to do this.
> >
> > On Thu, Jan 26, 2012 at 4:08 PM, Grig Gheorghiu <grig.gheorghiu@gmail.com> wrote:
> >
> >> The count of lines seen up to and including a proper event value (3
> >> lines for event1, 2 for event2, 1 for event3).
> >>
> >> On Thu, Jan 26, 2012 at 4:06 PM, Prashant Kommireddi
> >> <pr...@gmail.com> wrote:
> >> > What is the last field in your output?
> >> >
> >> > (1,event1,3)
> >> > (1,event2,2)
> >> > (1,event3,1)
> >> >
> >> > On Thu, Jan 26, 2012 at 4:02 PM, Grig Gheorghiu <grig.gheorghiu@gmail.com> wrote:
> >> >
> >> >> Let's say I have this dataset:
> >> >>
> >> >> 1,undefined,text1
> >> >> 1,,text2
> >> >> 1,event1,text3
> >> >> 1,undefined,text4
> >> >> 1,event2,text5
> >> >> 1,event3,text6
> >> >>
> >> >> I would like to group by 1st value, but not quite an ordinary
> >> >> grouping. I would like all lines that contain either an empty value or
> >> >> 'undefined' on the 2nd position to be rolled up in the first line that
> >> >> contains a proper value in the 2nd position. So basically I'd like to
> >> >> obtain this relation:
> >> >>
> >> >> (1,event1,3)
> >> >> (1,event2,2)
> >> >> (1,event3,1)
> >> >>
> >> >> (where the 3rd value is the count of lines that were seen before a
> >> >> proper 'event' line was seen).
> >> >>
> >> >> Is this possible with Pig?
> >> >>
> >> >> Thanks!
> >> >>
> >> >> Grig
> >> >>
> >>
>

Re: Non-standard grouping

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Ah, yeah, if you can shrink data down that much, going outside of Pig (or
doing things in a UDF) is the way to go.

D

On Thu, Feb 2, 2012 at 3:45 PM, Grig Gheorghiu <gr...@gmail.com> wrote:

> Hey Dmitriy! Unfortunately that's the requirement. The solution I
> found so far is to do all the pre-filtering and grouping I can in Pig,
> and then run Python on the output file generated by Pig. That file is
> ~ 300 MB, so it's not a problem to just run through Python.
>
> Thanks for getting back to me.
>
> Grig
>

Re: Non-standard grouping

Posted by Grig Gheorghiu <gr...@gmail.com>.
Hey Dmitriy! Unfortunately that's the requirement. The solution I
found so far is to do all the pre-filtering and grouping I can in Pig,
and then run Python on the output file generated by Pig. That file is
~ 300 MB, so it's not a problem to just run through Python.
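
The post-processing script itself is not shown in this thread. As a rough sketch of what such a single pass could look like, assuming the Pig output is comma-separated with three fields per line and is already in the desired order (the function name `roll_up` and the in-memory sample are illustrative assumptions, not the actual script):

```python
import csv
import io


def roll_up(lines):
    """Yield (key, event, count) for each proper event line.

    count is the number of lines seen for that key since the previous
    proper event, including the event line itself.
    """
    pending = {}  # key -> lines seen since the last proper event
    for key, event, _text in csv.reader(lines):
        pending[key] = pending.get(key, 0) + 1
        # An empty field or 'undefined' is rolled up into the next event.
        if event not in ("", "undefined"):
            yield key, event, pending[key]
            pending[key] = 0


# Sample rows from the original question; a real run would pass an
# open file object instead of this in-memory buffer.
sample = io.StringIO(
    "1,undefined,text1\n"
    "1,,text2\n"
    "1,event1,text3\n"
    "1,undefined,text4\n"
    "1,event2,text5\n"
    "1,event3,text6\n"
)
for row in roll_up(sample):
    print(row)
```

Because the input is read line by line, a file of a few hundred MB does not need to fit in memory. Trailing lines that are never followed by a proper event are dropped in this sketch.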

Thanks for getting back to me.

Grig
