
Posted to user@pig.apache.org by Vincent Barat <vb...@ubikod.com> on 2010/07/12 15:23:34 UTC

Any better way to ensure unicity ?

  Hello everybody,

I have a simple table containing sessions. Each session has a
unique key (the sid, which is actually a UUID).
But a session can be present several times in my input table.

I want to ensure that I have only one record for each sid (because I
perform subsequent JOINs based on this sid).

Currently I use the following script, but I wonder if there is
something more efficient:

sessions = GROUP sessions BY sid;
sessions = FOREACH sessions { first = LIMIT sessions 1; GENERATE FLATTEN(first); };
sessions = FOREACH sessions GENERATE sid, .. and all the fields I
have in the session table...

Do you see any optimization I can do, especially on the FLATTEN /
GENERATE part?

Thank you very much for your help.

Re: Any better way to ensure unicity ?

Posted by Dmitriy Ryaboy <dv...@gmail.com>.

Oh, and of course you can go crazy with this:

fract = limit (foreach (load 'tmp/numbers' as (letter:chararray, x:int, y:int)) generate letter) 1;



Re: Any better way to ensure unicity ?

Posted by Dmitriy Ryaboy <dv...@gmail.com>.

Um.

grunt> nums = limit (load 'tmp/numbers' as (letter:chararray, x:int, y:int)) 1;
grunt> dump nums
(a,1,2)

grunt> nums = load 'tmp/numbers' as (letter:chararray, x:int, y:int);
grunt> fract = limit (foreach nums generate letter) 1;
grunt> dump fract
(a)

Note that you can do the same for a number of operators, including, most
handily, foreach:

foo = foreach (group data by id) generate group as id, COUNT(data) as num_rows;
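
Applied to the sessions table from this thread, the same inline nesting can be used to check which sids are actually duplicated before deduplicating. This is only a sketch: the input path 'tmp/sessions' and the aliases counts and dups are assumptions made for illustration, and the schema is the one Vincent quoted.

```pig
-- Assumed input path; the schema is the one quoted earlier in the thread.
sessions = LOAD 'tmp/sessions' AS (sid:chararray, infoid:chararray, imei:chararray, start:long);

-- GROUP nested inside FOREACH, as in the foo example above:
-- one output row per sid, with the number of input rows that share it.
counts = FOREACH (GROUP sessions BY sid) GENERATE group AS sid, COUNT(sessions) AS num_rows;

-- Keep only the sids that occur more than once.
dups = FILTER counts BY num_rows > 1;

DUMP dups;
```

If dups comes back empty, the deduplication step is unnecessary for that input.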


Re: Any better way to ensure unicity ?

Posted by hc busy <hc...@gmail.com>.

But, to be clear, PigLatin is easy to read tho, so far, even with a 2k line
script...

On Thu, Jul 15, 2010 at 3:33 PM, hc busy <hc...@gmail.com> wrote:

> LIMIT is an extra line to type. But I guess if we're using pig, we don't
> really care for elegance and concision huh?
>
>

Re: Any better way to ensure unicity ?

Posted by hc busy <hc...@gmail.com>.

Hmmm.... I didn't know you could go crazy like that. I take back what I said
about Pig not being concise or elegant.

Oh, there isn't a way to extend the language. I mean, unless, while I wasn't
looking, those additional "#define" syntaxes have already been implemented.
And I still think recursive functions are a must-have.



On Thu, Jul 15, 2010 at 8:32 PM, Mridul Muralidharan
<mr...@yahoo-inc.com>wrote:

>
> It is more about maintaining yet another udf which duplicates functionality
> which is done by the base language ...
> So tradeoff is between using a language construct (which might be optimized
> internally) versus writing extension code.
>
> Mridul
>
>

Re: Any better way to ensure unicity ?

Posted by Mridul Muralidharan <mr...@yahoo-inc.com>.

It is more about maintaining yet another udf which duplicates 
functionality which is done by the base language ...
So tradeoff is between using a language construct (which might be 
optimized internally) versus writing extension code.

Mridul
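
For reference, the language-construct route discussed in this thread can be written end to end as below. It is a sketch rather than a tuned solution: the input path 'tmp/sessions' and the aliases by_sid, one_per_sid and sessions_unique are assumptions, the schema is the one Vincent quoted, and which record survives for each sid is arbitrary.

```pig
-- Assumed input path; the schema is the one quoted in this thread.
sessions = LOAD 'tmp/sessions' AS (sid:chararray, infoid:chararray, imei:chararray, start:long);

-- Keep one arbitrary record per sid using only built-in operators.
by_sid = GROUP sessions BY sid;
one_per_sid = FOREACH by_sid {
    first = LIMIT sessions 1;
    GENERATE FLATTEN(first);
};

-- Renaming-only projection: restores the original field names,
-- so later statements do not see the first:: prefixes.
sessions_unique = FOREACH one_per_sid GENERATE
    first::sid    AS sid,
    first::infoid AS infoid,
    first::imei   AS imei,
    first::start  AS start;

DESCRIBE sessions_unique;
```

Using separate aliases instead of reassigning sessions keeps the original relation available, at the cost of pointing later statements at sessions_unique.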

On Friday 16 July 2010 04:03 AM, hc busy wrote:
> LIMIT is an extra line to type. But I guess if we're using pig, we don't
> really care for elegance and concision huh?
>
>
> On Wed, Jul 14, 2010 at 12:25 PM, Dmitriy Ryaboy<dv...@gmail.com>  wrote:
>
>> hc, two things about that approach :
>>
>> 1) if you use the accumulator interface, the bag won't be materialized
>> 2) am I missing something? Why can't you just use LIMIT 1?
>>
>> -D
>>
>> On Wed, Jul 14, 2010 at 10:39 AM, hc busy<hc...@gmail.com>  wrote:
>>
>>> Write a UDF called
>>>
>>> takeOne()
>>>
>>> that takes the first thing from the bag and returns it. The only problem
>>> that I'm having is that this UDF cannot signal to pig that it is done. So
>>> that whole bag is always created in its entirety.
>>>
>>>
>>> Btw, this UDF will be able to accomplish the same task (picking one
>>> item out of a bag)
>>>
>>> https://issues.apache.org/jira/browse/PIG-1386
>>>
>>> because MaxTupleByNthField extends the original MaxTupleBy1stField by
>>> allowing you to specify any column in the tuple as the comparison key.
>>> And because it handles typing correctly, your schema will be as you expect
>>> automatically.
>>>
>>> sessions = GROUP sessions BY sid;
>>> sessions = FOREACH sessions { first = LIMIT sessions 1; GENERATE
>>> FLATTEN(first);};
>>> sessions = FOREACH sessions GENERATE sid, .. and all the fields I have in
>>> the session table...
>>>
>>>
>>> is replaced with
>>>
>>> session = GROUP session by sid;
>>> session = FOREACH session generate MaxTupleByNthField(session);
>>>
>>> That's it. It'll have the right schema, with all the columns from before,
>>> but it chooses one of the data points.
>>>
>>>
>>>
>>>
>>> On Wed, Jul 14, 2010 at 9:39 AM, Scott Carey <scott@richrelevance.com> wrote:
>>>
>>>> I run into this situation all the time.  You have to do a foreach ...
>>>> generate projection at the end to rename everything.
>>>>
>>>> The way aliases work in pig, you quite often have to do 'renaming only'
>>>> projections if you don't want to make other bits of code later change.
>>>> After the group and limit:
>>>>
>>>> sessions = FOREACH sessions GENERATE field1 as field1, field2 as field2,
>>>> field3 as field3 . . .
>>>>
>>>> That will get rid of the :: prefixes and make the alias shareable with
>>>> later pig code and not dependent on what you do in the group to filter data.
>>>>
>>>>
>>>> On Jul 13, 2010, at 1:48 AM, Vincent Barat wrote:
>>>>
>>>>>   Actually you are right: the schema is the same, nevertheless, the
>>>>> "naming" of the various columns in the schema is modified, and thus
>>>>> my subsequent operations fail:
>>>>>
>>>>> original schema:
>>>>> sessions: {sid: chararray,infoid: chararray,imei: chararray,start:
>>> long}
>>>>>
>>>>> modified schema:
>>>>> sessions: {first::sid: chararray,first::infoid:
>>>>> chararray,first::imei: chararray,first::start: long}
>>>>>
>>>>> Do you know a workaround ?
>>>>>
>>>>> Le 13/07/10 10:13, Mridul Muralidharan a écrit :
>>>>>>
>>>>>> The flatten will return the same schema as before (in 'first') :
>>>>>> so unless you are modifying the fields or the order in which they
>>>>>> are generated (which I dont think you are in view of your comment
>>>>>> that it should work with and without this), you can simply go with :
>>>>>>
>>>>>> -- Or whatever works for you.
>>>>>> %define PARALLELISM        '10'
>>>>>>
>>>>>> sessions = DISTINCT sessions PARALLEL $PARALLELISM;
>>>>>>
>>>>>> OR
>>>>>>
>>>>>> sessions = GROUP sessions BY sid  PARALLEL $PARALLELISM;
>>>>>> sessions = FOREACH sessions { first = LIMIT sessions 1; GENERATE
>>>>>> FLATTEN(first);};
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> The schema at the end would be exactly same as start of the code
>>>>>> snippet for 'sessions'.
>>>>>>
>>>>>>
>>>>>> Regards,
>>>>>> Mridul
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tuesday 13 July 2010 01:01 PM, Vincent Barat wrote:
>>>>>>>
>>>>>>>
>>>>>>> Le 12/07/10 16:56, Mridul Muralidharan a écrit :
>>>>>>>>
>>>>>>>> I am not sure what you mean here exactly.
>>>>>>>> Will a sid row have multiple (different) values for the other
>>>>>>>> fields ?
>>>>>>> Yes.
>>>>>>>>
>>>>>>>> But if you want to pick any one row for a given sid, then I think
>>>>>>>> what you have below might be good enough (you can omit the last
>>>>>>>> line though).
>>>>>>> OK. Thanks. The last line is used to retrieve the exact same data
>>>>>>> structure and naming as the original table. This way, I can
>>>>>>> optionally perform this treatment without modifying my code. If you
>>>>>>> know a better way...
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Mridul
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Monday 12 July 2010 06:53 PM, Vincent Barat wrote:
>>>>>>>>>     Hello everybody,
>>>>>>>>>
>>>>>>>>> I have a simple table containing sessions. Each sessions has an
>>>>>>>>> unique key (the sid, which is actually a uuid).
>>>>>>>>> But a session can be present several times in my input table.
>>>>>>>>>
>>>>>>>>> I want to ensure that I only have 1 record for each sid (because
>> I
>>>>>>>>> perform subsequent JOIN based on this sid).
>>>>>>>>>
>>>>>>>>> Currently I use the following script, but I wonder if there is
>>>>>>>>> something more efficient:
>>>>>>>>>
>>>>>>>>> sessions = GROUP sessions BY sid;
>>>>>>>>> sessions = FOREACH sessions { first = LIMIT sessions 1; GENERATE
>>>>>>>>> FLATTEN(first);};
>>>>>>>>> sessions = FOREACH sessions GENERATE sid, .. and all the fields I
>>>>>>>>> have in the session table...
>>>>>>>>>
>>>>>>>>> Do you see any optimization I can do, especially on the FLATTEN /
>>>>>>>>> GENERATE part ?
>>>>>>>>>
>>>>>>>>> Thank you very much for your help.
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>>
>>>>
>>>>
>>>
>>

Re: Any better way to ensure unicity ?

Posted by hc busy <hc...@gmail.com>.

LIMIT is an extra line to type. But I guess if we're using pig, we don't
really care for elegance and concision huh?


On Wed, Jul 14, 2010 at 12:25 PM, Dmitriy Ryaboy <dv...@gmail.com> wrote:

> hc, two things about that approach :
>
> 1) if you use the accumulator interface, the bag won't be materialized
> 2) am I missing something? Why can't you just use LIMIT 1?
>
> -D
>
> On Wed, Jul 14, 2010 at 10:39 AM, hc busy <hc...@gmail.com> wrote:
>
> > Write a UDF called
> >
> > takeOne()
> >
> > that takes the first thing from the bag and returns it. The only problem
> > that I'm having is that this UDF cannot signal to pig that it is done. So
> > that whole bag is always created in its entirety.
> >
> >
> > Btw, this UDF will be able to accomplish the same task (picking out one
> > item out of a bag)
> >
> > https://issues.apache.org/jira/browse/PIG-1386
> >
> > because MaxTupleByNthField extends the original MaxTupleBy1stField by
> > allowing you to specify any column in the tuple as the comparison key.
> And
> > because it handles typing correctly, your schema will be as you expect
> > automatically.
> >
> > sessions = GROUP sessions BY sid;
> > sessions = FOREACH sessions { first = LIMIT sessions 1; GENERATE
> > FLATTEN(first);};
> > sessions = FOREACH sessions GENERATE sid, .. and all the fields I have in
> > the session table...
> >
> >
> > is replaced with
> >
> > session = GROUP session by sid;
> > session = FOREACH session generate MaxTupleByNthField(session);
> >
> > that's it. it'll have the right schema, all columns from before, but
> > chooses one of the data points.
> >
> >
> >
> >
> > On Wed, Jul 14, 2010 at 9:39 AM, Scott Carey <scott@richrelevance.com
> > >wrote:
> >
> > > I run into this situation all the time.  You have to do a foreach ...
> > > generate projection at the end to rename everything.
> > >
> > > The way aliases work in pig, you quite often have to do 'renaming only'
> > > projections if you don't want to make other bits of code later change:
> > > After the group and limit:
> > >
> > > sessions = FOREACH sessions GENERATE field1 as field1, field2 as field2,
> > > field3 as field3 . . .
> > >
> > > That will get rid of the :: prefixes and make the alias shareable with
> > > later pig code and not dependent on what you do in the group to filter
> > data.
> > >
> > >
> > > On Jul 13, 2010, at 1:48 AM, Vincent Barat wrote:
> > >
> > > >  Actually you are right: the schema is the same, nevertheless, the
> > > > "naming" of the various columns in the schema is modified, and thus
> > > > my subsequent operations fail:
> > > >
> > > > original schema:
> > > > sessions: {sid: chararray,infoid: chararray,imei: chararray,start:
> > long}
> > > >
> > > > modified schema:
> > > > sessions: {first::sid: chararray,first::infoid:
> > > > chararray,first::imei: chararray,first::start: long}
> > > >
> > > > Do you know a workaround ?
> > > >
> > > > Le 13/07/10 10:13, Mridul Muralidharan a écrit :
> > > >>
> > > >> The flatten will return the same schema as before (in 'first') :
> > > >> so unless you are modifying the fields or the order in which they
> > > >> are generated (which I dont think you are in view of your comment
> > > >> that it should work with and without this), you can simply go with :
> > > >>
> > > >> -- Or whatever works for you.
> > > >> %define PARALLELISM        '10'
> > > >>
> > > >> sessions = DISTINCT sessions PARALLEL $PARALLELISM;
> > > >>
> > > >> OR
> > > >>
> > > >> sessions = GROUP sessions BY sid  PARALLEL $PARALLELISM;
> > > >> sessions = FOREACH sessions { first = LIMIT sessions 1; GENERATE
> > > >> FLATTEN(first);};
> > > >>
> > > >>
> > > >>
> > > >>
> > > >> The schema at the end would be exactly same as start of the code
> > > >> snippet for 'sessions'.
> > > >>
> > > >>
> > > >> Regards,
> > > >> Mridul
> > > >>
> > > >>
> > > >>
> > > >> On Tuesday 13 July 2010 01:01 PM, Vincent Barat wrote:
> > > >>>
> > > >>>
> > > >>> Le 12/07/10 16:56, Mridul Muralidharan a écrit :
> > > >>>>
> > > >>>> I am not sure what you mean here exactly.
> > > >>>> Will a sid row have multiple (different) values for the other
> > > >>>> fields ?
> > > >>> Yes.
> > > >>>>
> > > >>>> But if you want to pick any one row for a given sid, then I think
> > > >>>> what you have below might be good enough (you can omit the last
> > > >>>> line though).
> > > >>> OK. Thanks. The last line is used to retrieve the exact same data
> > > >>> structure and naming as the original table. This way, I can
> > > >>> optionally perform this treatment without modifying my code. If you
> > > >>> know a better way...
> > > >>>
> > > >>> Cheers,
> > > >>>
> > > >>>>
> > > >>>> Regards,
> > > >>>> Mridul
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>> On Monday 12 July 2010 06:53 PM, Vincent Barat wrote:
> > > >>>>>    Hello everybody,
> > > >>>>>
> > > >>>>> I have a simple table containing sessions. Each sessions has an
> > > >>>>> unique key (the sid, which is actually a uuid).
> > > >>>>> But a session can be present several times in my input table.
> > > >>>>>
> > > >>>>> I want to ensure that I only have 1 record for each sid (because
> I
> > > >>>>> perform subsequent JOIN based on this sid).
> > > >>>>>
> > > >>>>> Currently I use the following script, but I wonder if there is
> > > >>>>> something more efficient:
> > > >>>>>
> > > >>>>> sessions = GROUP sessions BY sid;
> > > >>>>> sessions = FOREACH sessions { first = LIMIT sessions 1; GENERATE
> > > >>>>> FLATTEN(first);};
> > > >>>>> sessions = FOREACH sessions GENERATE sid, .. and all the fields I
> > > >>>>> have in the session table...
> > > >>>>>
> > > >>>>> Do you see any optimization I can do, especially on the FLATTEN /
> > > >>>>> GENERATE part ?
> > > >>>>>
> > > >>>>> Thank you very much for your help.
> > > >>>>
> > > >>>>
> > > >>
> > > >>
> > >
> > >
> >
>

Re: Any better way to ensure unicity ?

Posted by Dmitriy Ryaboy <dv...@gmail.com>.

hc, two things about that approach:

1) If you use the Accumulator interface, the bag won't be materialized.
2) Am I missing something? Why can't you just use LIMIT 1?

-D
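
If it matters which row survives for each sid, one option is to sort inside the nested block before the LIMIT. This is only a sketch: the input path 'sessions' and the alias names are placeholders, the field list is taken from the schema quoted elsewhere in this thread, and treating start as the timestamp to sort on is an assumption.

```pig
-- Hypothetical input path; schema as quoted in this thread.
raw = LOAD 'sessions' AS (sid:chararray, infoid:chararray, imei:chararray, start:long);

by_sid = GROUP raw BY sid;

-- Keep the row with the smallest start for each sid (assumes start is the timestamp).
deduped = FOREACH by_sid {
    ordered = ORDER raw BY start ASC;
    first   = LIMIT ordered 1;
    GENERATE FLATTEN(first);
};
```

Without the ORDER, LIMIT 1 keeps an arbitrary row per group, which is fine when any row will do.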

On Wed, Jul 14, 2010 at 10:39 AM, hc busy <hc...@gmail.com> wrote:

> Write a UDF called
>
> takeOne()
>
> that takes the first thing from the bag and returns it. The only problem
> that I'm having is that this UDF cannot signal to pig that it is done. So
> that whole bag is always created in its entirety.
>
>
> Btw, this UDF will be able to accomplish the same task (picking out one
> item out of a bag)
>
> https://issues.apache.org/jira/browse/PIG-1386
>
> because MaxTupleByNthField extends the original MaxTupleBy1stField by
> allowing you to specify any column in the tuple as the comparison key. And
> because it handles typing correctly, your schema will be as you expect
> automatically.
>
> sessions = GROUP sessions BY sid;
> sessions = FOREACH sessions { first = LIMIT sessions 1; GENERATE
> FLATTEN(first);};
> sessions = FOREACH sessions GENERATE sid, .. and all the fields I have in
> the session table...
>
>
> is replaced with
>
> session = GROUP session by sid;
> session = FOREACH session generate MaxTupleByNthField(session);
>
> that's it. it'll have the right schema, all columns from before, but chooses
> one of the data points.
>
>
>
>
> On Wed, Jul 14, 2010 at 9:39 AM, Scott Carey <scott@richrelevance.com
> >wrote:
>
> > I run into this situation all the time.  You have to do a foreach ...
> > generate projection at the end to rename everything.
> >
> > The way aliases work in pig, you quite often have to do 'renaming only'
> > projections if you don't want to make other bits of code later change:
> > After the group and limit:
> >
> > sessions = FOREACH sessions GENERATE field1 as field1, field2 as field2,
> > field3 as field3 . . .
> >
> > That will get rid of the :: prefixes and make the alias shareable with
> > later pig code and not dependent on what you do in the group to filter
> data.
> >
> >
> > On Jul 13, 2010, at 1:48 AM, Vincent Barat wrote:
> >
> > >  Actually you are right: the schema is the same, nevertheless, the
> > > "naming" of the various columns in the schema is modified, and thus
> > > my subsequent operations fail:
> > >
> > > original schema:
> > > sessions: {sid: chararray,infoid: chararray,imei: chararray,start:
> long}
> > >
> > > modified schema:
> > > sessions: {first::sid: chararray,first::infoid:
> > > chararray,first::imei: chararray,first::start: long}
> > >
> > > Do you know a workaround ?
> > >
> > > Le 13/07/10 10:13, Mridul Muralidharan a écrit :
> > >>
> > >> The flatten will return the same schema as before (in 'first') :
> > >> so unless you are modifying the fields or the order in which they
> > >> are generated (which I dont think you are in view of your comment
> > >> that it should work with and without this), you can simply go with :
> > >>
> > >> -- Or whatever works for you.
> > >> %define PARALLELISM        '10'
> > >>
> > >> sessions = DISTINCT sessions PARALLEL $PARALLELISM;
> > >>
> > >> OR
> > >>
> > >> sessions = GROUP sessions BY sid  PARALLEL $PARALLELISM;
> > >> sessions = FOREACH sessions { first = LIMIT sessions 1; GENERATE
> > >> FLATTEN(first);};
> > >>
> > >>
> > >>
> > >>
> > >> The schema at the end would be exactly same as start of the code
> > >> snippet for 'sessions'.
> > >>
> > >>
> > >> Regards,
> > >> Mridul
> > >>
> > >>
> > >>
> > >> On Tuesday 13 July 2010 01:01 PM, Vincent Barat wrote:
> > >>>
> > >>>
> > >>> Le 12/07/10 16:56, Mridul Muralidharan a écrit :
> > >>>>
> > >>>> I am not sure what you mean here exactly.
> > >>>> Will a sid row have multiple (different) values for the other
> > >>>> fields ?
> > >>> Yes.
> > >>>>
> > >>>> But if you want to pick any one row for a given sid, then I think
> > >>>> what you have below might be good enough (you can omit the last
> > >>>> line though).
> > >>> OK. Thanks. The last line is used to retrieve the exact same data
> > >>> structure and naming as the original table. This way, I can
> > >>> optionally perform this treatment without modifying my code. If you
> > >>> know a better way...
> > >>>
> > >>> Cheers,
> > >>>
> > >>>>
> > >>>> Regards,
> > >>>> Mridul
> > >>>>
> > >>>>
> > >>>>
> > >>>> On Monday 12 July 2010 06:53 PM, Vincent Barat wrote:
> > >>>>>    Hello everybody,
> > >>>>>
> > >>>>> I have a simple table containing sessions. Each sessions has an
> > >>>>> unique key (the sid, which is actually a uuid).
> > >>>>> But a session can be present several times in my input table.
> > >>>>>
> > >>>>> I want to ensure that I only have 1 record for each sid (because I
> > >>>>> perform subsequent JOIN based on this sid).
> > >>>>>
> > >>>>> Currently I use the following script, but I wonder if there is
> > >>>>> something more efficient:
> > >>>>>
> > >>>>> sessions = GROUP sessions BY sid;
> > >>>>> sessions = FOREACH sessions { first = LIMIT sessions 1; GENERATE
> > >>>>> FLATTEN(first);};
> > >>>>> sessions = FOREACH sessions GENERATE sid, .. and all the fields I
> > >>>>> have in the session table...
> > >>>>>
> > >>>>> Do you see any optimization I can do, especially on the FLATTEN /
> > >>>>> GENERATE part ?
> > >>>>>
> > >>>>> Thank you very much for your help.
> > >>>>
> > >>>>
> > >>
> > >>
> >
> >
>

Re: Any better way to ensure unicity ?

Posted by hc busy <hc...@gmail.com>.

Write a UDF called

takeOne()

that takes the first thing from the bag and returns it. The only problem
I'm having is that this UDF cannot signal to Pig that it is done, so
the whole bag is always created in its entirety.


Btw, this UDF will be able to accomplish the same task (picking one item
out of a bag)

https://issues.apache.org/jira/browse/PIG-1386

because MaxTupleByNthField extends the original MaxTupleBy1stField by
allowing you to specify any column in the tuple as the comparison key. And
because it handles typing correctly, your schema will be as you expect
automatically.

sessions = GROUP sessions BY sid;
sessions = FOREACH sessions { first = LIMIT sessions 1; GENERATE
FLATTEN(first);};
sessions = FOREACH sessions GENERATE sid, .. and all the fields I have in
the session table...


is replaced with

session = GROUP session by sid;
session = FOREACH session generate MaxTupleByNthField(session);

That's it. It'll have the right schema and all the columns from before, but
it chooses one of the data points.




On Wed, Jul 14, 2010 at 9:39 AM, Scott Carey <sc...@richrelevance.com> wrote:

> I run into this situation all the time.  You have to do a foreach ...
> generate projection at the end to rename everything.
>
> The way aliases work in pig, you quite often have to do 'renaming only'
> projections if you don't want to make other bits of code later change:
> After the group and limit:
>
> sessions = FOREACH sessions GENERATE field1 as field1, field2 as field2,
> field3 as field3 . . .
>
> That will get rid of the :: prefixes and make the alias shareable with
> later pig code and not dependent on what you do in the group to filter data.
>
>
> On Jul 13, 2010, at 1:48 AM, Vincent Barat wrote:
>
> >  Actually you are right: the schema is the same, nevertheless, the
> > "naming" of the various columns in the schema is modified, and thus
> > my subsequent operations fail:
> >
> > original schema:
> > sessions: {sid: chararray,infoid: chararray,imei: chararray,start: long}
> >
> > modified schema:
> > sessions: {first::sid: chararray,first::infoid:
> > chararray,first::imei: chararray,first::start: long}
> >
> > Do you know a workaround ?
> >
> > Le 13/07/10 10:13, Mridul Muralidharan a écrit :
> >>
> >> The flatten will return the same schema as before (in 'first') :
> >> so unless you are modifying the fields or the order in which they
> >> are generated (which I dont think you are in view of your comment
> >> that it should work with and without this), you can simply go with :
> >>
> >> -- Or whatever works for you.
> >> %define PARALLELISM        '10'
> >>
> >> sessions = DISTINCT sessions PARALLEL $PARALLELISM;
> >>
> >> OR
> >>
> >> sessions = GROUP sessions BY sid  PARALLEL $PARALLELISM;
> >> sessions = FOREACH sessions { first = LIMIT sessions 1; GENERATE
> >> FLATTEN(first);};
> >>
> >>
> >>
> >>
> >> The schema at the end would be exactly same as start of the code
> >> snippet for 'sessions'.
> >>
> >>
> >> Regards,
> >> Mridul
> >>
> >>
> >>
> >> On Tuesday 13 July 2010 01:01 PM, Vincent Barat wrote:
> >>>
> >>>
> >>> Le 12/07/10 16:56, Mridul Muralidharan a écrit :
> >>>>
> >>>> I am not sure what you mean here exactly.
> >>>> Will a sid row have multiple (different) values for the other
> >>>> fields ?
> >>> Yes.
> >>>>
> >>>> But if you want to pick any one row for a given sid, then I think
> >>>> what you have below might be good enough (you can omit the last
> >>>> line though).
> >>> OK. Thanks. The last line is used to retrieve the exact same data
> >>> structure and naming as the original table. This way, I can
> >>> optionally perform this treatment without modifying my code. If you
> >>> know a better way...
> >>>
> >>> Cheers,
> >>>
> >>>>
> >>>> Regards,
> >>>> Mridul
> >>>>
> >>>>
> >>>>
> >>>> On Monday 12 July 2010 06:53 PM, Vincent Barat wrote:
> >>>>>    Hello everybody,
> >>>>>
> >>>>> I have a simple table containing sessions. Each sessions has an
> >>>>> unique key (the sid, which is actually a uuid).
> >>>>> But a session can be present several times in my input table.
> >>>>>
> >>>>> I want to ensure that I only have 1 record for each sid (because I
> >>>>> perform subsequent JOIN based on this sid).
> >>>>>
> >>>>> Currently I use the following script, but I wonder if there is
> >>>>> something more efficient:
> >>>>>
> >>>>> sessions = GROUP sessions BY sid;
> >>>>> sessions = FOREACH sessions { first = LIMIT sessions 1; GENERATE
> >>>>> FLATTEN(first);};
> >>>>> sessions = FOREACH sessions GENERATE sid, .. and all the fields I
> >>>>> have in the session table...
> >>>>>
> >>>>> Do you see any optimization I can do, especially on the FLATTEN /
> >>>>> GENERATE part ?
> >>>>>
> >>>>> Thank you very much for your help.
> >>>>
> >>>>
> >>
> >>
>
>

Re: Any better way to ensure unicity ?

Posted by Scott Carey <sc...@richrelevance.com>.

I run into this situation all the time.  You have to do a FOREACH ... GENERATE projection at the end to rename everything.

The way aliases work in Pig, you quite often have to do 'renaming only' projections if you don't want to change other bits of code later on. After the group and limit:

sessions = FOREACH sessions GENERATE field1 as field1, field2 as field2, field3 as field3 . . .

That will get rid of the :: prefixes and make the alias shareable with later pig code and not dependent on what you do in the group to filter data.
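
Applied to the schema from Vincent's message, the renaming projection might look like the following. A sketch only: the input path and the alias names raw, by_sid, picked and renamed are placeholders, not from the thread.

```pig
-- Hypothetical input path; schema as quoted in Vincent's message.
raw = LOAD 'sessions' AS (sid:chararray, infoid:chararray, imei:chararray, start:long);

by_sid = GROUP raw BY sid;
picked = FOREACH by_sid { first = LIMIT raw 1; GENERATE FLATTEN(first); };

-- picked carries first::sid, first::infoid, ...; project them back to plain names.
renamed = FOREACH picked GENERATE
    first::sid    AS sid,
    first::infoid AS infoid,
    first::imei   AS imei,
    first::start  AS start;

DESCRIBE renamed;
```

DESCRIBE should then report the unprefixed field names, so later code written against the original schema keeps working.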


On Jul 13, 2010, at 1:48 AM, Vincent Barat wrote:

>  Actually you are right: the schema is the same, nevertheless, the 
> "naming" of the various columns in the schema is modified, and thus 
> my subsequent operations fail:
> 
> original schema:
> sessions: {sid: chararray,infoid: chararray,imei: chararray,start: long}
> 
> modified schema:
> sessions: {first::sid: chararray,first::infoid: 
> chararray,first::imei: chararray,first::start: long}
> 
> Do you know a workaround ?
> 
> On 13/07/10 10:13, Mridul Muralidharan wrote:
>> 
>> The flatten will return the same schema as before (in 'first') : 
>> so unless you are modifying the fields or the order in which they 
>> are generated (which I dont think you are in view of your comment 
>> that it should work with and without this), you can simply go with :
>> 
>> -- Or whatever works for you.
>> %define PARALLELISM        '10'
>> 
>> sessions = DISTINCT sessions PARALLEL $PARALLELISM;
>> 
>> OR
>> 
>> sessions = GROUP sessions BY sid  PARALLEL $PARALLELISM;
>> sessions = FOREACH sessions { first = LIMIT sessions 1; GENERATE 
>> FLATTEN(first);};
>> 
>> 
>> 
>> 
>> The schema at the end would be exactly same as start of the code 
>> snippet for 'sessions'.
>> 
>> 
>> Regards,
>> Mridul
>> 
>> 
>> 
>> On Tuesday 13 July 2010 01:01 PM, Vincent Barat wrote:
>>> 
>>> 
>>> Le 12/07/10 16:56, Mridul Muralidharan a écrit :
>>>> 
>>>> I am not sure what you mean here exactly.
>>>> Will a sid row have multiple (different) values for the other
>>>> fields ?
>>> Yes.
>>>> 
>>>> But if you want to pick any one row for a given sid, then I think
>>>> what you have below might be good enough (you can omit the last
>>>> line though).
>>> OK. Thanks. The last line is used to retrieve the exact same data
>>> structure and naming as the original table. This way, I can
>>> optionally perform this treatment without modifying my code. If you
>>> know a better way...
>>> 
>>> Cheers,
>>> 
>>>> 
>>>> Regards,
>>>> Mridul
>>>> 
>>>> 
>>>> 
>>>> On Monday 12 July 2010 06:53 PM, Vincent Barat wrote:
>>>>>    Hello everybody,
>>>>> 
>>>>> I have a simple table containing sessions. Each sessions has an
>>>>> unique key (the sid, which is actually a uuid).
>>>>> But a session can be present several times in my input table.
>>>>> 
>>>>> I want to ensure that I only have 1 record for each sid (because I
>>>>> perform subsequent JOIN based on this sid).
>>>>> 
>>>>> Currently I use the following script, but I wonder if there is
>>>>> something more efficient:
>>>>> 
>>>>> sessions = GROUP sessions BY sid;
>>>>> sessions = FOREACH sessions { first = LIMIT sessions 1; GENERATE
>>>>> FLATTEN(first);};
>>>>> sessions = FOREACH sessions GENERATE sid, .. and all the fields I
>>>>> have in the session table...
>>>>> 
>>>>> Do you see any optimization I can do, especially on the FLATTEN /
>>>>> GENERATE part ?
>>>>> 
>>>>> Thank you very much for your help.
>>>> 
>>>> 
>> 
>>

Re: Any better way to ensure unicity ?

Posted by Mridul Muralidharan <mr...@yahoo-inc.com>.

Then project out the fields you need and do a DISTINCT?

Like:
sessions = FOREACH sessions GENERATE required_fields;
sessions = DISTINCT sessions PARALLEL $PARALLELISM;


assuming you don't need the timestamp, of course.
If you do, then the group route might be the only option ...
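
With the schema quoted elsewhere in this thread, that could look like the sketch below. It assumes start is the timestamp that differs between duplicates and that all other fields are identical for a given sid; the input path and alias names are placeholders, and the parallelism of 10 is just the example value used earlier in the thread.

```pig
-- Hypothetical input path; schema as quoted in this thread.
raw = LOAD 'sessions' AS (sid:chararray, infoid:chararray, imei:chararray, start:long);

-- Drop the field that varies between duplicates (assumed to be start).
projected = FOREACH raw GENERATE sid, infoid, imei;

-- Rows that are now identical collapse to one.
deduped = DISTINCT projected PARALLEL 10;
```

If any remaining field can differ between two rows with the same sid, DISTINCT will keep both, and the GROUP / LIMIT approach is the one that guarantees a single row per sid.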

Regards,
Mridul

On Tuesday 13 July 2010 05:57 PM, Vincent Barat wrote:
>    Yes. I would have used DISTINCT too, but I cannot, since some of
> the other fields can be different (the timestamp actually).
>
> Thanks for your help.
>
> On 13/07/10 11:06, Mridul Muralidharan wrote:
>>
>> I am not sure why the prefix 'first' is coming in ... someone from
>> pig team can comment better.
>> Though personally, I would use distinct over
>> group/foreach/limit/flatten combination.
>
>
>>
>>
>> Regards,
>> Mridul
>>
>>
>> On Tuesday 13 July 2010 02:18 PM, Vincent Barat wrote:
>>>     Actually you are right: the schema is the same, nevertheless, the
>>> "naming" of the various columns in the schema is modified, and thus
>>> my subsequent operations fail:
>>>
>>> original schema:
>>> sessions: {sid: chararray,infoid: chararray,imei:
>>> chararray,start: long}
>>>
>>> modified schema:
>>> sessions: {first::sid: chararray,first::infoid:
>>> chararray,first::imei: chararray,first::start: long}
>>>
>>> Do you know a workaround ?
>>>
>>> Le 13/07/10 10:13, Mridul Muralidharan a écrit :
>>>>
>>>> The flatten will return the same schema as before (in 'first') :
>>>> so unless you are modifying the fields or the order in which they
>>>> are generated (which I dont think you are in view of your comment
>>>> that it should work with and without this), you can simply go
>>>> with :
>>>>
>>>> -- Or whatever works for you.
>>>> %define PARALLELISM        '10'
>>>>
>>>> sessions = DISTINCT sessions PARALLEL $PARALLELISM;
>>>>
>>>> OR
>>>>
>>>> sessions = GROUP sessions BY sid  PARALLEL $PARALLELISM;
>>>> sessions = FOREACH sessions { first = LIMIT sessions 1; GENERATE
>>>> FLATTEN(first);};
>>>>
>>>>
>>>>
>>>>
>>>> The schema at the end would be exactly same as start of the code
>>>> snippet for 'sessions'.
>>>>
>>>>
>>>> Regards,
>>>> Mridul
>>>>
>>>>
>>>>
>>>> On Tuesday 13 July 2010 01:01 PM, Vincent Barat wrote:
>>>>>
>>>>>
>>>>> Le 12/07/10 16:56, Mridul Muralidharan a écrit :
>>>>>>
>>>>>> I am not sure what you mean here exactly.
>>>>>> Will a sid row have multiple (different) values for the other
>>>>>> fields ?
>>>>> Yes.
>>>>>>
>>>>>> But if you want to pick any one row for a given sid, then I think
>>>>>> what you have below might be good enough (you can omit the last
>>>>>> line though).
>>>>> OK. Thanks. The last line is used to retrieve the exact same data
>>>>> structure and naming as the original table. This way, I can
>>>>> optionally perform this treatment without modifying my code. If
>>>>> you
>>>>> know a better way...
>>>>>
>>>>> Cheers,
>>>>>
>>>>>>
>>>>>> Regards,
>>>>>> Mridul
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Monday 12 July 2010 06:53 PM, Vincent Barat wrote:
>>>>>>>       Hello everybody,
>>>>>>>
>>>>>>> I have a simple table containing sessions. Each sessions has an
>>>>>>> unique key (the sid, which is actually a uuid).
>>>>>>> But a session can be present several times in my input table.
>>>>>>>
>>>>>>> I want to ensure that I only have 1 record for each sid
>>>>>>> (because I
>>>>>>> perform subsequent JOIN based on this sid).
>>>>>>>
>>>>>>> Currently I use the following script, but I wonder if there is
>>>>>>> something more efficient:
>>>>>>>
>>>>>>> sessions = GROUP sessions BY sid;
>>>>>>> sessions = FOREACH sessions { first = LIMIT sessions 1; GENERATE
>>>>>>> FLATTEN(first);};
>>>>>>> sessions = FOREACH sessions GENERATE sid, .. and all the
>>>>>>> fields I
>>>>>>> have in the session table...
>>>>>>>
>>>>>>> Do you see any optimization I can do, especially on the
>>>>>>> FLATTEN /
>>>>>>> GENERATE part ?
>>>>>>>
>>>>>>> Thank you very much for your help.
>>>>>>
>>>>>>
>>>>
>>>>
>>
>>

Re: Any better way to ensure unicity ?

Posted by Vincent Barat <vb...@ubikod.com>.

  Yes. I would have used DISTINCT too, but I cannot, since some of 
the other fields can be different (the timestamp actually).

Thanks for your help.

On 13/07/10 11:06, Mridul Muralidharan wrote:
>
> I am not sure why the prefix 'first' is coming in ... someone from 
> pig team can comment better.
> Though personally, I would use distinct over 
> group/foreach/limit/flatten combination.


>
>
> Regards,
> Mridul
>
>
> On Tuesday 13 July 2010 02:18 PM, Vincent Barat wrote:
>>    Actually you are right: the schema is the same, nevertheless, the
>> "naming" of the various columns in the schema is modified, and thus
>> my subsequent operations fail:
>>
>> original schema:
>> sessions: {sid: chararray,infoid: chararray,imei: 
>> chararray,start: long}
>>
>> modified schema:
>> sessions: {first::sid: chararray,first::infoid:
>> chararray,first::imei: chararray,first::start: long}
>>
>> Do you know a workaround ?
>>
>> Le 13/07/10 10:13, Mridul Muralidharan a écrit :
>>>
>>> The flatten will return the same schema as before (in 'first') :
>>> so unless you are modifying the fields or the order in which they
>>> are generated (which I dont think you are in view of your comment
>>> that it should work with and without this), you can simply go 
>>> with :
>>>
>>> -- Or whatever works for you.
>>> %define PARALLELISM        '10'
>>>
>>> sessions = DISTINCT sessions PARALLEL $PARALLELISM;
>>>
>>> OR
>>>
>>> sessions = GROUP sessions BY sid  PARALLEL $PARALLELISM;
>>> sessions = FOREACH sessions { first = LIMIT sessions 1; GENERATE
>>> FLATTEN(first);};
>>>
>>>
>>>
>>>
>>> The schema at the end would be exactly same as start of the code
>>> snippet for 'sessions'.
>>>
>>>
>>> Regards,
>>> Mridul
>>>
>>>
>>>
>>> On Tuesday 13 July 2010 01:01 PM, Vincent Barat wrote:
>>>>
>>>>
>>>> Le 12/07/10 16:56, Mridul Muralidharan a écrit :
>>>>>
>>>>> I am not sure what you mean here exactly.
>>>>> Will a sid row have multiple (different) values for the other
>>>>> fields ?
>>>> Yes.
>>>>>
>>>>> But if you want to pick any one row for a given sid, then I think
>>>>> what you have below might be good enough (you can omit the last
>>>>> line though).
>>>> OK. Thanks. The last line is used to retrieve the exact same data
>>>> structure and naming as the original table. This way, I can
>>>> optionally perform this treatment without modifying my code. If 
>>>> you
>>>> know a better way...
>>>>
>>>> Cheers,
>>>>
>>>>>
>>>>> Regards,
>>>>> Mridul
>>>>>
>>>>>
>>>>>
>>>>> On Monday 12 July 2010 06:53 PM, Vincent Barat wrote:
>>>>>>      Hello everybody,
>>>>>>
>>>>>> I have a simple table containing sessions. Each sessions has an
>>>>>> unique key (the sid, which is actually a uuid).
>>>>>> But a session can be present several times in my input table.
>>>>>>
>>>>>> I want to ensure that I only have 1 record for each sid 
>>>>>> (because I
>>>>>> perform subsequent JOIN based on this sid).
>>>>>>
>>>>>> Currently I use the following script, but I wonder if there is
>>>>>> something more efficient:
>>>>>>
>>>>>> sessions = GROUP sessions BY sid;
>>>>>> sessions = FOREACH sessions { first = LIMIT sessions 1; GENERATE
>>>>>> FLATTEN(first);};
>>>>>> sessions = FOREACH sessions GENERATE sid, .. and all the 
>>>>>> fields I
>>>>>> have in the session table...
>>>>>>
>>>>>> Do you see any optimization I can do, especially on the 
>>>>>> FLATTEN /
>>>>>> GENERATE part ?
>>>>>>
>>>>>> Thank you very much for your help.
>>>>>
>>>>>
>>>
>>>
>
>

Re: Any better way to ensure unicity ?

Posted by Mridul Muralidharan <mr...@yahoo-inc.com>.

I am not sure why the prefix 'first' is coming in ... someone from the Pig
team can comment better.
Though personally, I would use DISTINCT over the group/foreach/limit/flatten
combination.


Regards,
Mridul


On Tuesday 13 July 2010 02:18 PM, Vincent Barat wrote:
>    Actually you are right: the schema is the same, nevertheless, the
> "naming" of the various columns in the schema is modified, and thus
> my subsequent operations fail:
>
> original schema:
> sessions: {sid: chararray,infoid: chararray,imei: chararray,start: long}
>
> modified schema:
> sessions: {first::sid: chararray,first::infoid:
> chararray,first::imei: chararray,first::start: long}
>
> Do you know a workaround ?
>
> On 13/07/10 10:13, Mridul Muralidharan wrote:
>>
>> The flatten will return the same schema as before (in 'first') :
>> so unless you are modifying the fields or the order in which they
>> are generated (which I dont think you are in view of your comment
>> that it should work with and without this), you can simply go with :
>>
>> -- Or whatever works for you.
>> %define PARALLELISM        '10'
>>
>> sessions = DISTINCT sessions PARALLEL $PARALLELISM;
>>
>> OR
>>
>> sessions = GROUP sessions BY sid  PARALLEL $PARALLELISM;
>> sessions = FOREACH sessions { first = LIMIT sessions 1; GENERATE
>> FLATTEN(first);};
>>
>>
>>
>>
>> The schema at the end would be exactly same as start of the code
>> snippet for 'sessions'.
>>
>>
>> Regards,
>> Mridul
>>
>>
>>
>> On Tuesday 13 July 2010 01:01 PM, Vincent Barat wrote:
>>>
>>>
>>> Le 12/07/10 16:56, Mridul Muralidharan a écrit :
>>>>
>>>> I am not sure what you mean here exactly.
>>>> Will a sid row have multiple (different) values for the other
>>>> fields ?
>>> Yes.
>>>>
>>>> But if you want to pick any one row for a given sid, then I think
>>>> what you have below might be good enough (you can omit the last
>>>> line though).
>>> OK. Thanks. The last line is used to retrieve the exact same data
>>> structure and naming as the original table. This way, I can
>>> optionally perform this treatment without modifying my code. If you
>>> know a better way...
>>>
>>> Cheers,
>>>
>>>>
>>>> Regards,
>>>> Mridul
>>>>
>>>>
>>>>
>>>> On Monday 12 July 2010 06:53 PM, Vincent Barat wrote:
>>>>>      Hello everybody,
>>>>>
>>>>> I have a simple table containing sessions. Each sessions has an
>>>>> unique key (the sid, which is actually a uuid).
>>>>> But a session can be present several times in my input table.
>>>>>
>>>>> I want to ensure that I only have 1 record for each sid (because I
>>>>> perform subsequent JOIN based on this sid).
>>>>>
>>>>> Currently I use the following script, but I wonder if there is
>>>>> something more efficient:
>>>>>
>>>>> sessions = GROUP sessions BY sid;
>>>>> sessions = FOREACH sessions { first = LIMIT sessions 1; GENERATE
>>>>> FLATTEN(first);};
>>>>> sessions = FOREACH sessions GENERATE sid, .. and all the fields I
>>>>> have in the session table...
>>>>>
>>>>> Do you see any optimization I can do, especially on the FLATTEN /
>>>>> GENERATE part ?
>>>>>
>>>>> Thank you very much for your help.
>>>>
>>>>
>>
>>

Re: Any better way to ensure unicity ?

Posted by Vincent Barat <vb...@ubikod.com>.

Actually you are right: the schema is the same. Nevertheless, the 
naming of the columns in the schema is modified, and thus 
my subsequent operations fail:

original schema:
sessions: {sid: chararray,infoid: chararray,imei: chararray,start: long}

modified schema:
sessions: {first::sid: chararray,first::infoid: chararray,first::imei: chararray,first::start: long}

Do you know a workaround?

On 13/07/10 10:13, Mridul Muralidharan wrote:
>
> The flatten will return the same schema as before (in 'first') : 
> so unless you are modifying the fields or the order in which they 
> are generated (which I dont think you are in view of your comment 
> that it should work with and without this), you can simply go with :
>
> -- Or whatever works for you.
> %default PARALLELISM        '10'
>
> sessions = DISTINCT sessions PARALLEL $PARALLELISM;
>
> OR
>
> sessions = GROUP sessions BY sid  PARALLEL $PARALLELISM;
> sessions = FOREACH sessions { first = LIMIT sessions 1; GENERATE 
> FLATTEN(first);};
>
>
>
>
> The schema at the end would be exactly same as start of the code 
> snippet for 'sessions'.
>
>
> Regards,
> Mridul
>
>
>
> On Tuesday 13 July 2010 01:01 PM, Vincent Barat wrote:
>>
>>
>> On 12/07/10 16:56, Mridul Muralidharan wrote:
>>>
>>> I am not sure what you mean here exactly.
>>> Will a sid row have multiple (different) values for the other
>>> fields ?
>> Yes.
>>>
>>> But if you want to pick any one row for a given sid, then I think
>>> what you have below might be good enough (you can omit the last
>>> line though).
>> OK. Thanks. The last line is used to retrieve the exact same data
>> structure and naming as the original table. This way, I can
>> optionally perform this treatment without modifying my code. If you
>> know a better way...
>>
>> Cheers,
>>
>>>
>>> Regards,
>>> Mridul
>>>
>>>
>>>
>>> On Monday 12 July 2010 06:53 PM, Vincent Barat wrote:
>>>>     Hello everybody,
>>>>
>>>> I have a simple table containing sessions. Each sessions has an
>>>> unique key (the sid, which is actually a uuid).
>>>> But a session can be present several times in my input table.
>>>>
>>>> I want to ensure that I only have 1 record for each sid (because I
>>>> perform subsequent JOIN based on this sid).
>>>>
>>>> Currently I use the following script, but I wonder if there is
>>>> something more efficient:
>>>>
>>>> sessions = GROUP sessions BY sid;
>>>> sessions = FOREACH sessions { first = LIMIT sessions 1; GENERATE
>>>> FLATTEN(first);};
>>>> sessions = FOREACH sessions GENERATE sid, .. and all the fields I
>>>> have in the session table...
>>>>
>>>> Do you see any optimization I can do, especially on the FLATTEN /
>>>> GENERATE part ?
>>>>
>>>> Thank you very much for your help.
>>>
>>>
>
>

Re: Any better way to ensure unicity ?

Posted by Mridul Muralidharan <mr...@YAHOO-INC.COM>.

The FLATTEN will return the same schema as before (in 'first'). So 
unless you are modifying the fields or the order in which they are 
generated (which I don't think you are, given your comment that it 
should work with and without this step), you can simply go with:

-- Or whatever value works for you. Pig parameters are set with
-- %default or %declare (there is no %define statement).
%default PARALLELISM '10'

sessions = DISTINCT sessions PARALLEL $PARALLELISM;

OR

sessions = GROUP sessions BY sid PARALLEL $PARALLELISM;
sessions = FOREACH sessions {
    first = LIMIT sessions 1;
    GENERATE FLATTEN(first);
};




The schema at the end would be exactly the same as at the start of the 
code snippet for 'sessions'.


Regards,
Mridul



On Tuesday 13 July 2010 01:01 PM, Vincent Barat wrote:
>
>
> On 12/07/10 16:56, Mridul Muralidharan wrote:
>>
>> I am not sure what you mean here exactly.
>> Will a sid row have multiple (different) values for the other
>> fields ?
> Yes.
>>
>> But if you want to pick any one row for a given sid, then I think
>> what you have below might be good enough (you can omit the last
>> line though).
> OK. Thanks. The last line is used to retrieve the exact same data
> structure and naming as the original table. This way, I can
> optionally perform this treatment without modifying my code. If you
> know a better way...
>
> Cheers,
>
>>
>> Regards,
>> Mridul
>>
>>
>>
>> On Monday 12 July 2010 06:53 PM, Vincent Barat wrote:
>>>     Hello everybody,
>>>
>>> I have a simple table containing sessions. Each sessions has an
>>> unique key (the sid, which is actually a uuid).
>>> But a session can be present several times in my input table.
>>>
>>> I want to ensure that I only have 1 record for each sid (because I
>>> perform subsequent JOIN based on this sid).
>>>
>>> Currently I use the following script, but I wonder if there is
>>> something more efficient:
>>>
>>> sessions = GROUP sessions BY sid;
>>> sessions = FOREACH sessions { first = LIMIT sessions 1; GENERATE
>>> FLATTEN(first);};
>>> sessions = FOREACH sessions GENERATE sid, .. and all the fields I
>>> have in the session table...
>>>
>>> Do you see any optimization I can do, especially on the FLATTEN /
>>> GENERATE part ?
>>>
>>> Thank you very much for your help.
>>
>>

Re: Any better way to ensure unicity ?

Posted by Vincent Barat <vb...@ubikod.com>.


On 12/07/10 16:56, Mridul Muralidharan wrote:
>
> I am not sure what you mean here exactly.
> Will a sid row have multiple (different) values for the other 
> fields ?
Yes.
>
> But if you want to pick any one row for a given sid, then I think 
> what you have below might be good enough (you can omit the last 
> line though).
OK. Thanks. The last line is used to restore the exact same data 
structure and naming as the original table. This way, I can 
optionally perform this deduplication step without modifying the 
rest of my code. If you know a better way...

Cheers,

>
> Regards,
> Mridul
>
>
>
> On Monday 12 July 2010 06:53 PM, Vincent Barat wrote:
>>    Hello everybody,
>>
>> I have a simple table containing sessions. Each sessions has an
>> unique key (the sid, which is actually a uuid).
>> But a session can be present several times in my input table.
>>
>> I want to ensure that I only have 1 record for each sid (because I
>> perform subsequent JOIN based on this sid).
>>
>> Currently I use the following script, but I wonder if there is
>> something more efficient:
>>
>> sessions = GROUP sessions BY sid;
>> sessions = FOREACH sessions { first = LIMIT sessions 1; GENERATE
>> FLATTEN(first);};
>> sessions = FOREACH sessions GENERATE sid, .. and all the fields I
>> have in the session table...
>>
>> Do you see any optimization I can do, especially on the FLATTEN /
>> GENERATE part ?
>>
>> Thank you very much for your help.
>
>

Re: Any better way to ensure unicity ?

Posted by Mridul Muralidharan <mr...@yahoo-inc.com>.

I am not sure what you mean here exactly.
Will a sid row have multiple (different) values for the other fields ?


If not, that is, if the repeated rows are simply exact duplicates, you 
can use DISTINCT to achieve what you require:

sessions = DISTINCT sessions PARALLEL $PARALLELISM;



But if you want to pick any one row for a given sid, then I think what 
you have below might be good enough (you can omit the last line though).
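
The two cases above can be sketched side by side. This is only an 
illustration under stated assumptions: the input path 'sessions.tsv', the 
aliases 'exact_dedup', 'by_sid' and 'one_per_sid', and the parallelism 
value 10 are hypothetical, and the schema is the one quoted elsewhere in 
this thread.

```pig
-- Hypothetical input path; schema as quoted in this thread.
sessions = LOAD 'sessions.tsv'
           AS (sid:chararray, infoid:chararray, imei:chararray, start:long);

-- Case 1: repeated rows are identical in every field.
-- DISTINCT compares whole tuples, so exact duplicates collapse to one row.
exact_dedup = DISTINCT sessions PARALLEL 10;

-- Case 2: rows share a sid but differ in other fields.
-- DISTINCT would keep all of them, so group by sid and keep one
-- arbitrary row per group instead.
by_sid = GROUP sessions BY sid PARALLEL 10;
one_per_sid = FOREACH by_sid {
    first = LIMIT sessions 1;
    GENERATE FLATTEN(first);
};
```

Which row LIMIT keeps in case 2 is not specified, so this fits only when any 
row for a given sid is acceptable.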


Regards,
Mridul



On Monday 12 July 2010 06:53 PM, Vincent Barat wrote:
>    Hello everybody,
>
> I have a simple table containing sessions. Each sessions has an
> unique key (the sid, which is actually a uuid).
> But a session can be present several times in my input table.
>
> I want to ensure that I only have 1 record for each sid (because I
> perform subsequent JOIN based on this sid).
>
> Currently I use the following script, but I wonder if there is
> something more efficient:
>
> sessions = GROUP sessions BY sid;
> sessions = FOREACH sessions { first = LIMIT sessions 1; GENERATE
> FLATTEN(first);};
> sessions = FOREACH sessions GENERATE sid, .. and all the fields I
> have in the session table...
>
> Do you see any optimization I can do, especially on the FLATTEN /
> GENERATE part ?
>
> Thank you very much for your help.