Posted to user@pig.apache.org by hc busy <hc...@gmail.com> on 2010/04/23 20:48:26 UTC

How do I generate a row id?

Guys, is there an easy way to generate a unique row id that is guaranteed to
be unique?

R = foreach T generate *, globally_unique() as id;

The reason why I need this is because I have a really nasty memory problem
here and I can't perform a group on the entire row, so all I can resort to
is to split the alias into two aliases

A1 = foreach R generate keys, id;
A2 = foreach R generate values, id;


and operate on my A1, and then come back for the rest of the values later.


But it is important for the ids to be globally unique, so that
different mappers don't all start at 1. Any suggestions?


Thnx!

Re: How do I generate a row id?

Posted by hc busy <hc...@gmail.com>.
No, no you misunderstand. I didn't mean to contact zookeeper for every
single record.

Each map instance will contact zookeeper once for every X records
it sees. The mapper gets a block of numbers, and that block is available
only to that one mapper; the next mapper to request a block will get a
different one.

so this IDing system can work in Integer space (since I'm memory
constrained)... Hmmm, I thought pig or some parts of hadoop had already been
using zookeeper... anyways.... I guess it doesn't even have to be zookeeper,
just a transactional database...
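
A minimal Java sketch of the block scheme described above, under stated
assumptions: BlockSource and BlockIdGenerator are hypothetical names, and an
in-process AtomicLong stands in for the shared coordinator (zookeeper or a
transactional database). It shows only the allocation logic, not a real
client. It uses long; an int version would have the same shape.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for the shared coordinator (zookeeper or a
// transactional database in the thread). Here it is just an in-process counter.
final class BlockSource {
    private final AtomicLong nextBlockStart = new AtomicLong(0);
    private final long blockSize;

    BlockSource(long blockSize) {
        this.blockSize = blockSize;
    }

    long blockSize() {
        return blockSize;
    }

    // Atomically reserves one block and returns its first id.
    long reserveBlock() {
        return nextBlockStart.getAndAdd(blockSize);
    }
}

// One instance per mapper. It contacts the source once per block and
// hands out ids sequentially inside the block it currently owns.
final class BlockIdGenerator {
    private final BlockSource source;
    private long next; // next id to hand out
    private long end;  // exclusive end of the current block

    BlockIdGenerator(BlockSource source) {
        this.source = source;
    }

    long nextId() {
        if (next == end) { // no block yet, or current block exhausted
            next = source.reserveBlock();
            end = next + source.blockSize();
        }
        return next++;
    }
}
```

With a block size of 50k, each mapper would call reserveBlock() once per
50,000 records. Ids are unique across mappers but not contiguous, since a
mapper may finish without using up its last block.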

I'll probably end up using random numbers or UUID as you suggest..., after
trying the synchronized version. hehe ;-)



On Fri, Apr 23, 2010 at 12:23 PM, Alan Gates <ga...@yahoo-inc.com> wrote:

>
> On Apr 23, 2010, at 12:13 PM, hc busy wrote:
>
>> Is the Java class guaranteed to be unique? Or will I have to perform an
>> additional check after I join back?
>
> I'd check the Java docs, but AFAIK it is guaranteed.
>
> I don't know the performance of UUID vs Zookeeper, nor how Zookeeper
> generates its UUIDs.  You could ask on that list.  Pig does not currently
> have integration with Zookeeper.
>
> Alan.

Re: How do I generate a row id?

Posted by Alan Gates <ga...@yahoo-inc.com>.
On Apr 23, 2010, at 12:13 PM, hc busy wrote:

> Is the Java class guaranteed to be unique? Or will I have to perform an
> additional check after I join back?

I'd check the Java docs, but AFAIK it is guaranteed.

I don't know the performance of UUID vs Zookeeper, nor how Zookeeper  
generates its UUIDs.  You could ask on that list.  Pig does not  
currently have integration with Zookeeper.

Alan.

>
> I guess I see how I can connect to a zookeeper server inside my UDF to get a
> block of, say 50k, Id's at a time and sequentially increase within the
> block. Then the UDF connects again to get another block. This way I can get
> a guaranteed unique ID. (And it's probably faster and smaller this way than
> generating UUID)
>
> Does pig use zookeeper to do anything? Can I connect to that one if it does?


Re: How do I generate a row id?

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
You can certainly connect to zookeeper, but you don't really need to (relying
on zookeeper to do atomic increments may not scale if you are doing this for
millions of records, though I haven't done timings. Y! people?)

Just grab the task id from the jobconf and use it as a uuid prefix.  Details
about UUID uniqueness properties can be found here:
http://java.sun.com/j2se/1.5.0/docs/api/java/util/UUID.html
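
A minimal Java sketch of one reading of this suggestion: the task id is the
prefix and a per-task counter supplies the rest. TaskPrefixedIds is a
hypothetical name, and the task id is simply passed in as a string here; how
it is read from the jobconf is left out. The sketch assumes no two running
tasks share the same task id.

```java
// Builds ids of the form "<taskId>-<n>". Uniqueness across tasks rests on
// the assumption that each task is constructed with a distinct taskId.
final class TaskPrefixedIds {
    private final String taskId;
    private long counter; // counts records seen by this task only

    TaskPrefixedIds(String taskId) {
        this.taskId = taskId;
    }

    String next() {
        return taskId + "-" + counter++;
    }
}
```

No coordination between tasks is needed. The cost is a longer, string-valued
id compared with a plain integer.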

-D

On Fri, Apr 23, 2010 at 12:13 PM, hc busy <hc...@gmail.com> wrote:

> Is the Java class guaranteed to be unique? Or will I have to perform an
> additional check after I join back?
>
> I guess I see how I can connect to a zookeeper server inside my UDF to get a
> block of, say 50k, Id's at a time and sequentially increase within the
> block. Then the UDF connects again to get another block. This way I can get
> a guaranteed unique ID. (And it's probably faster and smaller this way than
> generating UUID)
>
> Does pig use zookeeper to do anything? Can I connect to that one if it
> does?

Re: How do I generate a row id?

Posted by hc busy <hc...@gmail.com>.
Is the Java class guaranteed to be unique? Or will I have to perform an
additional check after I join back?

I guess I see how I can connect to a zookeeper server inside my UDF to get a
block of, say 50k, Id's at a time and sequentially increase within the
block. Then the UDF connects again to get another block. This way I can get
a guaranteed unique ID. (And it's probably faster and smaller this way than
generating UUID)

Does pig use zookeeper to do anything? Can I connect to that one if it does?



On Fri, Apr 23, 2010 at 12:08 PM, Alan Gates <ga...@yahoo-inc.com> wrote:

> Unique identifiers are easy enough.  Row ids (monotonically increasing
> values) are impossible because of the parallel nature of map reduce.  If you
> just want to generate a unique identifier you can write a UDF to wrap Java's
> UUID class (or use the new GenericInvoker UDF if you're working off trunk).
>
> Alan.

Re: How do I generate a row id?

Posted by Alan Gates <ga...@yahoo-inc.com>.
Unique identifiers are easy enough.  Row ids (monotonically increasing  
values) are impossible because of the parallel nature of map reduce.   
If you just want to generate a unique identifier you can write a UDF  
to wrap Java's UUID class (or use the new GenericInvoker UDF if you're  
working off trunk).
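
A minimal plain-Java sketch of the core such a UDF would wrap. UuidRowId is a
hypothetical name, and the Pig UDF wrapper itself is left out so the example
needs only the JDK. It assumes random (version 4) UUIDs, which make a
collision extremely unlikely rather than strictly impossible.

```java
import java.util.UUID;

// Plain-Java core of a hypothetical row-id UDF. The Pig wrapper is omitted.
final class UuidRowId {
    private UuidRowId() {
    }

    // Returns a random (version 4) UUID as a 36-character string.
    static String next() {
        return UUID.randomUUID().toString();
    }

    public static void main(String[] args) {
        // Two calls print two different ids.
        System.out.println(next());
        System.out.println(next());
    }
}
```

Each call is independent of every other mapper, so nothing has to be
coordinated. The ids carry no ordering and are larger than an integer.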

Alan.

On Apr 23, 2010, at 11:48 AM, hc busy wrote:

> Guys, is there an easy way to generate a unique row id that is guaranteed to
> be unique?
>
> R = foreach T generate *, globally_unique() as id;
>
> The reason why I need this is because I have a really nasty memory problem
> here and I can't perform a group on the entire row, so all I can resort to
> is to split the alias into two aliases
>
> A1 = foreach R generate keys, id;
> A2 = foreach R generate values, id;
>
> and operate on my A1, and then come back for the rest of the values later.
>
> But it is important for the ids to be globally unique, so that
> different mappers don't all start at 1. Any suggestions?
>
> Thnx!