Posted to user@zookeeper.apache.org by Andrew Ebaugh <ae...@real.com> on 2011/03/01 02:14:53 UTC

Re: Zookeeper for generating sequential IDs

Getting a bit into the Cassandra weeds, but what about using a super column
and TimeUUIDType keys? IMO splitting the data for one unique item across
multiple manipulated keys sounds complex, and is exactly what a super column
was made for.

So instead of having:

TimeIDA-Name -> {name column data}
TimeIDA-Blah -> {blah column data}
TimeIDB-Name -> ...

You'd have:

TimeIDA ->
 {
  Name -> {name column data}
  Blah -> {blah column data}
 }

TimeIDB ->
  ...


This would give you the advantage of being able to query key slices based
on time ranges.
Here's a good article (seems a bit outdated for 0.7):
http://arin.me/blog/wtf-is-a-supercolumn-cassandra-data-model
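A rough sketch of the two layouts in plain Python, with dicts standing in for the column family and `uuid1()` standing in for TimeUUIDType keys (names like `make_row_super` are mine, purely illustrative):

```python
from uuid import uuid1

# Flat layout: one row per (entity, category), key = "<TimeUUID>-<category>"
def make_rows_flat(entity_key, data):
    return {"%s-%s" % (entity_key, category): columns
            for category, columns in data.items()}

# Super-column layout: one row per entity, categories nested inside it
def make_row_super(entity_key, data):
    return {entity_key: data}

tid_a = uuid1()   # TimeUUIDs carry their creation time
tid_b = uuid1()

cf = {}
cf.update(make_row_super(tid_a, {"Name": {"value": "first"},
                                 "Blah": {"value": "x"}}))
cf.update(make_row_super(tid_b, {"Name": {"value": "second"}}))

# With TimeUUID row keys, a key slice over a time range returns whole
# entities with all their categories in one go:
def key_slice(cf, start, end):
    return {k: v for k, v in cf.items()
            if start.time <= k.time <= end.time}
```

In the flat layout the same time-range query would return key fragments ("TimeIDA-Name", "TimeIDA-Blah", ...) that the client then has to stitch back together per entity.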



On 2/28/11 9:50 AM, "Ertio Lew" <er...@gmail.com> wrote:

>Thanks Jeff !
>
>Your point is truly valid! However, my idea is not to store
>information about the data/entities in the ID, but to split the
>data of an entity into several rows (according to the category of
>that data) in the same CF in Cassandra.
>So, e.g., if you want to split the information about a tweet into two
>rows according to the type of information, then you want two keys
>generated from the same ID.
>
>For this purpose you definitely need some kind of manipulation of
>your IDs; otherwise you cannot split the data for a particular
>entity (in the same CF) into two rows according to data category.
>Of course you could also suggest storing different types of data in
>different CFs, but sometimes it is better to keep a limit on the
>number of CFs in Cassandra.
>
>Regards
>Ertio Lew
>
>
>
>On Mon, Feb 28, 2011 at 10:50 PM, Jeff Hodges <jh...@twitter.com> wrote:
>> Also, feel free to mock me for the phrase "identifying id".
>>
>> On Mon, Feb 28, 2011 at 9:04 AM, Jeff Hodges <jh...@twitter.com>
>>wrote:
>>> If you patch snowflake to remove 4 bits from the timestamp section,
>>> you will cut the time it takes before the generated IDs overflow
>>> the JVM's 63-bit limit from about 70 years (2 ** 41 milliseconds) to a
>>> little over 4 years (2 ** 37 milliseconds). This is likely
>>> unacceptable for your use case.
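The arithmetic is easy to check: 41 timestamp bits cover 2 ** 41 milliseconds, and dropping 4 bits leaves 2 ** 37:

```python
MS_PER_YEAR = 1000 * 60 * 60 * 24 * 365.25

years_41 = 2 ** 41 / MS_PER_YEAR   # ~69.7 years until the timestamp wraps
years_37 = 2 ** 37 / MS_PER_YEAR   # ~4.35 years
```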
>>>
>>> However, the larger point to discuss is that encoding additional
>>> information about your data in the identifying id is, in general, a
>>> bad idea. It means your architecture is strictly coupled to your
>>> current and likely less-than-perfect understanding of the problem and
>>> makes it harder to iterate towards a better one. For instance, we had
>>> to rewrite certain parts of our search infrastructure when migrating
>>> to snowflake because it had assumed that the generated id space of
>>> tweets was uniform across time.
>>>
>>> But, of course, I'm just some dude on the internet who doesn't know
>>> your particular problem or design in detail. God speed and good luck.
>>>
>>> On Mon, Feb 28, 2011 at 8:35 AM, Ertio Lew <er...@gmail.com> wrote:
>>>> Yes, I think we could perhaps reduce the millisecond-precision
>>>> timestamp it provides (I think 41 bits) to an appropriate extent to
>>>> match our needs.
>>>>
>>>> On Mon, Feb 28, 2011 at 9:38 PM, Ted Dunning <te...@gmail.com>
>>>>wrote:
>>>>> So patch it!
>>>>>
>>>>> On Mon, Feb 28, 2011 at 7:59 AM, Ertio Lew <er...@gmail.com>
>>>>>wrote:
>>>>>
>>>>>> First, it does not start at 0, since it comprises a timestamp,
>>>>>> workerId and noOfGeneratedIds; thus it is not sequential. Secondly,
>>>>>> if I insert my 4 bits into this ID, I risk overwriting an ID that
>>>>>> it has already created.
>>>>>>
>>>>>> On Mon, Feb 28, 2011 at 9:16 PM, Ted Dunning <te...@gmail.com>
>>>>>> wrote:
>>>>>> > Uh.... any sequential generator that starts at zero will take a
>>>>>> > LONG time until it generates a value > 2^60.
>>>>>> >
>>>>>> > If you generate a million IDs per second (= 2^20), it will be
>>>>>> > more than 30,000 years before you get past 2^60.
>>>>>> >
>>>>>> > Is this *really* a problem?
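Ted's estimate checks out: at 2 ** 20 IDs per second it takes 2 ** 40 seconds to pass 2 ** 60:

```python
SECONDS_PER_YEAR = 60 * 60 * 24 * 365.25

# 2**60 IDs at 2**20 IDs/second = 2**40 seconds of generation
years_to_exhaust = (2 ** 60 / 2 ** 20) / SECONDS_PER_YEAR  # ~35,000 years
```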
>>>>>> >
>>>>>> > On Mon, Feb 28, 2011 at 7:25 AM, Ertio Lew <er...@gmail.com>
>>>>>>wrote:
>>>>>> >
>>>>>> >> Could you recommend any other ID generator that could provide
>>>>>> >> increasing IDs (not necessarily sequential) of size <= 60 bits?
>>>>>> >>
>>>>>> >> Thanks
>>>>>> >>
>>>>>> >>
>>>>>> >> On Mon, Feb 28, 2011 at 8:30 PM, Ertio Lew <er...@gmail.com>
>>>>>>wrote:
>>>>>> >> > Thanks Patrick,
>>>>>> >> >
>>>>>> >> > I considered your suggestion, but sadly it does not fit my
>>>>>> >> > use case. I am looking for a solution that generates 64-bit
>>>>>> >> > IDs, but within those 64 bits I would like at least 4 free
>>>>>> >> > bits that I could use to distinguish the type of data for a
>>>>>> >> > particular entity in the same column family.
>>>>>> >> >
>>>>>> >> > If I could keep snowflake's ID size to around 60 bits, that
>>>>>> >> > would have been great.
>>>>>> >> >
>>>>>> >> >
>>>>>> >> >
>>>>>> >> >
>>>>>> >> >
>>>>>> >> > On Sat, Feb 26, 2011 at 5:13 AM, Patrick Hunt
>>>>>><ph...@apache.org>
>>>>>> wrote:
>>>>>> >> >> Keep in mind that blog post is pretty old. I see comments
>>>>>> >> >> like this in the commit log:
>>>>>> >> >>
>>>>>> >> >> "hard to call it alpha/experimental after serving billions of
>>>>>> >> >> ids"
>>>>>> >> >>
>>>>>> >> >> so it seems it's in production at twitter at least...
>>>>>> >> >>
>>>>>> >> >> Patrick
>>>>>> >> >>
>>>>>> >> >> On Fri, Feb 25, 2011 at 2:58 PM, Ertio Lew
>>>>>><er...@gmail.com>
>>>>>> wrote:
>>>>>> >> >>> Thanks Patrick,
>>>>>> >> >>>
>>>>>> >> >>> The fact that it is still in the alpha stage and twitter is
>>>>>> >> >>> not yet using it makes me look at other solutions as well,
>>>>>> >> >>> ones which have a larger community/user base & are more
>>>>>> >> >>> mature.
>>>>>> >> >>>
>>>>>> >> >>> I do not know whether snowflake is being used in
>>>>>> >> >>> production by anyone..
>>>>>> >> >>>
>>>>>> >> >>>
>>>>>> >> >>>
>>>>>> >> >>>
>>>>>> >> >>> On Fri, Feb 25, 2011 at 11:21 PM, Patrick Hunt
>>>>>><ph...@apache.org>
>>>>>> >> wrote:
>>>>>> >> >>>> Have you looked at snowflake?
>>>>>> >> >>>>
>>>>>> >> >>>> http://engineering.twitter.com/2010/06/announcing-snowflake.html
>>>>>> >> >>>>
>>>>>> >> >>>> Patrick
>>>>>> >> >>>>
>>>>>> >> >>>> On Fri, Feb 25, 2011 at 9:43 AM, Ted Dunning <
>>>>>> ted.dunning@gmail.com>
>>>>>> >> wrote:
>>>>>> >> >>>>> If your IDs don't need to be exactly sequential, or if the
>>>>>> >> >>>>> generation rate is less than a few thousand per second, ZK
>>>>>> >> >>>>> is a fine choice.
>>>>>> >> >>>>>
>>>>>> >> >>>>> To get very high generation rates, what is typically done
>>>>>> >> >>>>> is to allocate blocks of IDs using ZK and then allocate out
>>>>>> >> >>>>> of each block locally.  This can leave you with a slightly
>>>>>> >> >>>>> swiss-cheesed ID space, and it means that the ordering of
>>>>>> >> >>>>> IDs only approximates the time ordering of when the IDs
>>>>>> >> >>>>> were assigned.  Neither of these is typically a problem.
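A minimal sketch of that block scheme (the ZK side is stubbed with a plain in-process counter; in a real deployment `reserve_block` would bump a counter stored in a znode, e.g. with a versioned read-increment-write):

```python
class BlockAllocator:
    """Hand out IDs from a locally cached block; hit the coordinator
    (ZooKeeper in Ted's scheme, stubbed below) only when it runs dry."""

    def __init__(self, reserve_block, block_size=1000):
        self.reserve_block = reserve_block   # returns start of a fresh block
        self.block_size = block_size
        self.next_id = 0
        self.block_end = 0                   # exclusive; 0 forces a fetch

    def allocate(self):
        if self.next_id >= self.block_end:   # block exhausted (or first call)
            start = self.reserve_block(self.block_size)
            self.next_id, self.block_end = start, start + self.block_size
        nid = self.next_id
        self.next_id += 1
        return nid

# Stand-in for the ZK-backed counter: atomically reserve n IDs.
_counter = [0]
def reserve_block(n):
    start = _counter[0]
    _counter[0] += n
    return start
```

Two allocators drawing from the same counter produce exactly the "swiss-cheesed", roughly time-ordered ID space described above: each drains its own block while the other's block sits partly unused.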
>>>>>> >> >>>>>
>>>>>> >> >>>>> On Fri, Feb 25, 2011 at 1:50 AM, Ertio Lew
>>>>>><er...@gmail.com>
>>>>>> >> wrote:
>>>>>> >> >>>>>
>>>>>> >> >>>>>> Hi all,
>>>>>> >> >>>>>>
>>>>>> >> >>>>>> I am involved in a project where we're building a social
>>>>>> >> >>>>>> application using Cassandra and Java. I am looking for a
>>>>>> >> >>>>>> solution to generate unique sequential IDs for the content
>>>>>> >> >>>>>> on the application. Some people have suggested I have a
>>>>>> >> >>>>>> look at ZooKeeper for this. I would highly appreciate it
>>>>>> >> >>>>>> if anyone could say whether ZooKeeper is suitable for this
>>>>>> >> >>>>>> purpose, and point to any good resources for information
>>>>>> >> >>>>>> about ZooKeeper.
>>>>>> >> >>>>>>
>>>>>> >> >>>>>> Since the application is based on an eventually consistent
>>>>>> >> >>>>>> distributed platform using Cassandra, we have felt a need
>>>>>> >> >>>>>> to look at other solutions instead of building our own
>>>>>> >> >>>>>> using our DB.
>>>>>> >> >>>>>>
>>>>>> >> >>>>>> Any kind of comments, suggestions are highly welcomed! :)
>>>>>> >> >>>>>>
>>>>>> >> >>>>>> Regards
>>>>>> >> >>>>>> Ertio Lew.
>>>>>> >> >>>>>>
>>>>>> >> >>>>>
>>>>>> >> >>>>
>>>>>> >> >>>
>>>>>> >> >>
>>>>>> >> >
>>>>>> >>
>>>>>> >
>>>>>>
>>>>>
>>>>
>>>
>>


Re: Zookeeper for generating sequential IDs

Posted by Ertio Lew <er...@gmail.com>.
Thanks Andrew, however I would prefer to stay away from super columns
because of their well-known limitations.

Regarding snowflake, I think I can make it work for me by limiting the
current 12-bit sequence number to 8 bits and using the 4 saved bits to
store the category of data. That reduces the theoretical limit of 4096
IDs per millisecond per machine to 256 IDs per ms per machine, which
sounds more than good enough for my use case..
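One possible packing of that modified layout, as an illustration (the field order and the 10-bit worker field mirror stock snowflake; this is not snowflake's actual code):

```python
TIMESTAMP_BITS, WORKER_BITS, SEQUENCE_BITS, CATEGORY_BITS = 41, 10, 8, 4
# 41 + 10 + 8 + 4 = 63 bits: still fits a signed 64-bit long.

def pack_id(timestamp_ms, worker, sequence, category):
    assert sequence < (1 << SEQUENCE_BITS)   # max 256 IDs per ms per worker
    assert category < (1 << CATEGORY_BITS)   # 16 data categories
    return ((timestamp_ms << (WORKER_BITS + SEQUENCE_BITS + CATEGORY_BITS))
            | (worker << (SEQUENCE_BITS + CATEGORY_BITS))
            | (sequence << CATEGORY_BITS)
            | category)

def unpack_category(packed):
    return packed & ((1 << CATEGORY_BITS) - 1)
```

Keeping the category in the low bits means the two keys for one entity share everything above them, so IDs still sort by time first.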

@Jeff, would you like to weigh in on this idea?

Thank you all..
Ertio


