Posted to user@accumulo.apache.org by "mohit.kaushik" <mo...@orkash.com> on 2015/06/23 14:08:06 UTC

How to generate UUID in real time environment for Accumulo

Hi All,

I have an application that can index data at a very high rate from
multiple clients. I need to generate a unique ID for each stored document.
The ID should:
(1) use the current system time in milliseconds;
(2) sort lexicographically on the basis of time;
(3) allow millions of unique IDs per second. If I just store
currentTimeInMillis, I can index only 1000 unique docs per second.

I am searching for the best possible approach to implement this. Any help?
Regards
*Mohit Kaushik*




Re: How to generate UUID in real time environment for Accumulo

Posted by Christopher <ct...@apache.org>.
That solution might be prone to duplicates if the same document is
encountered by multiple ingest clients.

Another option might be:

row=<time>_<hash(document)>
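
A minimal Java sketch of this layout, under assumptions that are not from the thread: the time is zero-padded to 13 decimal digits so that rows sort by time as strings, the hash is SHA-256 rendered as hex, and the class and method names are hypothetical. Note that two clients only arrive at the same row if they also agree on the time, so the time would presumably have to come from the document rather than from each client's clock.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper; the names here are illustrative, not from Accumulo.
final class TimeHashRow {

    // Builds "<time>_<hash(document)>" with a fixed-width time prefix.
    static String rowFor(long timeMillis, byte[] document) {
        try {
            MessageDigest sha = MessageDigest.getInstance("SHA-256");
            byte[] digest = sha.digest(document);
            StringBuilder row = new StringBuilder();
            // 13 digits cover epoch milliseconds until the year 2286.
            row.append(String.format("%013d", timeMillis)).append('_');
            for (byte b : digest) {
                row.append(String.format("%02x", b & 0xff));
            }
            return row.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] doc = "example document".getBytes(StandardCharsets.UTF_8);
        System.out.println(rowFor(System.currentTimeMillis(), doc));
    }
}
```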


--
Christopher L Tubbs II
http://gravatar.com/ctubbsii

On Tue, Jun 23, 2015 at 9:14 AM, Keith Turner <ke...@deenlo.com> wrote:

> Would something like the following work?
>
> row=<time>_<client id>_<client counter>
>
> Where the <client id> is a unique id per client instance, it would be
> allocated once using Zookeeper or an Accumulo Conditional writer when the
> client starts.   The client counter would be an AtomicLong in the client.
>

Re: How to generate UUID in real time environment for Accumulo

Posted by Russ Weeks <rw...@newbrightidea.com>.
Hi, Mohit,

I'm not sure what you mean when you say,

> which makes the UUID 32 bit long. I want UUID to be 16 digits long

Do you want the UUID to be 16 bytes long, giving you 128 bits to work with?
Do you care whether the UUID is human-readable (i.e. ASCII text) or not?

If not, take a look at how Twitter's Snowflake[1] project generates IDs. It
follows a form very similar to Keith's suggestion but with a more compact,
binary encoding. It squeezes IDs into 64 bits, allocating (IIRC) 41 bits
for the timestamp, 10 bits for the client id and 12 bits for the client
counter, with the top bit left unused. It uses a custom epoch to ensure that
the timestamp won't overflow for something like 69 years. If you're OK with
128-bit IDs, you don't need to be too concerned about that.

I'm not saying, bring in this project just to generate your IDs. I'm just
saying that the code is small enough that you could port it pretty easily.
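
As a rough illustration of how small such a port can be, here is a Snowflake-style sketch in Java. It is not Twitter's code: the class name, the custom epoch, and the exact bit widths (41 timestamp bits under an unused sign bit, 10 client bits, 12 counter bits) are assumptions made for the example, and the client id is assumed to be allocated elsewhere. The hypothetical toRow helper renders the id as 16 zero-padded hex digits, which keeps time order as text.

```java
// A Snowflake-style sketch, not Twitter's implementation. The epoch and the
// 41/10/12 bit split are assumptions chosen for illustration.
final class SnowflakeStyleIds {
    private static final long CUSTOM_EPOCH = 1420070400000L; // 2015-01-01T00:00:00Z
    private static final int CLIENT_BITS = 10;
    private static final int COUNTER_BITS = 12;
    private static final long MAX_COUNTER = (1L << COUNTER_BITS) - 1;

    private final long clientId;
    private long lastMillis = -1L;
    private long counter = 0L;

    SnowflakeStyleIds(long clientId) {
        if (clientId < 0 || clientId >= (1L << CLIENT_BITS)) {
            throw new IllegalArgumentException("client id must fit in 10 bits");
        }
        this.clientId = clientId;
    }

    synchronized long nextId() {
        long now = System.currentTimeMillis();
        if (now < lastMillis) {
            throw new IllegalStateException("clock moved backwards");
        }
        if (now == lastMillis) {
            counter = (counter + 1) & MAX_COUNTER;
            if (counter == 0) {
                // Counter exhausted for this millisecond: spin until the next one.
                while (now <= lastMillis) {
                    now = System.currentTimeMillis();
                }
            }
        } else {
            counter = 0;
        }
        lastMillis = now;
        return ((now - CUSTOM_EPOCH) << (CLIENT_BITS + COUNTER_BITS))
                | (clientId << COUNTER_BITS)
                | counter;
    }

    // 16 zero-padded hex digits. Text order matches numeric order as long as
    // the timestamp fits in 41 bits, because the sign bit is then never set.
    static String toRow(long id) {
        return String.format("%016x", id);
    }
}
```

With a 12-bit counter, one client can hand out 4096 ids per millisecond, roughly 4 million per second, before it has to wait for the clock to advance.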

1: https://github.com/twitter/snowflake/tree/snowflake-2010/src

On Wed, Jun 24, 2015 at 11:33 AM Keith Turner <ke...@deenlo.com> wrote:

> Could look into using Lexicoders.  The following program prints out 19.
> However this will vary depending on how many leading 0 bytes the longs
> have, because those are dropped.
>
>     long time = System.currentTimeMillis();
>
>     ListLexicoder<Long> ll =  new ListLexicoder<Long>(new
> ULongLexicoder());
>
>     List<Long> list = Arrays.asList(new Long[3]);
>     list.set(0, time);
>     list.set(1, 123456l);
>     list.set(2, 987654l);
>
>     byte[] b = ll.encode(list);
>
>     System.out.println(b.length);
>

Re: How to generate UUID in real time environment for Accumulo

Posted by Keith Turner <ke...@deenlo.com>.
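You could look into using Lexicoders.  The following program prints out 19.
However, this will vary depending on how many leading 0 bytes the longs
have, because those are dropped.

    import java.util.Arrays;
    import java.util.List;
    import org.apache.accumulo.core.client.lexicoder.ListLexicoder;
    import org.apache.accumulo.core.client.lexicoder.ULongLexicoder;

    // Body of a main method.
    long time = System.currentTimeMillis();

    // Encodes the list so that the resulting bytes sort in list order.
    ListLexicoder<Long> ll = new ListLexicoder<Long>(new ULongLexicoder());

    // Sample values standing in for <time>, <client id> and <client counter>.
    List<Long> list = Arrays.asList(time, 123456L, 987654L);

    byte[] b = ll.encode(list);

    // Length of the encoded row in bytes.
    System.out.println(b.length);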

On Wed, Jun 24, 2015 at 2:32 AM, mohit.kaushik <mo...@orkash.com>
wrote:

> On 06/23/2015 06:44 PM, Keith Turner wrote:
>
>> row=<time>_<client id>_<client counter>
>>
> this will definitely generate a UUID but if I use "14 digits for <time> +
> 12 digits for <client_id> + say 6 digits for <client_counter>" which makes
> the UUID 32 bit long. I want UUID to be 16 digits long.
>
> Can you suggest some encoding technique which can encode it to 16 digits
> and also maintains the time order?
>
> -Mohit kaushik
>
>

Re: How to generate UUID in real time environment for Accumulo

Posted by "mohit.kaushik" <mo...@orkash.com>.
On 06/23/2015 06:44 PM, Keith Turner wrote:
> row=<time>_<client id>_<client counter>
This will definitely generate a UUID, but if I use "14 digits for <time>
+ 12 digits for <client_id> + say 6 digits for <client_counter>" which
makes the UUID 32 bit long. I want UUID to be 16 digits long.

Can you suggest some encoding technique which can encode it to 16 digits
and also maintain the time order?

-Mohit kaushik


Re: How to generate UUID in real time environment for Accumulo

Posted by Mike Drob <ma...@cloudera.com>.
This sounds super close to a type 1 UUID -
https://en.wikipedia.org/wiki/Universally_unique_identifier#Version_1_.28MAC_address_.26_date-time.29
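
One thing worth checking before adopting a type 1 UUID for row ids: in the standard field layout the low 32 bits of the timestamp come first, so the text form does not necessarily sort by time. The sketch below packs two timestamps into that layout with java.util.UUID and compares the strings. The helper class is hypothetical, and the clock sequence and node fields are left as zero for simplicity.

```java
import java.util.UUID;

// Hypothetical helper that packs a timestamp into the version 1 field layout.
final class Type1Layout {

    // timestamp: 60-bit count of 100 ns intervals since 1582-10-15.
    static UUID fromTimestamp(long timestamp) {
        long timeLow = timestamp & 0xFFFFFFFFL;
        long timeMid = (timestamp >>> 32) & 0xFFFFL;
        long timeHi = (timestamp >>> 48) & 0x0FFFL;
        // time_low | time_mid | version (1) | time_hi
        long msb = (timeLow << 32) | (timeMid << 16) | (1L << 12) | timeHi;
        // Variant bits only; clock sequence and node are left as zero here.
        long lsb = 0x8000000000000000L;
        return new UUID(msb, lsb);
    }

    public static void main(String[] args) {
        UUID earlier = fromTimestamp(0x1FFFFFFFFL);
        UUID later = fromTimestamp(0x200000000L);
        System.out.println(earlier + " (earlier)");
        System.out.println(later + " (later)");
        // A positive result means the earlier id sorts after the later one as text.
        System.out.println(earlier.toString().compareTo(later.toString()));
    }
}
```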

On Tue, Jun 23, 2015 at 8:14 AM, Keith Turner <ke...@deenlo.com> wrote:

> Would something like the following work?
>
> row=<time>_<client id>_<client counter>
>
> Where the <client id> is a unique id per client instance, it would be
> allocated once using Zookeeper or an Accumulo Conditional writer when the
> client starts.   The client counter would be an AtomicLong in the client.
>

Re: How to generate UUID in real time environment for Accumulo

Posted by Keith Turner <ke...@deenlo.com>.
Would something like the following work?

row=<time>_<client id>_<client counter>

Here <client id> is a unique id per client instance, allocated once using
ZooKeeper or an Accumulo ConditionalWriter when the client starts. The
client counter would be an AtomicLong in the client.
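
A minimal Java sketch of this row layout follows. The class name and field widths are assumptions for illustration, and the client id is simply passed in, standing in for whatever ZooKeeper or conditional-write allocation is used at startup. Each field is zero-padded to a fixed width so that string order matches numeric order.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of row=<time>_<client id>_<client counter>.
final class ClientRowIds {
    // Assumed to have been allocated once, when the client started.
    private final String clientId;
    private final AtomicLong counter = new AtomicLong();

    ClientRowIds(int clientId) {
        // Fixed width so ids from different clients line up.
        this.clientId = String.format("%06d", clientId);
    }

    String nextRow() {
        long time = System.currentTimeMillis();
        long count = counter.getAndIncrement();
        // 13 digits for epoch milliseconds, 19 digits for any non-negative long.
        return String.format("%013d_%s_%019d", time, clientId, count);
    }
}
```

Uniqueness here rests on the client id and the counter alone; the time prefix only provides the sort order, so it does not matter how many rows one client creates within the same millisecond.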
