You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hbase.apache.org by Jonathan Bishop <jb...@gmail.com> on 2012/07/20 18:22:07 UTC

Use of MD5 as row keys - is this safe?

Hi,

I know it is a commonly suggested to use an MD5 checksum to create a row
key from some other identifier, such as a string or long. This is usually
done to guard against hot-spotting and seems to work well.

My concern is that there no guard against collision when this is done - two
different strings or longs could produce the same row-key. Although this is
very unlikely, it is bothersome to consider this possibility for large
systems.

So what I usually do is concatenate the MD5 with the original identifier...

MD5(id) + id

which assures that the rowkey is both randomly distributed and unique.

Is this necessary, or is it the common practice to just use the MD5
checksum itself?

Thanks,

Jon

Re: Use of MD5 as row keys - is this safe?

Posted by Anton Lyska <an...@wildec.com>.
Hi,

I use reversed hex for auto-incremented ids.
For example:
id=123456, row key=042E1
id=123457, row key=142E1
I've started use this approach recently, but it seems it works pretty well.
All regions are distributed uniformly, with no hot-spotting

2012/7/20 Jonathan Bishop <jb...@gmail.com>

> Hi,
>
> I know it is a commonly suggested to use an MD5 checksum to create a row
> key from some other identifier, such as a string or long. This is usually
> done to guard against hot-spotting and seems to work well.
>
> My concern is that there no guard against collision when this is done - two
> different strings or longs could produce the same row-key. Although this is
> very unlikely, it is bothersome to consider this possibility for large
> systems.
>
> So what I usually do is concatenate the MD5 with the original identifier...
>
> MD5(id) + id
>
> which assures that the rowkey is both randomly distributed and unique.
>
> Is this necessary, or is it the common practice to just use the MD5
> checksum itself?
>
> Thanks,
>
> Jon
>

Re: Use of MD5 as row keys - is this safe?

Posted by Amandeep Khurana <am...@gmail.com>.
On Mon, Jul 23, 2012 at 9:58 AM, Jonathan Bishop <jb...@gmail.com>wrote:

> Hi,
> Thanks everyone for the informative discussion on this topic.
>
> I think that for project I am involved in I must remove the risk, however
> small, of a row key collision, and append the original id (in my case a
> long) to the hash, whatever hash I use. I don't want to be in the situation
> where occasionally something goes wrong and needing to eliminate the
> possibility of a collision.
>
> I was confused by a discussion in a book I was reading on HBase, "HBase in
> Action", which used MD5 directly as the row key, leaving the impression
> that this was a completely reliable way of creating unique row keys from
> strings.
>

The book talks about hashing as well as salting. I'll add notes to it about
possible collisions while using hashing. Thanks for pointing this out.


>
> Jon
>

Re: Use of MD5 as row keys - is this safe?

Posted by Jonathan Bishop <jb...@gmail.com>.
Hi,
Thanks everyone for the informative discussion on this topic.

I think that for project I am involved in I must remove the risk, however
small, of a row key collision, and append the original id (in my case a
long) to the hash, whatever hash I use. I don't want to be in the situation
where occasionally something goes wrong and needing to eliminate the
possibility of a collision.

I was confused by a discussion in a book I was reading on HBase, "HBase in
Action", which used MD5 directly as the row key, leaving the impression
that this was a completely reliable way of creating unique row keys from
strings.

Jon

Re: Use of MD5 as row keys - is this safe?

Posted by Michael Segel <mi...@hotmail.com>.
Ethan,

Its not a question of collision attacks, although that is an indication of the strength of the hash. 

Just because both MD5 and SHA-1 yield a key of the same length, that doesn't mean that they will have the same chance of a collision. The algorithm has some impact on this too. 

I agree that if you're looking to hash the key, MD5 should be good enough. If you don't like it, SHA-1 or SHA-2 are also available as part of Sun's JDK. 
(See Appendix A here: http://docs.oracle.com/javase/6/docs/technotes/guides/security/p11guide.html) 

Key's longer than 512 are not exportable from the US. 

The point I was making was that if you are not comfortable in using MD5 and you want to append the key to the hash, you may want to consider a different hash. 

In terms of Uniqueness, take a UUID , hash it, and then truncate the hash to 128 bits from 160 bits. This actually is a form of the UUID. While this could increase the likelihood of a collision, I haven't seen one in real use. 

The idea of creating a hash and then prepending it to the key is a tad pessimistic.

That was the point. 
 


On Jul 22, 2012, at 9:21 AM, Ethan Jewett wrote:

> To echo Joe Pallas:
> 
> Any fairly "random" hash algorithm producing the same length output should
> have about the same extremely small chance of producing the same output for
> two different inputs - a collision. It's a problem you need to be aware of
> no matter what hash algorithm you use. (Hash functions are mappings from a
> theoretically infinite input space to a finitely large output space, so
> they obviously generate the same output for multiple inputs.)
> 
> SHA-1 specifically (and MD5 even more-so) has an attack that shows that
> given a specific input and output, we can calculate a new input that
> produces the same output with better than brute-force efficiency.
> 
> Collisions and collision attacks are two different things. Collision
> attacks are a problem for cryptographic uses like signing, but how does
> this have anything to do with the problem of generating hBase row keys?
> Just use the fastest, most accessible, random-enough algorithm you can
> find, and if you are really worried about collisions then do something to
> ensure that the key will be unique. Right?
> 
> Cheers,
> Ethan
> 
> On Sun, Jul 22, 2012 at 2:00 PM, Michel Segel <mi...@hotmail.com>wrote:
> 
>> http://en.wikipedia.org/wiki/SHA-1
>> 
>> Check out the comparisons between the different SHA algos.
>> 
>> In theory a collision was found for SHA-1, but none found for SHA-2 does
>> that mean that a collision doesn't exist? No, it means that it hasn't
>> happened yet and the odds are that it won't be found. Possible? Yes,
>> however, highly improbable. You have a better chance of winning the lotto...
>> 
>> The point was that if you are going to hash your key,then concatenate the
>> initial key, you would be better off looking at the SHA-1 option. You have
>> to consider a couple of factors...
>> 1: availability of the algo. SHA-1 is in the standard java API and is
>> readily available.
>> 2: speed. Is SHA-1fast enough? Maybe, depending on your requirements. For
>> most, I'll say probably.
>> 3: Size of Key. SHA-1 is probably be smaller than having an MD-5 hash and
>> the original key added.
>> 
>> Just food for thought...
>> 
>> Sent from a remote device. Please excuse any typos...
>> 
>> Mike Segel
>> 
>> On Jul 20, 2012, at 3:35 PM, Joe Pallas <pa...@cs.stanford.edu> wrote:
>> 
>>> 
>>> On Jul 20, 2012, at 12:16 PM, Michel Segel wrote:
>>> 
>>>> I don't believe that there has been any reports of collisions, but if.
>> You are concerned you could use the SHA-1 for generating the hash.
>> Relatively speaking, SHA-1is slower, but still fast enough for most
>> applications.
>>> 
>>> Every hash function can have collisions, by definition.  If the
>> correctness of your design depends on collisions being impossible, rather
>> than very rare, then your design is faulty.
>>> 
>>> Cryptographic hash functions have the property that it is
>> computationally hard to create inputs that match a given output.  That
>> doesn’t in itself make cryptographic hash functions better than other hash
>> functions for avoiding hot-spotting.  (But it does usually make
>> cryptographic hash functions more expensive to compute than other hash
>> functions.)
>>> 
>>> You may want to look at <http://www.strchr.com/hash_functions>  and <
>> http://programmers.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed/145633#145633
>>> .
>>> 
>>> Hope this helps,
>>> joe
>>> 
>>> 
>> 


Re: Use of MD5 as row keys - is this safe?

Posted by Ethan Jewett <es...@gmail.com>.
To echo Joe Pallas:

Any fairly "random" hash algorithm producing the same length output should
have about the same extremely small chance of producing the same output for
two different inputs - a collision. It's a problem you need to be aware of
no matter what hash algorithm you use. (Hash functions are mappings from a
theoretically infinite input space to a finitely large output space, so
they obviously generate the same output for multiple inputs.)

SHA-1 specifically (and MD5 even more-so) has an attack that shows that
given a specific input and output, we can calculate a new input that
produces the same output with better than brute-force efficiency.

Collisions and collision attacks are two different things. Collision
attacks are a problem for cryptographic uses like signing, but how does
this have anything to do with the problem of generating hBase row keys?
Just use the fastest, most accessible, random-enough algorithm you can
find, and if you are really worried about collisions then do something to
ensure that the key will be unique. Right?

Cheers,
Ethan

On Sun, Jul 22, 2012 at 2:00 PM, Michel Segel <mi...@hotmail.com>wrote:

> http://en.wikipedia.org/wiki/SHA-1
>
> Check out the comparisons between the different SHA algos.
>
> In theory a collision was found for SHA-1, but none found for SHA-2 does
> that mean that a collision doesn't exist? No, it means that it hasn't
> happened yet and the odds are that it won't be found. Possible? Yes,
> however, highly improbable. You have a better chance of winning the lotto...
>
> The point was that if you are going to hash your key,then concatenate the
> initial key, you would be better off looking at the SHA-1 option. You have
> to consider a couple of factors...
> 1: availability of the algo. SHA-1 is in the standard java API and is
> readily available.
> 2: speed. Is SHA-1fast enough? Maybe, depending on your requirements. For
> most, I'll say probably.
> 3: Size of Key. SHA-1 is probably be smaller than having an MD-5 hash and
> the original key added.
>
> Just food for thought...
>
> Sent from a remote device. Please excuse any typos...
>
> Mike Segel
>
> On Jul 20, 2012, at 3:35 PM, Joe Pallas <pa...@cs.stanford.edu> wrote:
>
> >
> > On Jul 20, 2012, at 12:16 PM, Michel Segel wrote:
> >
> >> I don't believe that there has been any reports of collisions, but if.
> You are concerned you could use the SHA-1 for generating the hash.
> Relatively speaking, SHA-1is slower, but still fast enough for most
> applications.
> >
> > Every hash function can have collisions, by definition.  If the
> correctness of your design depends on collisions being impossible, rather
> than very rare, then your design is faulty.
> >
> > Cryptographic hash functions have the property that it is
> computationally hard to create inputs that match a given output.  That
> doesn’t in itself make cryptographic hash functions better than other hash
> functions for avoiding hot-spotting.  (But it does usually make
> cryptographic hash functions more expensive to compute than other hash
> functions.)
> >
> > You may want to look at <http://www.strchr.com/hash_functions>  and <
> http://programmers.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed/145633#145633
> >.
> >
> > Hope this helps,
> > joe
> >
> >
>

Re: Use of MD5 as row keys - is this safe?

Posted by Michel Segel <mi...@hotmail.com>.
http://en.wikipedia.org/wiki/SHA-1

Check out the comparisons between the different SHA algos.

In theory a collision was found for SHA-1, but none found for SHA-2 does that mean that a collision doesn't exist? No, it means that it hasn't happened yet and the odds are that it won't be found. Possible? Yes, however, highly improbable. You have a better chance of winning the lotto...

The point was that if you are going to hash your key,then concatenate the initial key, you would be better off looking at the SHA-1 option. You have to consider a couple of factors...
1: availability of the algo. SHA-1 is in the standard java API and is readily available.
2: speed. Is SHA-1fast enough? Maybe, depending on your requirements. For most, I'll say probably.
3: Size of Key. SHA-1 is probably be smaller than having an MD-5 hash and the original key added.

Just food for thought...

Sent from a remote device. Please excuse any typos...

Mike Segel

On Jul 20, 2012, at 3:35 PM, Joe Pallas <pa...@cs.stanford.edu> wrote:

> 
> On Jul 20, 2012, at 12:16 PM, Michel Segel wrote:
> 
>> I don't believe that there has been any reports of collisions, but if. You are concerned you could use the SHA-1 for generating the hash. Relatively speaking, SHA-1is slower, but still fast enough for most applications.
> 
> Every hash function can have collisions, by definition.  If the correctness of your design depends on collisions being impossible, rather than very rare, then your design is faulty.
> 
> Cryptographic hash functions have the property that it is computationally hard to create inputs that match a given output.  That doesn’t in itself make cryptographic hash functions better than other hash functions for avoiding hot-spotting.  (But it does usually make cryptographic hash functions more expensive to compute than other hash functions.)
> 
> You may want to look at <http://www.strchr.com/hash_functions>  and <http://programmers.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed/145633#145633>.
> 
> Hope this helps,
> joe
> 
> 

Re: Use of MD5 as row keys - is this safe?

Posted by Joe Pallas <pa...@cs.stanford.edu>.
On Jul 20, 2012, at 12:16 PM, Michel Segel wrote:

> I don't believe that there has been any reports of collisions, but if. You are concerned you could use the SHA-1 for generating the hash. Relatively speaking, SHA-1is slower, but still fast enough for most applications.

Every hash function can have collisions, by definition.  If the correctness of your design depends on collisions being impossible, rather than very rare, then your design is faulty.

Cryptographic hash functions have the property that it is computationally hard to create inputs that match a given output.  That doesn’t in itself make cryptographic hash functions better than other hash functions for avoiding hot-spotting.  (But it does usually make cryptographic hash functions more expensive to compute than other hash functions.)

You may want to look at <http://www.strchr.com/hash_functions>  and <http://programmers.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed/145633#145633>.

Hope this helps,
joe


Re: Use of MD5 as row keys - is this safe?

Posted by Michel Segel <mi...@hotmail.com>.
I don't believe that there has been any reports of collisions, but if. You are concerned you could use the SHA-1 for generating the hash. Relatively speaking, SHA-1is slower, but still fast enough for most applications.

Don't know if it's speed relative to an MD5 and string cat, but it should yield a smaller key.

Sent from a remote device. Please excuse any typos...

Mike Segel

On Jul 20, 2012, at 11:31 AM, Damien Hardy <dh...@figarocms.fr> wrote:

> Le 20/07/2012 18:22, Jonathan Bishop a écrit :
>> Hi,
>> 
>> I know it is a commonly suggested to use an MD5 checksum to create a row
>> key from some other identifier, such as a string or long. This is usually
>> done to guard against hot-spotting and seems to work well.
>> 
>> My concern is that there no guard against collision when this is done - two
>> different strings or longs could produce the same row-key. Although this is
>> very unlikely, it is bothersome to consider this possibility for large
>> systems.
>> 
>> So what I usually do is concatenate the MD5 with the original identifier...
>> 
>> MD5(id) + id
>> 
>> which assures that the rowkey is both randomly distributed and unique.
>> 
>> Is this necessary, or is it the common practice to just use the MD5
>> checksum itself?
>> 
>> Thanks,
>> 
>> Jon
> 
> Hello Jonathan,
> 
> md5(id)+id is the good way to avoid hotspotting and insure uniqueness.
> 
> md5(id)[0]+id could be an other way to limit randomness of the rowid on
> 16 values
> You can now combine (with OR logic) 16 filters in a scanner (on for each
> letter available in md5 digest)
> it limits the balance on 16 potentials regions olso.
> 
> Cheers,
> 
> -- 
> Damien
> 

Re: Use of MD5 as row keys - is this safe?

Posted by Damien Hardy <dh...@figarocms.fr>.
Le 20/07/2012 18:22, Jonathan Bishop a écrit :
> Hi,
>
> I know it is a commonly suggested to use an MD5 checksum to create a row
> key from some other identifier, such as a string or long. This is usually
> done to guard against hot-spotting and seems to work well.
>
> My concern is that there no guard against collision when this is done - two
> different strings or longs could produce the same row-key. Although this is
> very unlikely, it is bothersome to consider this possibility for large
> systems.
>
> So what I usually do is concatenate the MD5 with the original identifier...
>
> MD5(id) + id
>
> which assures that the rowkey is both randomly distributed and unique.
>
> Is this necessary, or is it the common practice to just use the MD5
> checksum itself?
>
> Thanks,
>
> Jon

Hello Jonathan,

md5(id)+id is the good way to avoid hotspotting and insure uniqueness.

md5(id)[0]+id could be an other way to limit randomness of the rowid on
16 values
You can now combine (with OR logic) 16 filters in a scanner (on for each
letter available in md5 digest)
it limits the balance on 16 potentials regions olso.

Cheers,

-- 
Damien