Posted to solr-user@lucene.apache.org by Dan Lynn <da...@danlynn.com> on 2010/11/15 18:14:42 UTC

hash uniqueKey generation?

Hi,

I just finished reading on the wiki about deduplication and the 
solr.UUIDField type. What I'd like to do is generate an ID for a 
document by hashing a subset of its fields. One route I thought would be 
to do this ahead of time to CSV data, but I would think sticking 
something into the UpdateRequest chain would be more elegant.

Has anyone had any success in this area?

Cheers,
Dan
http://twitter.com/danklynn

Re: hash uniqueKey generation?

Posted by Dan Lynn <da...@danlynn.com>.
Thanks for the feedback, guys!

On 11/15/2010 10:14 AM, Dan Lynn wrote:
> Hi,
>
> I just finished reading on the wiki about deduplication and the 
> solr.UUIDField type. What I'd like to do is generate an ID for a 
> document by hashing a subset of its fields. One route I thought would 
> be to do this ahead of time to CSV data, but I would think sticking 
> something into the UpdateRequest chain would be more elegant.
>
> Has anyone had any success in this area?
>
> Cheers,
> Dan
> http://twitter.com/danklynn


Re: hash uniqueKey generation?

Posted by Chris Hostetter <ho...@fucit.org>.
: I just finished reading on the wiki about deduplication and the solr.UUIDField
: type. What I'd like to do is generate an ID for a document by hashing a subset
: of its fields. One route I thought would be to do this ahead of time to CSV
: data, but I would think sticking something into the UpdateRequest chain would
: be more elegant.
: 
: Has anyone had any success in this area?

what you described is *exactly* what the SignatureUpdateProcessorFactory does - 
you specify a list of field names and it uses them to build a hash.
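
A minimal solrconfig.xml sketch along the lines of the Deduplication wiki page 
(the chain name "dedupe" and the field list "name,features,cat" are placeholders 
for illustration - substitute your own subset of fields):

  <updateRequestProcessorChain name="dedupe">
    <processor class="solr.processor.SignatureUpdateProcessorFactory">
      <bool name="enabled">true</bool>
      <!-- field that receives the computed hash; pointing it at your
           uniqueKey field gives you the hash-based ID -->
      <str name="signatureField">id</str>
      <bool name="overwriteDupes">false</bool>
      <!-- the subset of fields to hash -->
      <str name="fields">name,features,cat</str>
      <!-- Lookup3Signature is a fast 64-bit hash; use
           solr.processor.MD5Signature for a full 128-bit hash -->
      <str name="signatureClass">solr.processor.Lookup3Signature</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>

Reference the chain from your update handler's defaults (the parameter is 
update.processor in Solr 1.4; later releases call it update.chain).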

-Hoss

Re: hash uniqueKey generation?

Posted by Lance Norskog <go...@gmail.com>.
Nobody has ever reported seeing a collision 'in the wild' with MD5. It is 
cryptographically broken, but producing a collision takes a deliberate 
collision-finding algorithm - it does not happen by accident.

As to cosmic rays: it's a real problem. A recent Google paper reported 
that some RAM chips will have 1 bit error per gigabit per century, while 
others have that much per hour. I've also seen bit errors on disks. All 
file systems should use checksums.

Yonik Seeley wrote:
> On Tue, Nov 16, 2010 at 9:05 PM, Dennis Gearon <ge...@sbcglobal.net> wrote:
>    
>> Read up on Wikipedia, but I believe that no hash function is much good above 50%
>> of the address space it generates.
>>      
> 50% is way too high - collisions will happen before that.
>
> But given that something like MD5 has 128 bits, that's 3.4e38, so even
> a small fraction of that address space will work.  The probabilities
> follow the "birthday problem":
> http://en.wikipedia.org/wiki/Birthday_problem
>
> Using a 128 bit hash, you can hash 26B docs with a hash collision
> probability of 1e-18 (and yes, that is lower than the probability of
> something else going wrong).
>
> It also says: "For comparison, 10^-18 to 10^-15 is the uncorrectable bit
> error rate of a typical hard disk [2]. In theory, MD5, 128 bits,
> should stay within that range until about 820 billion documents, even
> if its possible outputs are many more."
>
> -Yonik
> http://www.lucidimagination.com
>    

Re: hash uniqueKey generation?

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Tue, Nov 16, 2010 at 9:05 PM, Dennis Gearon <ge...@sbcglobal.net> wrote:
> Read up on Wikipedia, but I believe that no hash function is much good above 50%
> of the address space it generates.

50% is way too high - collisions will happen before that.

But given that something like MD5 has 128 bits, that's 3.4e38, so even
a small fraction of that address space will work.  The probabilities
follow the "birthday problem":
http://en.wikipedia.org/wiki/Birthday_problem

Using a 128 bit hash, you can hash 26B docs with a hash collision
probability of 1e-18 (and yes, that is lower than the probability of
something else going wrong).
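
(That figure is the standard birthday approximation: hashing n documents into
b bits gives P(collision) ~= n^2 / 2^(b+1), so n = 2.6e10 and b = 128 works
out to (2.6e10)^2 / (2 * 3.4e38) ~= 1e-18.)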

It also says: "For comparison, 10^-18 to 10^-15 is the uncorrectable bit
error rate of a typical hard disk [2]. In theory, MD5, 128 bits,
should stay within that range until about 820 billion documents, even
if its possible outputs are many more."

-Yonik
http://www.lucidimagination.com

Re: hash uniqueKey generation?

Posted by Dennis Gearon <ge...@sbcglobal.net>.
Good hash functions almost never produce duplicates - 'collisions', as they 
are called - as long as the number of entries stays under a certain 
percentage of the address space the bits provide. 


Read up on Wikipedia, but I believe that no hash function is much good above 50% 
of the address space it generates. Many are much worse. Some are exceptional. 
Just know what you are using.

Cosmic rays are not much of a problem at sea level . . . but that changes 
linearly with altitude. Astronauts regularly see micro flashes in their 
vision . . . as cosmic rays pass through their retinas or optic nerves.
 Dennis Gearon 


Signature Warning
----------------
It is always a good idea to learn from your own mistakes. It is usually a better 
idea to learn from others’ mistakes, so you do not have to make them yourself. 
from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die. 



----- Original Message ----
From: Yonik Seeley <yo...@lucidimagination.com>
To: solr-user@lucene.apache.org
Sent: Tue, November 16, 2010 1:46:43 PM
Subject: Re: hash uniqueKey generation?

On Tue, Nov 16, 2010 at 5:31 AM, Dennis Gearon <ge...@sbcglobal.net> wrote:
> hashing is not 100% guaranteed to produce unique values.

But if you go to enough bits with a good hash function, you can get
the odds lower than the odds of something else changing the value like
cosmic rays flipping a bit on you.

-Yonik
http://www.lucidimagination.com


Re: hash uniqueKey generation?

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Tue, Nov 16, 2010 at 5:31 AM, Dennis Gearon <ge...@sbcglobal.net> wrote:
> hashing is not 100% guaranteed to produce unique values.

But if you go to enough bits with a good hash function, you can get
the odds lower than the odds of something else changing the value like
cosmic rays flipping a bit on you.

-Yonik
http://www.lucidimagination.com

Re: hash uniqueKey generation?

Posted by Dennis Gearon <ge...@sbcglobal.net>.
hashing is not 100% guaranteed to produce unique values. 

It's worth reading about and knowing about :-)

 Dennis Gearon


Signature Warning
----------------
It is always a good idea to learn from your own mistakes. It is usually a better 
idea to learn from others’ mistakes, so you do not have to make them yourself. 
from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die.



----- Original Message ----
From: Lance Norskog <go...@gmail.com>
To: solr-user@lucene.apache.org
Sent: Mon, November 15, 2010 9:45:34 PM
Subject: Re: hash uniqueKey generation?

I think the deduplication signature field will work as a multiValued field. So 
you can do copyField to it from all of the source fields.

Dan Lynn wrote:
> Hi,
> 
> I just finished reading on the wiki about deduplication and the solr.UUIDField 
> type. What I'd like to do is generate an ID for a document by hashing a subset 
> of its fields. One route I thought would be to do this ahead of time to CSV 
> data, but I would think sticking something into the UpdateRequest chain would 
> be more elegant.
> 
> Has anyone had any success in this area?
> 
> Cheers,
> Dan
> http://twitter.com/danklynn


Re: hash uniqueKey generation?

Posted by Lance Norskog <go...@gmail.com>.
I think the deduplication signature field will work as a multiValued 
field. So you can do copyField to it from all of the source fields.

Dan Lynn wrote:
> Hi,
>
> I just finished reading on the wiki about deduplication and the 
> solr.UUIDField type. What I'd like to do is generate an ID for a 
> document by hashing a subset of its fields. One route I thought would 
> be to do this ahead of time to CSV data, but I would think sticking 
> something into the UpdateRequest chain would be more elegant.
>
> Has anyone had any success in this area?
>
> Cheers,
> Dan
> http://twitter.com/danklynn