Posted to java-user@lucene.apache.org by Andrzej Bialecki <ab...@getopt.org> on 2011/11/01 01:32:26 UTC

Re: Bet you didn't know Lucene can...

On 31/10/2011 21:42, Petite Abeille wrote:
>
> On Oct 31, 2011, at 9:32 PM, Andrzej Bialecki wrote:
>
>> similarity-preserving hash function was calculated on each sentence, and the hash was added as a field. The property of the hash was that similar documents (sentences) would produce a similar hash, with only some bit-level perturbation. The challenge was to find a ranked list of possible duplicates with similar (not exact same) hashes, which in this case meant to find a ranked list of documents that have the smallest bit-level distance in their hashes from the query hash.
>>
>> The solution is described in SOLR-1918 - Bit-wise scoring field type.
>
> In other words, a simhash, no?
>
> Similarity Estimation Techniques from Rounding Algorithms
> http://www.cs.princeton.edu/courses/archive/spr04/cos598B/bib/CharikarEstim.pdf
>
> http://www.matpalm.com/resemblance/simhash/

Yes, you could use this. In that project we used a different 
application-specific hash.
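
For illustration only, here is a minimal sketch of the ranking idea from the quoted text: order candidates by the number of bits in which their hash differs from the query hash. It assumes 64-bit hashes held in memory and a brute-force scan. The class and method names (HammingRank, distance, rank) are hypothetical, and this is not the SOLR-1918 implementation or the application-specific hash used in that project.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of Lucene or Solr.
class HammingRank {

    // Bit-level (Hamming) distance: number of positions where the two hashes differ.
    static int distance(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    // Returns the indices of `hashes`, ordered from smallest to largest
    // bit distance to `query`. Ties keep their original index order,
    // because List.sort is stable.
    static List<Integer> rank(long query, long[] hashes) {
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < hashes.length; i++) {
            order.add(i);
        }
        order.sort(Comparator.comparingInt(i -> distance(query, hashes[i])));
        return order;
    }

    public static void main(String[] args) {
        long query = 0b1011L;
        long[] hashes = {0b0000L, 0b1001L, 0b1011L, 0b1111L};
        // Distances to the query are 3, 1, 0, 1 respectively.
        System.out.println(rank(query, hashes)); // prints [2, 1, 3, 0]
    }
}
```

A linear scan like this only scales to small candidate sets; the point of doing it inside the index is to get the same ordering as a relevance score.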


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
