You are viewing a plain text version of this content. The canonical link for it is here.
Posted to solr-user@lucene.apache.org by Nickolay41189 <Kl...@yandex.ru> on 2015/12/02 21:00:57 UTC

Grouping by simhash signature

I try to implement NearDup detection by  SimHash
<https://moz.com/devblog/near-duplicate-detection/>   algorithm in Solr. 
Let's say:
1) each document has a field /simhash_signature/ that stores a sequence of
bits.
2) that in order to be considered NearDup, documents must have, at most, 2
bits that differ in /simhash_signature/


*My question:*
How can I get groups of nearDup by /simhash_signature/?

*Examples:*
  Input:
    Doc A = 0001000
    Doc B = 1000000
    Doc C = 1111111
    Doc D = 0101000
  Output:
    A -> {B, D}
    B -> {A}
    C -> {}
    D -> {A}



--
View this message in context: http://lucene.472066.n3.nabble.com/Grouping-by-simhash-signature-tp4243236.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Grouping by simhash signature

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
On Wed, 2015-12-02 at 13:00 -0700, Nickolay41189 wrote:
> I try to implement NearDup detection by  SimHash
> <https://moz.com/devblog/near-duplicate-detection/>   algorithm in Solr. 
[...]
> How can I get groups of nearDup by /simhash_signature/?

You could follow the suggested recipe at the page you linked to and
remove the false positives as part of post-processing? Unless you have a
lot of documents that are at the edge between not-similar-enough and
similar-enough, that should be efficient.


So if a SimHash consists of 4*16 bits: ABCD, you would store all
possible 2-part representations: [AB, AC, AD, BA, BC, BD, CA, CB, CD,
DA, DB, DC], either as String-binary (0/1) for easy debug or a bit more
packed with base 16 or 64.

At query time you would do the same permutations and issue a search for
ab OR ac OR ad OR ba OR bc OR bd OR ca OR cb OR cd OR da OR db OR dc

It would even sorta-work with relevance ranking as a match on 2/4 parts
of the SimHash would mean that 2/12 of the query clauses matches, while
a match on 3/4 SimHash-parts means that 6/12 query clauses matches.

- Toke Eskildsen, State and University Library, Denmark



Re: Grouping by simhash signature

Posted by Nikola Smolenski <sm...@unilib.rs>.
On Wed, Dec 2, 2015 at 9:00 PM, Nickolay41189 <Kl...@yandex.ru> wrote:
> I try to implement NearDup detection by  SimHash
> <https://moz.com/devblog/near-duplicate-detection/>   algorithm in Solr.
> Let's say:
> 1) each document has a field /simhash_signature/ that stores a sequence of
> bits.
> 2) that in order to be considered NearDup, documents must have, at most, 2
> bits that differ in /simhash_signature/
>
>
> *My question:*
> How can I get groups of nearDup by /simhash_signature/?
>
> *Examples:*
>   Input:
>     Doc A = 0001000
>     Doc B = 1000000
>     Doc C = 1111111
>     Doc D = 0101000
>   Output:
>     A -> {B, D}
>     B -> {A}
>     C -> {}
>     D -> {A}

I'm not sure if this is the best solution (or, indeed, if it is at all
possible), but maybe you could store the bit fields as strings, then
use strdist function to find Levenshtein distance between the strings
and group by that.

-- 
Nikola Smolenski

University of Belgrade
University library ''Svetozar Markovic''

Re: Grouping by simhash signature

Posted by Nickolay41189 <Kl...@yandex.ru>.
Maybe there is some way to override equals function of grouping (change "=="
to strdist)?




--
View this message in context: http://lucene.472066.n3.nabble.com/Grouping-by-simhash-signature-tp4243236p4244541.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Grouping by simhash signature

Posted by Chris Hostetter <ho...@fucit.org>.
: I try to implement NearDup detection by  SimHash

I'm not really familiar with simhash, but based on your description of it, 
i'm not sure that any of Solr's deduplication, grouping, or collapsing 
features will really help you here...

: 1) each document has a field /simhash_signature/ that stores a sequence of
: bits.
: 2) that in order to be considered NearDup, documents must have, at most, 2
: bits that differ in /simhash_signature/
: 
: *My question:*
: How can I get groups of nearDup by /simhash_signature/?

the problem here is that there is no transative property in your 
definition of a "NearDup" -- as you point out in your example, B & D are 
both "NearDups" or A, but B & D are not NearDups of eachother.

Some sort of transative relationship (either in terms of an identical 
field value, or a function that can produce identical results for all 
documents i na group) is neccessary to use Solr's de-duplication, 
collapsing, or grouping functionality.

Assuming you wanted results like those below, and you had some existing 
"query + sort" that would identiy the "main" document result set (the "Doc 
A', "Doc B", "Doc C", "Doc D" list in that order) you could -- in theory 
-- write a custom DocTransformer that could annotate those documents with 
a list of doc IDs that had "NearDup" values for some field (possily doing 
strdist, or some other more efficient binary bit set diff as a 
ValueSource) 

If you wanted to pursue implementing a DocTransofrmer like this as a 
plugin, the existing ChildDocTransformerFactory might be a good starting 
point for some code to study.

: *Examples:*
:   Input:
:     Doc A = 0001000
:     Doc B = 1000000
:     Doc C = 1111111
:     Doc D = 0101000
:   Output:
:     A -> {B, D}
:     B -> {A}
:     C -> {}
:     D -> {A}


-Hoss
http://www.lucidworks.com/