Posted to solr-user@lucene.apache.org by Andrew Clegg <an...@gmail.com> on 2010/01/12 13:56:08 UTC

Filtering near-duplicates using TextProfileSignature

Hi,

I'm interested in near-dupe removal as mentioned (briefly) here:

http://wiki.apache.org/solr/Deduplication

However the link for TextProfileSignature hasn't been filled in yet.

Does anyone have an example of using TextProfileSignature that demonstrates
the tunable parameters mentioned in the wiki?

Thanks!

Andrew.

-- 
View this message in context: http://old.nabble.com/Filtering-near-duplicates-using-TextProfileSignature-tp27127151p27127151.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Filtering near-duplicates using TextProfileSignature

Posted by Andrew Clegg <an...@gmail.com>.

Erik Hatcher-4 wrote:
> 
> 
> On Jan 12, 2010, at 9:15 AM, Andrew Clegg wrote:
>> Thanks Erik, but I'm still a little confused as to exactly where in the
>> Solr config I set these parameters.
> 
> You'd configure them within the <processor> element, something like this:
> 
>     <str name="minTokenLen">5</str>
> 
> 

OK, thanks. (Should that really be str though, and not int or something?)


Erik Hatcher-4 wrote:
> 
> 
> Perhaps you could update the wiki with an example once you get it working?
> 
> I'm flying a little blind here, just going off the source code, not trying
> it out for real.
> 
> 

Sure -- it won't be til next week at the earliest though.

Cheers,

Andrew.



Re: Filtering near-duplicates using TextProfileSignature

Posted by Neeb <mu...@hotmail.com>.
Thanks guys.
I will try this with some test documents, fingers crossed.
And by the way, I got the minTokenLen parameter from one of the thread
replies (from Erik).

Cheerz,
Ali



Re: Filtering near-duplicates using TextProfileSignature

Posted by Markus Jelsma <ma...@buyways.nl>.
Here's my config for the updateProcessor. It now uses another signature method,
but I've used TextProfileSignature as well and it works - sort of.


  <updateRequestProcessorChain name="dedupe">
    <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
      <bool name="enabled">true</bool>
      <str name="signatureField">sig</str>
      <bool name="overwriteDupes">true</bool>
      <str name="fields">content</str>
      <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>


Of course, you must reference the update processor chain in your
requestHandler; it's commented out in mine at the moment.


  <requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
    <!--
    <lst name="defaults">
      <str name="update.processor">dedupe</str>
    </lst>
    -->
  </requestHandler>


Also, I see you define minTokenLen = 3. Where does that come from? I haven't
seen anything on the wiki specifying such a parameter.


On Tuesday 08 June 2010 19:45:35 Neeb wrote:
> Hey Andrew,
> 
> Just wondering if you ever managed to run TextProfileSignature based
> deduplication. I would appreciate it if you could send me the code fragment
> for it from  solrconfig.
> 
> I currently have something like this, but I'm not sure if I am doing it right:
> 
>  <updateRequestProcessorChain name="dedupe">
>     <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
>       <bool name="enabled">true</bool>
>       <str name="signatureField">signature</str>
>       <bool name="overwriteDupes">true</bool>
>       <str name="fields">title,author,abstract</str>
>       <str name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>
>       <str name="minTokenLen">3</str>
>     </processor>
>     <processor class="solr.LogUpdateProcessorFactory" />
>     <processor class="solr.RunUpdateProcessorFactory" />
>   </updateRequestProcessorChain>
> 
> --
> 
> Thanks in advance,
> -Ali
> 

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: Filtering near-duplicates using TextProfileSignature

Posted by Andrew Clegg <an...@gmail.com>.

Markus Jelsma wrote:
> 
> Well, it got me too! KMail didn't properly order this thread. Can't seem to
> find Hatcher's reply anywhere. ??!!?
> 

Whole thread here:

http://lucene.472066.n3.nabble.com/Filtering-near-duplicates-using-TextProfileSignature-tt479039.html

Re: Filtering near-duplicates using TextProfileSignature

Posted by Markus Jelsma <ma...@buyways.nl>.
Well, it got me too! KMail didn't properly order this thread. Can't seem to 
find Hatcher's reply anywhere. ??!!?


On Tuesday 08 June 2010 22:00:06 Andrew Clegg wrote:
> Andrew Clegg wrote:
> > Re. your config, I don't see a minTokenLen in the wiki page for
> > deduplication. Is this a recent addition that's not documented yet?
> 
> Sorry about this -- stupid question -- I should have read back through the
> thread and refreshed my memory.
> 

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: Filtering near-duplicates using TextProfileSignature

Posted by Andrew Clegg <an...@gmail.com>.

Andrew Clegg wrote:
> 
> Re. your config, I don't see a minTokenLen in the wiki page for
> deduplication. Is this a recent addition that's not documented yet?
> 

Sorry about this -- stupid question -- I should have read back through the
thread and refreshed my memory.

Re: Filtering near-duplicates using TextProfileSignature

Posted by Andrew Clegg <an...@gmail.com>.

Neeb wrote:
> 
> Just wondering if you ever managed to run TextProfileSignature based
> deduplication. I would appreciate it if you could send me the code
> fragment for it from  solrconfig.
> 

Actually, the project it was for got postponed and I got distracted by
other things, for now at least.

Re. your config, I don't see a minTokenLen in the wiki page for
deduplication. Is this a recent addition that's not documented yet?

It looks okay to me though -- perhaps you could do some empirical tests to
see if it's working? i.e. add some near-dupes to a collection manually and
see if it finds them?

Andrew.


Re: Filtering near-duplicates using TextProfileSignature

Posted by Neeb <mu...@hotmail.com>.
Hey Andrew,

Just wondering if you ever managed to run TextProfileSignature based
deduplication. I would appreciate it if you could send me the code fragment
for it from  solrconfig.

I currently have something like this, but I'm not sure if I am doing it right:

 <updateRequestProcessorChain name="dedupe">
    <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
      <bool name="enabled">true</bool>
      <str name="signatureField">signature</str>
      <bool name="overwriteDupes">true</bool>
      <str name="fields">title,author,abstract</str>
      <str name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>
      <str name="minTokenLen">3</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain> 

--

Thanks in advance,
-Ali

Re: Filtering near-duplicates using TextProfileSignature

Posted by Erik Hatcher <er...@gmail.com>.
On Jan 12, 2010, at 9:15 AM, Andrew Clegg wrote:
> Thanks Erik, but I'm still a little confused as to exactly where in the
> Solr config I set these parameters.

You'd configure them within the <processor> element, something like this:

    <str name="minTokenLen">5</str>


> The example on the wiki page uses Lookup3Signature which (presumably)
> takes no parameters, so there's no indication in the XML examples of
> where you would set them.

Right, looking at the source code, Lookup3Signature takes no parameters.

Perhaps you could update the wiki with an example once you get it working?

I'm flying a little blind here, just going off the source code, not trying
it out for real.

	Erik


Re: Filtering near-duplicates using TextProfileSignature

Posted by Andrew Clegg <an...@gmail.com>.

Thanks Erik, but I'm still a little confused as to exactly where in the Solr
config I set these parameters.

The example on the wiki page uses Lookup3Signature which (presumably) takes
no parameters, so there's no indication in the XML examples of where you
would set them. Unless I'm missing something.

Thanks again,

Andrew.


Erik Hatcher-4 wrote:
> 
> 
> On Jan 12, 2010, at 7:56 AM, Andrew Clegg wrote:
>> I'm interested in near-dupe removal as mentioned (briefly) here:
>>
>> http://wiki.apache.org/solr/Deduplication
>>
>> However the link for TextProfileSignature hasn't been filled in yet.
>>
>> Does anyone have an example of using TextProfileSignature that
>> demonstrates the tunable parameters mentioned in the wiki?
> 
> There are some comments in the source code*, but they weren't made
> class-level.  I'm fixing that and committing it now, but here's the comment:
> 
> /**
>   * <p>This implementation is copied from Apache Nutch. </p>
>   * <p>An implementation of a page signature. It calculates an MD5 hash
>   * of a plain text "profile" of a page.</p>
>   * <p>The algorithm to calculate a page "profile" takes the plain text
>   * version of a page and performs the following steps:
>   * <ul>
>   * <li>remove all characters except letters and digits, and bring all
>   * characters to lower case,</li>
>   * <li>split the text into tokens (all consecutive non-whitespace
>   * characters),</li>
>   * <li>discard tokens equal or shorter than MIN_TOKEN_LEN (default 2
>   * characters),</li>
>   * <li>sort the list of tokens by decreasing frequency,</li>
>   * <li>round down the counts of tokens to the nearest multiple of QUANT
>   * (<code>QUANT = QUANT_RATE * maxFreq</code>, where <code>QUANT_RATE</code>
>   * is 0.01f by default, and <code>maxFreq</code> is the maximum token
>   * frequency). If <code>maxFreq</code> is higher than 1, then QUANT is
>   * always higher than 2 (which means that tokens with frequency 1 are
>   * always discarded).</li>
>   * <li>tokens, which frequency after quantization falls below QUANT, are
>   * discarded.</li>
>   * <li>create a list of tokens and their quantized frequency, separated
>   * by spaces, in the order of decreasing frequency.</li>
>   * </ul>
>   * This list is then submitted to an MD5 hash calculation.
>   */
> 
> There are two parameters this implementation takes:
> 
>      quantRate = params.getFloat("quantRate", 0.01f);
>      minTokenLen = params.getInt("minTokenLen", 2);
> 
> Hope that helps.
> 
> 	Erik
> 
> 
> 
> * http://svn.apache.org/repos/asf/lucene/solr/trunk/src/java/org/apache/solr/update/processor/TextProfileSignature.java
> 
> 
> 



Re: Filtering near-duplicates using TextProfileSignature

Posted by Erik Hatcher <er...@gmail.com>.
On Jan 12, 2010, at 7:56 AM, Andrew Clegg wrote:
> I'm interested in near-dupe removal as mentioned (briefly) here:
>
> http://wiki.apache.org/solr/Deduplication
>
> However the link for TextProfileSignature hasn't been filled in yet.
>
> Does anyone have an example of using TextProfileSignature that
> demonstrates the tunable parameters mentioned in the wiki?

There are some comments in the source code*, but they weren't made
class-level.  I'm fixing that and committing it now, but here's the comment:

/**
  * <p>This implementation is copied from Apache Nutch. </p>
  * <p>An implementation of a page signature. It calculates an MD5 hash
  * of a plain text "profile" of a page.</p>
  * <p>The algorithm to calculate a page "profile" takes the plain text
  * version of a page and performs the following steps:
  * <ul>
  * <li>remove all characters except letters and digits, and bring all
  * characters to lower case,</li>
  * <li>split the text into tokens (all consecutive non-whitespace
  * characters),</li>
  * <li>discard tokens equal or shorter than MIN_TOKEN_LEN (default 2
  * characters),</li>
  * <li>sort the list of tokens by decreasing frequency,</li>
  * <li>round down the counts of tokens to the nearest multiple of QUANT
  * (<code>QUANT = QUANT_RATE * maxFreq</code>, where <code>QUANT_RATE</code>
  * is 0.01f by default, and <code>maxFreq</code> is the maximum token
  * frequency). If <code>maxFreq</code> is higher than 1, then QUANT is
  * always higher than 2 (which means that tokens with frequency 1 are
  * always discarded).</li>
  * <li>tokens, which frequency after quantization falls below QUANT, are
  * discarded.</li>
  * <li>create a list of tokens and their quantized frequency, separated
  * by spaces, in the order of decreasing frequency.</li>
  * </ul>
  * This list is then submitted to an MD5 hash calculation.
  */

There are two parameters this implementation takes:

     quantRate = params.getFloat("quantRate", 0.01f);
     minTokenLen = params.getInt("minTokenLen", 2);
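
A minimal sketch of the steps in that comment, in plain Java with no Solr
dependency, may make the two parameters easier to reason about. It is not the
Solr source. The class and method names (TextProfileSketch, profile,
signature) are hypothetical, and several details are assumptions: any
character that is not a letter or digit ends a token, QUANT is floored at 2
when maxFreq is above 1 (one reading of the comment), ties are broken
alphabetically, and the profile is joined with single spaces and hashed as
UTF-8. Signatures from this sketch should therefore not be expected to match
what Solr stores.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TextProfileSketch {

    /** Builds the plain-text "profile": kept tokens with quantized counts. */
    static String profile(String text, float quantRate, int minTokenLen) {
        Map<String, Integer> counts = new HashMap<>();
        StringBuilder cur = new StringBuilder();
        int maxFreq = 0;
        // Walk one position past the end so the last token is flushed.
        for (int i = 0; i <= text.length(); i++) {
            char c = i < text.length() ? text.charAt(i) : ' ';
            if (Character.isLetterOrDigit(c)) {
                cur.append(Character.toLowerCase(c));
            } else if (cur.length() > 0) {
                // Tokens of length <= minTokenLen are discarded.
                if (cur.length() > minTokenLen) {
                    int n = counts.merge(cur.toString(), 1, Integer::sum);
                    maxFreq = Math.max(maxFreq, n);
                }
                cur.setLength(0);
            }
        }

        // QUANT = QUANT_RATE * maxFreq; the floor of 2 (or 1) is an assumption.
        int quant = Math.round(maxFreq * quantRate);
        if (quant < 2) {
            quant = (maxFreq > 1) ? 2 : 1;
        }

        // Round each count down to a multiple of QUANT; drop those below QUANT.
        Map<String, Integer> quantized = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int q = (e.getValue() / quant) * quant;
            if (q >= quant) {
                quantized.put(e.getKey(), q);
            }
        }

        // Decreasing frequency; alphabetical tie-break keeps output deterministic.
        List<String> tokens = new ArrayList<>(quantized.keySet());
        tokens.sort((a, b) -> {
            int byFreq = Integer.compare(quantized.get(b), quantized.get(a));
            return byFreq != 0 ? byFreq : a.compareTo(b);
        });

        StringBuilder out = new StringBuilder();
        for (String t : tokens) {
            if (out.length() > 0) out.append(' ');
            out.append(t).append(' ').append(quantized.get(t));
        }
        return out.toString();
    }

    /** MD5 of the profile, as a lower-case hex string. */
    static String signature(String text, float quantRate, int minTokenLen) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(
                profile(text, quantRate, minTokenLen).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String a = "solr solr solr search search engine";
        String b = "Solr, solr SOLR! Search search; engines";
        System.out.println(profile(a, 0.01f, 2)); // prints "search 2 solr 2"
        System.out.println(signature(a, 0.01f, 2).equals(signature(b, 0.01f, 2))); // prints "true"
    }
}
```

The two sample texts differ only in case, punctuation and one rare word, so
with the default parameters they reduce to the same profile and hash. Raising
minTokenLen drops more short tokens, and raising quantRate coarsens the
counts, so both make the match looser.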

Hope that helps.

	Erik



* http://svn.apache.org/repos/asf/lucene/solr/trunk/src/java/org/apache/solr/update/processor/TextProfileSignature.java