Posted to solr-user@lucene.apache.org by Neeb <mu...@hotmail.com> on 2010/06/08 19:45:35 UTC

Re: Filtering near-duplicates using TextProfileSignature

Hey Andrew,

Just wondering if you ever managed to run TextProfileSignature based
deduplication. I would appreciate it if you could send me the code fragment
for it from  solrconfig.

I currently have something like this, but I'm not sure if I am doing it right:

  <updateRequestProcessorChain name="dedupe">
    <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
      <bool name="enabled">true</bool>
      <str name="signatureField">signature</str>
      <bool name="overwriteDupes">true</bool>
      <str name="fields">title,author,abstract</str>
      <str name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>
      <str name="minTokenLen">3</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>

--

Thanks in advance,
-Ali
-- 
View this message in context: http://lucene.472066.n3.nabble.com/Filtering-near-duplicates-using-TextProfileSignature-tp479039p880044.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Filtering near-duplicates using TextProfileSignature

Posted by Neeb <mu...@hotmail.com>.
Thanks guys.
I will try this with some test documents, fingers crossed.
And by the way, I got the minTokenLen parameter from one of the thread
replies (from Erik).

Cheerz,
Ali


-- 
View this message in context: http://lucene.472066.n3.nabble.com/Filtering-near-duplicates-using-TextProfileSignature-tp479039p881840.html

Re: Filtering near-duplicates using TextProfileSignature

Posted by Markus Jelsma <ma...@buyways.nl>.
Here's my config for the updateProcessor. It now uses another signature method,
but I've used TextProfileSignature as well and it works, sort of.


  <updateRequestProcessorChain name="dedupe">
    <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
      <bool name="enabled">true</bool>
      <str name="signatureField">sig</str>
      <bool name="overwriteDupes">true</bool>
      <str name="fields">content</str>
      <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>
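
For reference, the field named in signatureField also has to exist in schema.xml, and it needs to be indexed so that overwriteDupes can find earlier documents carrying the same signature. A minimal sketch, assuming the field is called sig as in the chain above and that the schema defines a field type named string (the type name and attribute choices are assumptions, not something posted in this thread):

```xml
<!-- schema.xml (assumed): field that stores the signature computed by the dedupe chain -->
<field name="sig" type="string" stored="true" indexed="true" multiValued="false" />
```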


Of course, you must reference the updateProcessor chain in your requestHandler;
it's commented out in mine at the moment.


  <requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
<!--
   <lst name="defaults">
    <str name="update.processor">dedupe</str>
   </lst>
-->
  </requestHandler>
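
With the comment markers removed, the handler would look roughly like the sketch below. It assumes the update.processor parameter name shown above; later Solr releases renamed this parameter to update.chain, so the right name depends on the version in use.

```xml
<!-- Sketch: the same handler with the dedupe chain enabled for every update request -->
<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
  <lst name="defaults">
    <!-- "dedupe" must match the name attribute of the updateRequestProcessorChain -->
    <str name="update.processor">dedupe</str>
  </lst>
</requestHandler>
```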


Also, I see you define minTokenLen = 3. Where does that come from? I haven't
seen anything on the wiki specifying such a parameter.


On Tuesday 08 June 2010 19:45:35 Neeb wrote:
> Hey Andrew,
> 
> Just wondering if you ever managed to run TextProfileSignature based
> deduplication. I would appreciate it if you could send me the code fragment
> for it from  solrconfig.
> 
> I have currently something like this, but not sure if I am doing it right:
> 
>   <updateRequestProcessorChain name="dedupe">
>     <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
>       <bool name="enabled">true</bool>
>       <str name="signatureField">signature</str>
>       <bool name="overwriteDupes">true</bool>
>       <str name="fields">title,author,abstract</str>
>       <str name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>
>       <str name="minTokenLen">3</str>
>     </processor>
>     <processor class="solr.LogUpdateProcessorFactory" />
>     <processor class="solr.RunUpdateProcessorFactory" />
>   </updateRequestProcessorChain>
> 
> --
> 
> Thanks in advance,
> -Ali
> 

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: Filtering near-duplicates using TextProfileSignature

Posted by Andrew Clegg <an...@gmail.com>.

Markus Jelsma wrote:
> 
> Well, it got me too! KMail didn't properly order this thread. Can't seem to
> find Hatcher's reply anywhere. ??!!?
> 

Whole thread here:

http://lucene.472066.n3.nabble.com/Filtering-near-duplicates-using-TextProfileSignature-tt479039.html
-- 
View this message in context: http://lucene.472066.n3.nabble.com/Filtering-near-duplicates-using-TextProfileSignature-tp479039p881797.html

Re: Filtering near-duplicates using TextProfileSignature

Posted by Markus Jelsma <ma...@buyways.nl>.
Well, it got me too! KMail didn't properly order this thread. Can't seem to 
find Hatcher's reply anywhere. ??!!?


On Tuesday 08 June 2010 22:00:06 Andrew Clegg wrote:
> Andrew Clegg wrote:
> > Re. your config, I don't see a minTokenLength in the wiki page for
> > deduplication, is this a recent addition that's not documented yet?
> 
> Sorry about this -- stupid question -- I should have read back through the
> thread and refreshed my memory.
> 

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: Filtering near-duplicates using TextProfileSignature

Posted by Andrew Clegg <an...@gmail.com>.

Andrew Clegg wrote:
> 
> Re. your config, I don't see a minTokenLength in the wiki page for
> deduplication, is this a recent addition that's not documented yet?
> 

Sorry about this -- stupid question -- I should have read back through the
thread and refreshed my memory.
-- 
View this message in context: http://lucene.472066.n3.nabble.com/Filtering-near-duplicates-using-TextProfileSignature-tp479039p880385.html

Re: Filtering near-duplicates using TextProfileSignature

Posted by Andrew Clegg <an...@gmail.com>.

Neeb wrote:
> 
> Just wondering if you ever managed to run TextProfileSignature based
> deduplication. I would appreciate it if you could send me the code
> fragment for it from  solrconfig.
> 

Actually, the project that was for got postponed, and I got distracted by
other things, for now at least.

Re. your config, I don't see a minTokenLength in the wiki page for
deduplication, is this a recent addition that's not documented yet?

It looks okay to me though -- perhaps you could do some empirical tests to
see if it's working? i.e. add some near-dupes to a collection manually and
see if it finds them?
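
One way to run that check is to post two documents that differ only slightly and then query to see whether one or both survive. A minimal sketch of such an update message, assuming a uniqueKey field called id and the title, author and abstract fields from the config quoted above (the ids and field values are made up for illustration):

```xml
<!-- Hypothetical test documents: same title and author, abstracts differing by one word -->
<add>
  <doc>
    <field name="id">test-1</field>
    <field name="title">Filtering near-duplicate documents</field>
    <field name="author">A. Example</field>
    <field name="abstract">We describe a method for detecting near-duplicate documents in a large collection of documents using a profile of frequent terms.</field>
  </doc>
  <doc>
    <field name="id">test-2</field>
    <field name="title">Filtering near-duplicate documents</field>
    <field name="author">A. Example</field>
    <field name="abstract">We describe a method for detecting near-duplicate documents in a big collection of documents using a profile of frequent terms.</field>
  </doc>
</add>
```

After a commit, a query for *:* shows the outcome: if both documents received the same signature and overwriteDupes is true, only one of them should come back; if both are returned, their signatures differed. Whether this particular pair hashes to the same value depends on how TextProfileSignature is configured, so treat it as a starting point for experimenting rather than a guaranteed match.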

Andrew.

-- 
View this message in context: http://lucene.472066.n3.nabble.com/Filtering-near-duplicates-using-TextProfileSignature-tp479039p880379.html