Posted to solr-user@lucene.apache.org by Neeb <mu...@hotmail.com> on 2010/06/08 19:45:35 UTC
Re: Filtering near-duplicates using TextProfileSignature
Hey Andrew,
Just wondering if you ever managed to run TextProfileSignature based
deduplication. I would appreciate it if you could send me the code fragment
for it from solrconfig.
I currently have something like this, but I'm not sure if I am doing it right:
<updateRequestProcessorChain name="dedupe">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <bool name="overwriteDupes">true</bool>
    <str name="fields">title,author,abstract</str>
    <str name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>
    <str name="minTokenLen">3</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
--
Thanks in advance,
-Ali
--
View this message in context: http://lucene.472066.n3.nabble.com/Filtering-near-duplicates-using-TextProfileSignature-tp479039p880044.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Filtering near-duplicates using TextProfileSignature
Posted by Neeb <mu...@hotmail.com>.
Thanks guys.
I will try this with some test documents, fingers crossed.
And by the way, I got the minTokenLen parameter from one of the thread
replies (from Erik).
Cheerz,
Ali
--
View this message in context: http://lucene.472066.n3.nabble.com/Filtering-near-duplicates-using-TextProfileSignature-tp479039p881840.html
Re: Filtering near-duplicates using TextProfileSignature
Posted by Markus Jelsma <ma...@buyways.nl>.
Here's my config for the updateProcessor. It now uses another signature method,
but I've used TextProfileSignature as well and it works - sort of.
<updateRequestProcessorChain name="dedupe">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">sig</str>
    <bool name="overwriteDupes">true</bool>
    <str name="fields">content</str>
    <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
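The field named in signatureField also has to be declared in schema.xml. A minimal sketch, assuming the schema already defines a field type called string (that type name is an assumption, not something shown in this thread):

```xml
<!-- schema.xml: field that receives the computed signature.
     The name must match signatureField in the chain ("sig" here).
     The "string" field type is assumed to exist in your schema. -->
<field name="sig" type="string" stored="true" indexed="true" multiValued="false" />
```

The field needs to be indexed for overwriteDupes to find earlier documents with the same signature.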
Of course, you must reference the update processor chain in your requestHandler. It's
commented out in mine at the moment.
<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
  <!--
  <lst name="defaults">
    <str name="update.processor">dedupe</str>
  </lst>
  -->
</requestHandler>
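With the comment markers removed, the handler would look roughly like this. It is only a sketch of the snippet above with the defaults enabled. As far as I know, later Solr releases renamed the update.processor parameter to update.chain, so check the name against your version.

```xml
<!-- solrconfig.xml: route XML updates through the "dedupe" chain.
     "dedupe" must match the name attribute of the updateRequestProcessorChain. -->
<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
  <lst name="defaults">
    <str name="update.processor">dedupe</str>
  </lst>
</requestHandler>
```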
Also, I see you define minTokenLen = 3. Where does that come from? I haven't
seen anything on the wiki specifying such a parameter.
On Tuesday 08 June 2010 19:45:35 Neeb wrote:
> Hey Andrew,
>
> Just wondering if you ever managed to run TextProfileSignature based
> deduplication. I would appreciate it if you could send me the code fragment
> for it from solrconfig.
>
> I have currently something like this, but not sure if I am doing it right:
>
> <updateRequestProcessorChain name="dedupe">
> <processor
> class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
> <bool name="enabled">true</bool>
> <str name="signatureField">signature</str>
> <bool name="overwriteDupes">true</bool>
> <str name="fields">title,author,abstract</str>
> <str name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>
> <str name="minTokenLen">3</str>
> </processor>
> <processor class="solr.LogUpdateProcessorFactory" />
> <processor class="solr.RunUpdateProcessorFactory" />
> </updateRequestProcessorChain>
>
> --
>
> Thanks in advance,
> -Ali
>
Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
Re: Filtering near-duplicates using TextProfileSignature
Posted by Andrew Clegg <an...@gmail.com>.
Markus Jelsma wrote:
>
> Well, it got me too! KMail didn't properly order this thread. Can't seem to
> find Hatcher's reply anywhere. ??!!?
>
Whole thread here:
http://lucene.472066.n3.nabble.com/Filtering-near-duplicates-using-TextProfileSignature-tt479039.html
--
View this message in context: http://lucene.472066.n3.nabble.com/Filtering-near-duplicates-using-TextProfileSignature-tp479039p881797.html
Re: Filtering near-duplicates using TextProfileSignature
Posted by Markus Jelsma <ma...@buyways.nl>.
Well, it got me too! KMail didn't properly order this thread. Can't seem to
find Hatcher's reply anywhere. ??!!?
On Tuesday 08 June 2010 22:00:06 Andrew Clegg wrote:
> Andrew Clegg wrote:
> > Re. your config, I don't see a minTokenLen in the wiki page for
> > deduplication, is this a recent addition that's not documented yet?
>
> Sorry about this -- stupid question -- I should have read back through the
> thread and refreshed my memory.
>
Re: Filtering near-duplicates using TextProfileSignature
Posted by Andrew Clegg <an...@gmail.com>.
Andrew Clegg wrote:
>
> Re. your config, I don't see a minTokenLen in the wiki page for
> deduplication, is this a recent addition that's not documented yet?
>
Sorry about this -- stupid question -- I should have read back through the
thread and refreshed my memory.
--
View this message in context: http://lucene.472066.n3.nabble.com/Filtering-near-duplicates-using-TextProfileSignature-tp479039p880385.html
Re: Filtering near-duplicates using TextProfileSignature
Posted by Andrew Clegg <an...@gmail.com>.
Neeb wrote:
>
> Just wondering if you ever managed to run TextProfileSignature based
> deduplication. I would appreciate it if you could send me the code
> fragment for it from solrconfig.
>
Actually the project that was for got postponed and I got distracted by
other things, for now at least.
Re. your config, I don't see a minTokenLen in the wiki page for
deduplication, is this a recent addition that's not documented yet?
It looks okay to me though -- perhaps you could do some empirical tests to
see if it's working? i.e. add some near-dupes to a collection manually and
see if it finds them?
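One way to run such a test, sketched as a Solr XML update message. The id field and all field values are hypothetical; title, author and abstract are the fields listed in your chain. The two documents differ only in case and punctuation, and whether TextProfileSignature actually gives them the same signature is exactly what the test would show.

```xml
<!-- Hypothetical test documents; ids and text are made up for illustration. -->
<add>
  <doc>
    <field name="id">test-1</field>
    <field name="title">Filtering near-duplicate documents</field>
    <field name="author">A. Author</field>
    <field name="abstract">We describe a simple way to filter near-duplicate documents.</field>
  </doc>
  <doc>
    <field name="id">test-2</field>
    <field name="title">Filtering near-duplicate documents.</field>
    <field name="author">A. Author</field>
    <field name="abstract">We describe a simple way to filter NEAR-DUPLICATE documents!</field>
  </doc>
</add>
```

After posting this through the handler that uses the dedupe chain and committing, a query for these two ids should return one document if the signatures matched (overwriteDupes is true, so the later one replaces the earlier one), and two if they did not.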
Andrew.
--
View this message in context: http://lucene.472066.n3.nabble.com/Filtering-near-duplicates-using-TextProfileSignature-tp479039p880379.html