Posted to solr-user@lucene.apache.org by arnaud gaudinat <ar...@gmail.com> on 2011/01/14 14:15:53 UTC

Is deduplication possible during Tika extract?

Hello,

Here is an excerpt of my solrconfig.xml:

<requestHandler name="/update/extract" 
class="org.apache.solr.handler.extraction.ExtractingRequestHandler" 
startup="lazy">
<lst name="defaults">

<str name="update.processor">dedupe</str>

<!-- All the main content goes into "text"... if you need to return
            the extracted text or do highlighting, use a stored field. -->
<str name="fmap.content">text</str>
<str name="lowernames">true</str>
<str name="uprefix">ignored_</str>

<!-- capture link hrefs but ignore div attributes -->
<str name="captureAttr">true</str>
<str name="fmap.a">links</str>
<str name="fmap.div">ignored_</str>
</lst>
</requestHandler>

and

<updateRequestProcessorChain name="dedupe">
<processor 
class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
<bool name="enabled">true</bool>
<str name="signatureField">signature</str>
<bool name="overwriteDupes">false</bool>
<str name="fields">text</str>
<str 
name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>
</processor>
<processor class="solr.LogUpdateProcessorFactory" />
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

Deduplication works when I use only "/update", but not when Solr does an 
extract with Tika!
Is deduplication possible during Tika extract?

Thanks in advance,
Arno

Re: Is deduplication possible during Tika extract?

Posted by Markus Jelsma <ma...@openindex.io>.

In my opinion it should work for every update handler. If you're really sure 
your configuration is fine and it still doesn't work, you might have to file an 
issue.

Your configuration looks all right, but don't forget you've configured 
overwriteDupes=false, so a document whose signature matches an existing one 
will not overwrite it!
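
For illustration only, here is a sketch of the same "dedupe" chain with 
overwriteDupes switched to true. Everything else is copied from the posted 
configuration. It assumes the "signature" field is declared as an indexed 
field in schema.xml, which the excerpt does not show, and nothing in this 
thread confirms that the change fixes the Tika extract case.

```xml
<updateRequestProcessorChain name="dedupe">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <!-- Assumption: "signature" is an indexed field in schema.xml -->
    <str name="signatureField">signature</str>
    <!-- true: a new document overwrites existing documents with the same signature.
         false (as posted): the signature is computed and stored, nothing is overwritten. -->
    <bool name="overwriteDupes">true</bool>
    <str name="fields">text</str>
    <str name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

With overwriteDupes=false, a quick way to see whether the chain runs at all 
for "/update/extract" is to check whether extracted documents end up with a 
value in the signature field.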
