Posted to solr-user@lucene.apache.org by arnaud gaudinat <ar...@gmail.com> on 2011/01/14 14:15:53 UTC
Is deduplication possible during Tika extract?
Hello,
Here is an excerpt of my solrconfig.xml:
<requestHandler name="/update/extract"
                class="org.apache.solr.handler.extraction.ExtractingRequestHandler"
                startup="lazy">
  <lst name="defaults">
    <str name="update.processor">dedupe</str>
    <!-- All the main content goes into "text"... if you need to return
         the extracted text or do highlighting, use a stored field. -->
    <str name="fmap.content">text</str>
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>
    <!-- capture link hrefs but ignore div attributes -->
    <str name="captureAttr">true</str>
    <str name="fmap.a">links</str>
    <str name="fmap.div">ignored_</str>
  </lst>
</requestHandler>
and
<updateRequestProcessorChain name="dedupe">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <bool name="overwriteDupes">false</bool>
    <str name="fields">text</str>
    <str name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
Deduplication works when I post to "/update" only, but not when Solr does an
extract with Tika via "/update/extract"!
Is deduplication possible during Tika extract?
Thanks in advance,
Arno
Re: Is deduplication possible during Tika extract?
Posted by Markus Jelsma <ma...@openindex.io>.
In my opinion it should work for every update handler. If you're really sure
your configuration is fine and it still doesn't work, you might have to file an
issue.
Your configuration looks alright but don't forget you've configured
overwriteDupes=false!
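To illustrate that last point: as far as I understand it, with
overwriteDupes=false the signature is still computed and stored in the
"signature" field, but an existing document with the same signature is not
replaced, so duplicates stay in the index. A minimal sketch of the same chain
with the flag switched on, assuming everything else in your configuration stays
as posted (only the overwriteDupes line differs from your excerpt):

```xml
<updateRequestProcessorChain name="dedupe">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <!-- Changed from false: a new document replaces any indexed document
         that carries the same signature value. -->
    <bool name="overwriteDupes">true</bool>
    <str name="fields">text</str>
    <str name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

This is only a sketch of the setting I mean, not a confirmed fix for the
/update/extract case. If you keep overwriteDupes=false, you would have to
detect duplicates yourself by looking for documents that share a signature.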