Posted to solr-user@lucene.apache.org by "alessandro.rieti@virgilio.it" <al...@virgilio.it> on 2010/12/05 16:42:01 UTC
ExtractingRequestHandler configuration
Hi All,
I added the ExtractingRequestHandler to my Solr 1.4.1 instance, using the default configuration that I found on the wiki (http://wiki.apache.org/solr/ExtractingRequestHandler):
<requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="fmap.Last-Modified">last_modified</str>
    <str name="uprefix">ignored_</str>
  </lst>
  <!-- Optional. Specify a path to a tika configuration file. See the Tika docs for details. -->
  <!--<str name="tika.config">/my/path/to/tika.config</str>-->
  <!-- Optional. Specify one or more date formats to parse. See DateUtil.DEFAULT_DATE_FORMATS for default date formats -->
  <!--
  <lst name="date.formats">
    <str>yyyy-MM-dd</str>
  </lst>
  -->
</requestHandler>
Now, when I ingest HTML and PDF documents via the SolrJ API, I find documents like this in the Solr index:
stored/uncompressed,indexed,tokenized<Content-Type:application/pdf>
stored/uncompressed,indexed,omitNorms<PID:eims-document:25445#objects/eims-document:226946/datastreams/PDF/content>
stored/uncompressed,indexed,tokenized<content: stream_size 1168557 Content-Type application/pdf >
stored/uncompressed,indexed,tokenized<stream_size:1168557>
stored/uncompressed,indexed,omitNorms<timestamp:2010-12-05T12:34:44.423>
How can I configure the handler to extract the text of the PDF/HTML documents and add it to the content field?
Also, in order to update a document in the index, is it possible to ingest multiple binary objects with the same PID?
Regards
Alessandro