Posted to solr-user@lucene.apache.org by "alessandro.rieti@virgilio.it" <al...@virgilio.it> on 2010/12/05 16:42:01 UTC

ExtractingRequestHandler configuration

Hi all,
I added the ExtractingRequestHandler to my Solr 1.4.1 instance, using the default configuration that I found on the wiki (http://wiki.apache.org/solr/ExtractingRequestHandler):

<requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="fmap.Last-Modified">last_modified</str>
    <str name="uprefix">ignored_</str>
  </lst>
  <!-- Optional. Specify a path to a Tika configuration file. See the Tika docs for details. -->
  <!-- <str name="tika.config">/my/path/to/tika.config</str> -->

  <!-- Optional. Specify one or more date formats to parse. See DateUtil.DEFAULT_DATE_FORMATS for default date formats. -->
  <!--
  <lst name="date.formats">
    <str>yyyy-MM-dd</str>
  </lst>
  -->
</requestHandler>

Now, when I ingest HTML and PDF documents via the SolrJ API, I find documents like this in the Solr index:


stored/uncompressed,indexed,tokenized<Content-Type:application/pdf>
stored/uncompressed,indexed,omitNorms<PID:eims-document:25445#objects/eims-document:226946/datastreams/PDF/content>

stored/uncompressed,indexed,tokenized<content:  stream_size 1168557   Content-Type application/pdf         >
stored/uncompressed,indexed,tokenized<stream_size:1168557>
stored/uncompressed,indexed,omitNorms<timestamp:2010-12-05T12:34:44.423>
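
For illustration only, the parameters of such an ingest request might look like the sketch below. It assumes that PID is the unique key field and that the request goes to the /update/extract handler configured above. The class name, the helper methods and the example PID value are hypothetical. Only JDK classes are used, so the sketch does not depend on a particular SolrJ version.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class ExtractRequestSketch {

    // URL-encode one key or value as UTF-8.
    static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a compliant JVM.
            throw new IllegalStateException(e);
        }
    }

    // Build the query string for a POST to /update/extract.
    // Assumption: "PID" is the unique key field of the schema.
    public static String buildExtractQuery(String pid) {
        Map<String, String> params = new LinkedHashMap<String, String>();
        params.put("literal.PID", pid);                     // literal value stored with the document
        params.put("uprefix", "ignored_");                  // prefix for fields unknown to the schema
        params.put("fmap.Last-Modified", "last_modified");  // rename a Tika metadata field
        params.put("commit", "true");                       // commit after this request

        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append(enc(e.getKey())).append('=').append(enc(e.getValue()));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "doc:1" is a made-up PID used only for this example.
        System.out.println("/update/extract?" + buildExtractQuery("doc:1"));
    }
}
```

With SolrJ, the same parameters would typically be set on the update request object rather than encoded by hand. The binary file itself is sent as the body of the request.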


How can I configure the handler to extract the text of the PDF/HTML documents and add it to the content field?
Also, in order to update a document in the index, is it possible to ingest multiple binary objects with the same PID?

Regards
Alessandro