Posted to solr-user@lucene.apache.org by "Robertson, Eric J" <Er...@gd-ms.com> on 2018/10/25 22:07:20 UTC

Solr Cell Input Parameter tika.config

Hello all,

I am trying to specify a Tika config to use when posting a PDF to Solr Cell, because we may want to override the default Tika configuration depending on the type of document being ingested.

The docs list tika.config as an input parameter to the Solr Cell endpoint, but in my tests it does not seem to work or be acknowledged at all.

Does anyone have a working example using this input parameter?

I am running Solr 7.4.0 on Windows 7.

Thanks!

Re: Solr Cell Input Parameter tika.config

Posted by Jan Høydahl <ja...@cominvent.com>.
The tika.config param is documented here:
https://lucene.apache.org/solr/guide/7_5/uploading-data-with-solr-cell-using-apache-tika.html#configuring-the-solr-extractingrequesthandler

I notice that the code (https://github.com/apache/lucene-solr/blob/964cc88cee7d62edf03a923e3217809d630af5d5/solr/contrib/extraction/src/java/org/apache/solr/handler/extraction/ExtractingRequestHandler.java#L65-L77) uses "new File(tikaConfigLoc)" to resolve the Tika config file, whereas it should probably load it through SolrResourceLoader so that it also works when the configuration is stored in ZooKeeper.
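
To illustrate what that means for a config on local disk: as far as I can tell from the linked code, a relative value is resolved against the core's config directory and an absolute value is used as-is. The fragment below is only a sketch of the two forms. The absolute path is a made-up location, and a real handler would contain just one of the two entries.

```xml
<!-- Option 1: relative path, resolved against the core's conf directory -->
<str name="tika.config">tika-config.xml</str>

<!-- Option 2: absolute path on the Solr host (hypothetical location) -->
<str name="tika.config">C:/solr/tika/tika-config.xml</str>
```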

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> On 26 Oct 2018, at 05:10, Yasufumi Mizoguchi <ya...@gmail.com> wrote:
> 
> Hello,
> 
> I could not find any code that parses the tika.config parameter from the Solr
> request.
> Maybe the tika.config parameter can only be defined in solrconfig.xml, as
> follows.
> 
> <requestHandler name="/update/extract"
>                startup="lazy"
>                class="solr.extraction.ExtractingRequestHandler" >
>  <str name="tika.config">tika-config.xml</str>
>  <lst name="defaults">
>    <str name="lowernames">true</str>
>    <str name="uprefix">ignored_</str>
>    <str name="captureAttr">true</str>
>    <str name="fmap.a">links</str>
>    <str name="fmap.div">ignored_</str>
>  </lst>
> </requestHandler>
> 
> Thanks,
> Yasufumi
> 


Re: Solr Cell Input Parameter tika.config

Posted by Yasufumi Mizoguchi <ya...@gmail.com>.
Hello,

I could not find any code that parses the tika.config parameter from the Solr
request.
Maybe the tika.config parameter can only be defined in solrconfig.xml, as
follows.

<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler" >
  <str name="tika.config">tika-config.xml</str>
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>
    <str name="captureAttr">true</str>
    <str name="fmap.a">links</str>
    <str name="fmap.div">ignored_</str>
  </lst>
</requestHandler>
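
Since tika.config appears to be read from each handler's own configuration, one possible way to vary the Tika config by document type would be to register a second handler with its own file and post each type to the matching path. This is only a sketch and I have not tested it on 7.4.0. The handler name /update/extract-pdf and the file name tika-config-pdf.xml are made up for illustration.

```xml
<!-- Default handler: no tika.config, so Tika's default configuration is used -->
<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler" >
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>
  </lst>
</requestHandler>

<!-- Hypothetical second handler for PDFs, with its own Tika config file -->
<requestHandler name="/update/extract-pdf"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler" >
  <str name="tika.config">tika-config-pdf.xml</str>
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>
  </lst>
</requestHandler>
```

PDFs would then be posted to /update/extract-pdf instead of /update/extract.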

Thanks,
Yasufumi
