Posted to user@tika.apache.org by Paul Borgermans <pa...@gmail.com> on 2022/07/13 15:32:33 UTC

Tika 2.4.x how to configure the scientific parsers

Hi

I am struggling a bit with Tika 2.4 server and with activating the scientific
parsers, which are now in a separate module. So far I have not found a clear
example or instructions, so here is how far I have got making it work, more or
less, in an Ubuntu 20 environment (I am no Java expert):

- install external dependencies (GDAL and co)
- create a tika-config.xml file which specifies some individual parsers
found in the scientific module (see also below)
- start the server with the jars as classpath arguments and call the main
Tika class:
$ java -cp \
"tika-server-standard-2.4.1.jar:tika-parser-scientific-package-2.4.1.jar" \
org.apache.tika.server.core.TikaServerCli -h '*' -c tika-config.xml

My questions:
- Is this the best approach to get the scientific parsers activated? Can it
be done for all included parsers in one go?
- It looks like GDAL (which also parses image formats) is
deactivating Tesseract for some image formats. Is there a way to undo
this, or to specify the order of parsers?

Thanks!
Paul

========tika-config.xml ===============

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<properties>
  <!--for example: <mimeTypeRepository resource="/org/apache/tika/mime/tika-mimetypes.xml"/>-->
  <service-loader dynamic="true" loadErrorHandler="WARN"/>
  <encodingDetectors>
    <encodingDetector class="org.apache.tika.detect.DefaultEncodingDetector"/>
  </encodingDetectors>
  <translator class="org.apache.tika.language.translate.DefaultTranslator"/>
  <detectors>
    <detector class="org.apache.tika.detect.DefaultDetector"/>
  </detectors>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser"/>
    <parser class="org.apache.tika.parser.netcdf.NetCDFParser"/>
    <parser class="org.apache.tika.parser.gdal.GDALParser"/>
  </parsers>
</properties>

============================

Re: Tika 2.4.x how to configure the scientific parsers

Posted by Tim Allison <ta...@apache.org>.
That looks right to me.  The scientific parsers will be automatically added
to the parser list and you shouldn't have to configure them.
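
As a minimal sketch of what that implies (assuming the scientific package jar
is on the classpath, as in the java -cp command in the original message), the
parsers section could probably be reduced to DefaultParser alone, which picks
up the parsers it finds through the service loader. This is an illustration,
not a tested configuration:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- DefaultParser loads every parser registered on the classpath,
         which should include the scientific ones when their jar is present -->
    <parser class="org.apache.tika.parser.DefaultParser"/>
  </parsers>
</properties>
```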

The GDALParser does preempt parsing of image files that it covers, and they
won't also be parsed by TesseractOCRParser.  If you'd like Tesseract, turn
off GDAL via the parsers section or limit which file types GDAL handles by
decorating it with supported mime-types.
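
A sketch of the second option, using the parser-exclude and mime-exclude
elements of tika-config.xml. The two image types listed are only illustrative
assumptions; substitute the types you actually want Tesseract to handle:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Load all service-loaded parsers except GDAL -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.gdal.GDALParser"/>
    </parser>
    <!-- Re-add GDAL, but keep it away from these image types (illustrative list) -->
    <parser class="org.apache.tika.parser.gdal.GDALParser">
      <mime-exclude>image/jpeg</mime-exclude>
      <mime-exclude>image/png</mime-exclude>
    </parser>
  </parsers>
</properties>
```

To turn GDAL off entirely, leave out the second parser element.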

In https://issues.apache.org/jira/browse/TIKA-3812, we document how the
parser ordering changed (it was fixed) between 2.4.0 and 2.4.1, in case that
is relevant.

If you need specifics on any of the above, please let us know.

Best,

      Tim
