Posted to user@tika.apache.org by aravinth thangasami <ar...@gmail.com> on 2017/08/04 08:42:30 UTC

Performance Improvement AutoDetectParser

Hi all,

We are using Tika 1.13.

While instantiating AutoDetectParser, we found that the
CompositeExternalParser, which we don't actually need, takes up extra
time. This is because of ExifTool and FFmpeg.

I tried removing CompositeExternalParser from the jar, and we see an
improvement.

With TikaConfig, I can't get the same performance by excluding classes from
AutoDetectParser.

Is there any other way to do it without removing the class from the jar?



Thanks
Aravinth

Re: Performance Improvement AutoDetectParser

Posted by aravinth thangasami <ar...@gmail.com>.
I already tried instantiating from a Tika config, but it still makes calls
to the external parsers. My config and test code are:

<?xml version="1.0" encoding="UTF-8"?>
<properties>
    <parsers>
        <!-- Default Parser for most things, but never use the
             CompositeExternalParser or the TesseractOCRParser -->
        <parser class="org.apache.tika.parser.DefaultParser">
            <parser-exclude class="org.apache.tika.parser.external.CompositeExternalParser"/>
            <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
        </parser>
    </parsers>
</properties>



import org.apache.tika.config.TikaConfig;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

import java.io.FileInputStream;
import java.io.FileOutputStream;

public class Autodetect
{
    public static void main(String[] args) throws Exception
    {
        // Time how long it takes to load the config and build the parser
        long start = System.currentTimeMillis();
        TikaConfig config = new TikaConfig("tika-config.xml");
        AutoDetectParser parser = new AutoDetectParser(config);
        System.out.println("Time for init " + (System.currentTimeMillis() - start));

        // Close both streams even if parsing fails
        try (FileInputStream is = new FileInputStream("test.zip");
             FileOutputStream os = new FileOutputStream("out.txt"))
        {
            BodyContentHandler contentHandler = new BodyContentHandler(os);
            Metadata metadata = new Metadata();
            ParseContext parseContext = new ParseContext();
            // Register the parser so embedded documents are parsed with it too
            parseContext.set(Parser.class, parser);
            parser.parse(is, contentHandler, metadata, parseContext);
        }
    }
}



On Fri, Aug 4, 2017 at 2:31 PM, Nick Burch <ap...@gagravarr.org> wrote:

> On Fri, 4 Aug 2017, aravinth thangasami wrote:
>
>> We are using Tika 1.13.
>>
>
> 1.15 is out!
>
>> While instantiating AutoDetectParser, we found that the
>> CompositeExternalParser, which we don't actually need, takes up extra
>> time. This is because of ExifTool and FFmpeg.
>>
>> I tried removing CompositeExternalParser from the jar, and we see an
>> improvement.
>>
>
> You should be able to exclude that from DefaultParser in config with a
> parser-exclude:
> http://tika.apache.org/1.16/configuring.html#Configuring_Parsers
>
> Then make sure you create your AutoDetectParser from the config that
> contains that exclude.
>
> Nick
>

Re: Performance Improvement AutoDetectParser

Posted by Nick Burch <ap...@gagravarr.org>.
On Fri, 4 Aug 2017, aravinth thangasami wrote:
> We are using Tika 1.13.

1.15 is out!

> While instantiating AutoDetectParser, we found that the
> CompositeExternalParser, which we don't actually need, takes up extra
> time. This is because of ExifTool and FFmpeg.
>
> I tried removing CompositeExternalParser from the jar, and we see an
> improvement.

You should be able to exclude that from DefaultParser in config with a 
parser-exclude:
http://tika.apache.org/1.16/configuring.html#Configuring_Parsers

Then make sure you create your AutoDetectParser from the config that
contains that exclude.

Nick
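
To check whether a parser-exclude actually changes start-up cost, it may help
to time the initialisation in a repeatable way. The following is only a
minimal sketch: InitTimer and timeMillis are hypothetical names, not part of
Tika, and the StringBuilder in main is a stand-in so the example runs without
Tika on the classpath. A Callable is used rather than a Supplier so that the
timed code may throw checked exceptions, as the TikaConfig constructor does.

```java
import java.util.concurrent.Callable;

/** Hypothetical helper (not part of Tika) that times a single construction. */
final class InitTimer {

    private InitTimer() {}

    /** Runs the factory once and returns the elapsed wall-clock time in milliseconds. */
    static <T> long timeMillis(Callable<T> factory) throws Exception {
        long start = System.nanoTime();
        T instance = factory.call();
        long elapsedNanos = System.nanoTime() - start;
        // Reject a null result so a broken factory is not mistaken for a fast one
        if (instance == null) {
            throw new IllegalStateException("factory returned null");
        }
        return elapsedNanos / 1_000_000;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the real work; with Tika available this could be
        // () -> new AutoDetectParser(new TikaConfig("tika-config.xml"))
        long ms = timeMillis(() -> new StringBuilder("stand-in for a parser"));
        System.out.println("Time for init: " + ms + " ms");
    }
}
```

With Tika on the classpath, the same helper could wrap both
() -> new AutoDetectParser() and the config-based construction shown in the
comment. The first construction in a JVM also pays for class loading, so the
two variants are probably best compared in separate runs rather than one
after the other in the same process.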