Posted to user@tika.apache.org by aravinth thangasami <ar...@gmail.com> on 2017/08/04 08:42:30 UTC
Performance Improvement AutoDetectParser
Hi all,
We are using Tika 1.13.
While instantiating AutoDetectParser, we found that the
CompositeExternalParser, which we don't actually need, takes up extra time.
This is because of ExifTool and FFmpeg.
I tried removing CompositeExternalParser from the jar, and we see an
improvement.
With TikaConfig, I can't achieve the same performance by excluding classes
from AutoDetectParser.
Is there any other way to do it without removing it from the jar?
Thanks
Aravinth
Re: Performance Improvement AutoDetectParser
Posted by aravinth thangasami <ar...@gmail.com>.
I have already tried instantiating from a Tika config, but it still makes
calls to the external parsers.
My config and test code are:
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Default Parser for most things, but never use the
         CompositeExternalParser or the TesseractOCRParser -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.external.CompositeExternalParser"/>
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
  </parsers>
</properties>
import org.apache.tika.config.TikaConfig;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import java.io.FileInputStream;
import java.io.FileOutputStream;

public class Autodetect
{
    public static void main(String[] args) throws Exception
    {
        // Time only the construction of the config and the parser.
        long start = System.currentTimeMillis();
        TikaConfig config = new TikaConfig("tika-config.xml");
        AutoDetectParser parser = new AutoDetectParser(config);
        System.out.println("Time for init " + (System.currentTimeMillis() - start));

        // Close both streams even if parsing fails.
        try (FileInputStream is = new FileInputStream("test.zip");
             FileOutputStream os = new FileOutputStream("out.txt"))
        {
            BodyContentHandler contentHandler = new BodyContentHandler(os);
            Metadata metadata = new Metadata();
            ParseContext parseContext = new ParseContext();
            // Register the parser so that embedded documents are parsed with it too.
            parseContext.set(Parser.class, parser);
            parser.parse(is, contentHandler, metadata, parseContext);
        }
    }
}
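A single currentTimeMillis() reading like the one above also includes
one-time JVM class loading, so it can help to repeat the measurement and
compare the first run with later ones. The following is only a rough sketch
using the standard library: the InitTimer class and its timeMillis method are
hypothetical names, and the ArrayList workload is a stand-in that would be
replaced by something like () -> new AutoDetectParser(config).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.function.Supplier;

// Hypothetical helper, not part of Tika.
class InitTimer {
    /** Calls the factory `runs` times and returns the elapsed milliseconds of each call. */
    static <T> long[] timeMillis(Supplier<T> factory, int runs) {
        long[] elapsed = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            factory.get(); // the result is discarded; only the time is of interest
            elapsed[i] = (System.nanoTime() - start) / 1_000_000;
        }
        return elapsed;
    }

    public static void main(String[] args) {
        // Stand-in workload (assumption); swap in the parser construction to measure it.
        long[] samples = timeMillis(() -> new ArrayList<String>(1_000_000), 3);
        System.out.println("Init times in ms: " + Arrays.toString(samples));
    }
}
```

If the first sample is much larger than the rest, most of the cost is paid
once per JVM rather than on every instantiation.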
On Fri, Aug 4, 2017 at 2:31 PM, Nick Burch <ap...@gagravarr.org> wrote:
> On Fri, 4 Aug 2017, aravinth thangasami wrote:
>
>> we are using Tika 1.13.
>>
>
> 1.15 is out!
>
>> While instantiating AutoDetectParser we found that the
>> CompositeExternalParser which actually we don't need, takes up more time.
>> It because of ExifTool & FFmpeg.
>>
>> I tried with removing CompositeExternalParser from Jar and we are seeing
>> an
>> Improvement.
>>
>
> You should be able to exclude that from DefaultParser in config with a
> parser-exclude:
> http://tika.apache.org/1.16/configuring.html#Configuring_Parsers
>
> Then make sure you create your AutoDetectParser from the config with that
> exclude
>
> Nick
>
Re: Performance Improvement AutoDetectParser
Posted by Nick Burch <ap...@gagravarr.org>.
On Fri, 4 Aug 2017, aravinth thangasami wrote:
> we are using Tika 1.13.
1.15 is out!
> While instantiating AutoDetectParser we found that the
> CompositeExternalParser which actually we don't need, takes up more time.
> It because of ExifTool & FFmpeg.
>
> I tried with removing CompositeExternalParser from Jar and we are seeing an
> Improvement.
You should be able to exclude that from DefaultParser in config with a
parser-exclude:
http://tika.apache.org/1.16/configuring.html#Configuring_Parsers
Then make sure you create your AutoDetectParser from the config with that
exclude
Nick
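Before looking further into why an exclude has no effect, it may be worth
confirming that the config file is well-formed and really lists the classes
you expect. The following is a minimal sketch using only the JDK's XML parser;
it does not call Tika at all. The ConfigExcludes class, its excludedParsers
method, and the inline SAMPLE string are hypothetical, with SAMPLE mirroring
the config posted in this thread.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Tika.
class ConfigExcludes {
    // Inline copy of the config from this thread, used here instead of a file.
    static final String SAMPLE =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
        + "<properties><parsers>"
        + "<parser class=\"org.apache.tika.parser.DefaultParser\">"
        + "<parser-exclude class=\"org.apache.tika.parser.external.CompositeExternalParser\"/>"
        + "<parser-exclude class=\"org.apache.tika.parser.ocr.TesseractOCRParser\"/>"
        + "</parser></parsers></properties>";

    /** Returns the class attribute of every parser-exclude element, in document order. */
    static List<String> excludedParsers(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList nodes = doc.getElementsByTagName("parser-exclude");
        List<String> classes = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            classes.add(((Element) nodes.item(i)).getAttribute("class"));
        }
        return classes;
    }

    public static void main(String[] args) throws Exception {
        for (String cls : excludedParsers(SAMPLE)) {
            System.out.println("excluded: " + cls);
        }
    }
}
```

This only checks what the file says. Whether Tika honours the excludes at
instantiation time is a separate question.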