
Posted to user@tika.apache.org by Clemens Wyss DEV <cl...@mysign.ch> on 2013/12/22 10:07:30 UTC

How can parsing a 5Mb take 3minutes?

I have a 3 MB PDF file (and others) for which extracting the content takes 3 minutes. In my test I am using AutoDetectParser (and PDFParser).
I have built Tika from source, i.e. I am using the 1.5 snapshot.

Can anybody explain why/how this is possible?

Where/how can I send the document in question?

Regards
Clemens

Re: How can parsing a 5Mb take 3minutes?

Posted by Jeroen Reijn <j....@onehippo.com>.

Did you check what it was doing by taking a thread dump or using a profiler?
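
In case it helps, here is a minimal sketch of taking a thread dump from inside the JVM, using only the standard java.lang.management API. The class name ThreadDumpSketch, the "parse-worker" thread and slowTask() are hypothetical; slowTask() is only a stand-in for the slow parser.parse(...) call and is not Tika code. From outside the process, running jstack <pid> should give you the same kind of information.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

class ThreadDumpSketch {

    /** Returns a textual snapshot of all live threads with their stack traces. */
    static String dumpAllThreads() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        StringBuilder out = new StringBuilder();
        // false, false: leave out locked monitors and ownable synchronizers
        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            out.append('"').append(info.getThreadName()).append("\" ")
               .append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("    at ").append(frame).append('\n');
            }
            out.append('\n');
        }
        return out.toString();
    }

    /** Hypothetical stand-in for the slow parse call: spins for about two seconds. */
    static void slowTask() {
        long end = System.nanoTime() + 2_000_000_000L;
        while (System.nanoTime() < end) {
            // busy loop, so the worker shows up as RUNNABLE in the dump
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(ThreadDumpSketch::slowTask, "parse-worker");
        worker.setDaemon(true);
        worker.start();
        Thread.sleep(200); // give the worker time to get into the slow code
        System.out.println(dumpAllThreads());
    }
}
```

Taking a few dumps some seconds apart while the parse is running usually shows whether the same frames stay on top of the stack, which tells you where the time goes.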


On Sun, Dec 22, 2013 at 3:25 PM, Clemens Wyss DEV <cl...@mysign.ch> wrote:

> Filed a bug, https://issues.apache.org/jira/browse/TIKA-1213, although
> I'm not sure whether it's a bug or me using the API incorrectly.
>
> Could the newly introduced NonSequentialPDFParser "help"?



-- 
Jeroen Reijn

Re: How can parsing a 5Mb take 3minutes?

Posted by Clemens Wyss DEV <cl...@mysign.ch>.

Filed a bug, https://issues.apache.org/jira/browse/TIKA-1213, although I'm not sure whether it's a bug or me using the API incorrectly.

Could the newly introduced NonSequentialPDFParser "help"?
