Posted to dev@tika.apache.org by "Tyler Palsulich (JIRA)" <ji...@apache.org> on 2015/03/02 03:21:04 UTC
[jira] [Closed] (TIKA-836) parsing really slow on some documents
[ https://issues.apache.org/jira/browse/TIKA-836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tyler Palsulich closed TIKA-836.
--------------------------------
Resolution: Cannot Reproduce
We can't reproduce this without the problem files. If you still have them, please upload them and reopen!
> parsing really slow on some documents
> -------------------------------------
>
> Key: TIKA-836
> URL: https://issues.apache.org/jira/browse/TIKA-836
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.0
> Environment: CentOS 4.x/5.x/6.x
> Reporter: Rob Tulloh
>
> We are seeing that Tika sometimes takes a very long time to parse some content (likely PDF). For example, take the following EML file, which contains 4 documents (2 PDF, 1 MS Excel, 1 text):
> {noformat}
> fgrep --binary-file=text Content-Type: XXX.eml
> Content-Type: multipart/mixed;
> Content-Type: multipart/alternative;
> Content-Type: text/plain;
> Content-Type: text/html;
> Content-Type: application/octet-stream;
> Content-Type: application/octet-stream;
> Content-Type: application/vnd.ms-excel;
> du -sh XXX.eml
> 6.0M XXX.eml
> {noformat}
> Note that Tika takes nearly 30 minutes to process this content even though the source is only 6.0 MB in size:
> {noformat}
> time java -Xmx2G -jar ../../tika-app-1.0.jar -m XXX.eml >meta.out
> WARN - Did not found XRef object at specified startxref position 230521
> WARN - Did not found XRef object at specified startxref position 3742379
> real 29m16.913s
> user 18m17.050s
> sys 0m19.465s
> {noformat}
> Is there any way to configure Tika (in particular when it is used via Solr) to process files more quickly?
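Without the problem files the slowdown itself cannot be diagnosed, but a caller can at least cap how long a single document is allowed to take. The following is a minimal sketch under stated assumptions, not a Tika or Solr setting: BoundedParse and runWithTimeout are hypothetical names, the code uses only java.util.concurrent, and the parse call itself is passed in as a Callable.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper (not part of Tika): runs a task under a wall-clock limit. */
final class BoundedParse {

    private BoundedParse() {}

    /**
     * Runs the task on a worker thread and waits at most timeoutSeconds.
     * Returns the task's result, or fallback if the limit is exceeded.
     */
    static <T> T runWithTimeout(Callable<T> task, long timeoutSeconds, T fallback)
            throws ExecutionException, InterruptedException {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "bounded-parse");
            t.setDaemon(true); // a stuck parse must not keep the JVM alive
            return t;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutSeconds, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            // Requests an interrupt; a CPU-bound parser may not react to it.
            future.cancel(true);
            return fallback;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

With Tika on the classpath, the task could be something like `() -> new Tika().parseToString(file)`, where `file` is assumed to be the extracted attachment. Note that this only stops the caller from waiting: if the parser ignores the interrupt, the worker thread keeps running in the background, so running the parse in a separate process is the more robust option when documents routinely stall.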
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)