Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2017/08/01 16:46:00 UTC

[jira] [Comment Edited] (TIKA-2434) Language detection slow, cpu intensive, CLI interrupts work

    [ https://issues.apache.org/jira/browse/TIKA-2434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16109266#comment-16109266 ] 

Tim Allison edited comment on TIKA-2434 at 8/1/17 4:45 PM:
-----------------------------------------------------------

1) Great!  [~chrismattmann], do you have recommendations for adding headless to the brew script?  Can anyone foresee any fallout from running Tika in headless mode?  I should probably run Tika headless against our regression corpus to see if there are any diffs.

2) In TIKA-2374, [~gagravarr] requested that this be added for the -z option.  However, I thought it would be bizarre for a user to be able to extract all images but then not get text via OCR on those images.  [~gagravarr], should I back off and do just this: extract inline images only for -z but not for text extraction?  Or should we leave this as is?

So that I understand, you want to run OCR on regular "attachment" images inside PDFs but not on their inline images?


was (Author: tallison@mitre.org):
1) Great!  [~chrismattmann], do you have recommendations for adding headless to the brew script?  Can anyone foresee any fallout from running Tika in headless mode?  I should probably run Tika headless against our regression corpus to see if there are any diffs.

2) In TIKA-2374, [~gagravarr] requested that this be added for the -z option.  However, I thought it would be bizarre for a user to be able to extract all images but then not get text via OCR on those images.  [~gagravarr], should I back off and do just this: extract inline images only for -z but not for text extraction?  Or should we leave this as is?

So that I understand, you want to run OCR on the PDFs but not on their inline images?

> Language detection slow, cpu intensive, CLI interrupts work
> -----------------------------------------------------------
>
>                 Key: TIKA-2434
>                 URL: https://issues.apache.org/jira/browse/TIKA-2434
>             Project: Tika
>          Issue Type: Bug
>          Components: cli
>    Affects Versions: 1.16
>         Environment: OS X 10.11.6, JRE 1.8.0_25
>            Reporter: Stefan Karner
>
> Since version 1.16, when using tika -l FILE, it takes a lot longer than e.g. 1.15.
> Also, when batch processing a bunch of files in the background, the Java runtime icon pops up when processing the next file, stealing the input focus from whatever other application I'm currently working on, thus constantly interrupting my work.
> Also, the Java runtime uses from 100% to 400% CPU when executing Tika.
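
For readers unfamiliar with the "headless" mode discussed in the comment: the focus-stealing Java icon described above is the kind of symptom that the standard `java.awt.headless` system property is meant to avoid. The following is only a minimal sketch of how that property behaves in a plain JVM, not Tika's actual fix; the class name `HeadlessCheck` and the method `enableHeadless` are hypothetical names chosen for illustration.

```java
import java.awt.GraphicsEnvironment;

// Hypothetical illustration class; not part of Tika.
public class HeadlessCheck {

    // Sets the standard java.awt.headless property and reports whether
    // AWT now considers the JVM headless. The property has to be set
    // before AWT first reads it, so call this early in startup.
    public static boolean enableHeadless() {
        System.setProperty("java.awt.headless", "true");
        return GraphicsEnvironment.isHeadless();
    }

    public static void main(String[] args) {
        System.out.println("headless=" + enableHeadless());
    }
}
```

The same property can be passed on the command line as `-Djava.awt.headless=true` when launching the JVM, which is presumably what adding headless to a launcher script would amount to. Whether this changes Tika's extraction output is exactly the open question raised in the comment.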


