Posted to dev@tika.apache.org by "Tim Allison (Jira)" <ji...@apache.org> on 2022/03/03 16:41:00 UTC
[jira] [Comment Edited] (TIKA-3668) High CPU utilization in Tika 2.2.0

    [ https://issues.apache.org/jira/browse/TIKA-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500874#comment-17500874 ] 

Tim Allison edited comment on TIKA-3668 at 3/3/22, 4:40 PM:
------------------------------------------------------------

Thank you.  I tried three things this morning.

1) Manually reviewed and re-tested the image-rendering and inline-image-extraction code in the PDFParser.  With debugging and custom logging, I could see that the code works as expected, even when running multi-threaded.  If the header says no-ocr, the PDFParser neither renders pages nor extracts inline images.

2) In a single thread, I ran all the files in our unit tests with custom logging to detect whether the TesseractOCRParser was being called on any of the file types when the header was set to no_ocr.  I couldn't find any problems: the TesseractOCRParser was never called to parse.

3) I ran pidstat with three settings; the client was single-threaded.  I ran pidstat against the forked process, not the primary watcher process.  The results all basically look the same to me.

{noformat}
disable ocr parser via tika-config and do not include "no-ocr" header
~$ pidstat -p 254595 -u -T ALL 
Linux 5.13.0-30-generic () 	03/03/2022 	_x86_64_	(8 CPU)

11:31:47 AM   UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
11:31:47 AM  1000    254595    0.16    0.00    0.00    0.00    0.17     2  java

11:31:47 AM   UID       PID    usr-ms system-ms  guest-ms  Command
11:31:47 AM  1000    254595    442080     11820         0  java

disable ocr parser via tika-config and include "no-ocr" header

~$ pidstat -p 250033 -u -T ALL
Linux 5.13.0-30-generic () 	03/03/2022 	_x86_64_	(8 CPU)

11:08:39 AM   UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
11:08:39 AM  1000    250033    0.16    0.00    0.00    0.00    0.17     5  java

11:08:39 AM   UID       PID    usr-ms system-ms  guest-ms  Command
11:08:39 AM  1000    250033    439390     11780         0  java


disable ocr via header (do not disable tesseract via tika config)
$ pidstat -p 252228 -u -T ALL 
Linux 5.13.0-30-generic () 	03/03/2022 	_x86_64_	(8 CPU)

11:16:50 AM   UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
11:16:50 AM  1000    252228    0.16    0.00    0.00    0.00    0.17     5  java

11:16:50 AM   UID       PID    usr-ms system-ms  guest-ms  Command
11:16:50 AM  1000    252228    437250     12380         0  java
{noformat}
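
To put a rough number on "basically look the same", here is a small sketch that sums the cumulative usr-ms and system-ms values from the three runs above and reports how far apart they are. The figures are copied from the pidstat output; the run labels and helper names are mine, and the comparison assumes each run processed the same workload.

```python
# Cumulative CPU time in milliseconds, copied from the three pidstat runs:
# (usr-ms, system-ms) per setting. The labels are shorthand, not Tika terms.
RUNS = {
    "config only": (442080, 11820),      # OCR parser disabled via tika-config
    "config + header": (439390, 11780),  # disabled via tika-config and header
    "header only": (437250, 12380),      # disabled via header only
}


def total_ms(usr_ms, system_ms):
    """Total CPU time charged to the process (user + system)."""
    return usr_ms + system_ms


def relative_spread(values):
    """Gap between the largest and smallest value, relative to the smallest."""
    lo, hi = min(values), max(values)
    return (hi - lo) / lo


totals = {name: total_ms(*ms) for name, ms in RUNS.items()}
spread = relative_spread(list(totals.values()))

for name, ms in totals.items():
    print(f"{name:16s} {ms} ms")
print(f"spread between runs: {spread:.2%}")
```

On these numbers the three totals differ by less than 1%.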


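For anyone who wants to repeat the header-based runs, a single-threaded client request might look like the sketch below. The comment above does not name the header or the endpoint, so this assumes tika-server on its default port 9998 and assumes the header meant is X-Tika-PDFOcrStrategy with the value no_ocr; adjust both if your setup differs. The function only builds the request, and the send is left commented out so the sketch runs without a server.

```python
import urllib.request

# Assumptions, not taken from the comment above: tika-server is listening on
# localhost:9998, and the "no-ocr" header is X-Tika-PDFOcrStrategy: no_ocr.
TIKA_URL = "http://localhost:9998/tika"


def build_request(pdf_bytes, disable_ocr=True):
    """Build (but do not send) a PUT request that asks tika-server for text."""
    headers = {"Accept": "text/plain", "Content-Type": "application/pdf"}
    if disable_ocr:
        # Per-request switch; assumed header name and value, see note above.
        headers["X-Tika-PDFOcrStrategy"] = "no_ocr"
    return urllib.request.Request(
        TIKA_URL, data=pdf_bytes, headers=headers, method="PUT"
    )


req = build_request(b"%PDF-1.4 placeholder bytes")

# To send it against a running server:
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode("utf-8"))
```

Calling build_request(data, disable_ocr=False) gives the run where OCR is controlled only by tika-config.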
> High CPU utilization in Tika 2.2.0
> ----------------------------------
>
>                 Key: TIKA-3668
>                 URL: https://issues.apache.org/jira/browse/TIKA-3668
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Manjunath Dhongadi
>            Priority: Major
>
> Recently we upgraded Tika from version 1.26 to 2.2.0.
> CPU utilization has gone up drastically (6 to 8 times higher), both with Tesseract enabled and with Tesseract disabled.
> We are using tika-parsers-standard-package 2.2.0.
> Is this normal behavior for Tika 2.2.0?
> Are any fine-tuning parameters available for this?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)