Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2014/07/25 13:44:39 UTC
[jira] [Comment Edited] (TIKA-1375) Decrease memory consumption when extracting images from PDFs

    [ https://issues.apache.org/jira/browse/TIKA-1375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074306#comment-14074306 ] 

Tim Allison edited comment on TIKA-1375 at 7/25/14 11:43 AM:
-------------------------------------------------------------

I ran four configurations of Tika against a random selection of 10k PDFs from govdocs1 to make sure that there wouldn't be any surprises if we added the three calls to clear().  These were all single-threaded runs on an 8GB Linux VM.

The first run was the most recent SNAPSHOT (1.6 Baseline). The second run was after the three calls to clear() were added; because embedded images are not extracted by default, the only call that was actually reached was the one at the end of each page.  The third run was the 1.6 SNAPSHOT/Baseline with image extraction turned on, and the fourth was the clear() version with image extraction turned on.
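To illustrate what "clear at the end of each page" means in the runs above, here is a minimal sketch. The Page interface, FakePage, and extractAll below are hypothetical stand-ins written for this example only; they are not PDFBox or Tika types, and the actual patch may be structured differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PerPageClearSketch {

    /** Hypothetical stand-in for a parsed page; not a PDFBox or Tika type. */
    public interface Page {
        String extractText();
        void clear(); // release whatever the page has cached
    }

    /** Toy page used only to make the sketch runnable. */
    public static class FakePage implements Page {
        private final String text;
        private boolean cleared = false;

        public FakePage(String text) { this.text = text; }

        @Override public String extractText() { return text; }
        @Override public void clear() { cleared = true; }
        public boolean isCleared() { return cleared; }
    }

    /** Processes pages one at a time and clears each page once it has been handled. */
    public static List<String> extractAll(List<? extends Page> pages) {
        List<String> texts = new ArrayList<>();
        for (Page page : pages) {
            try {
                texts.add(page.extractText());
            } finally {
                // Clear even if extraction throws, so cached data is not
                // held for the rest of the document.
                page.clear();
            }
        }
        return texts;
    }

    public static void main(String[] args) {
        List<FakePage> pages = Arrays.asList(new FakePage("page one"), new FakePage("page two"));
        System.out.println(extractAll(pages));
    }
}
```

The try/finally is a design choice of this sketch: it keeps the clear() call on the same path whether or not a page fails.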

All four runs hit the same number of exceptions.  The two runs without image extraction produced exactly the same number of metadata elements as each other, as did the two runs with image extraction.

Adding .clear() improved speed when not extracting images (average 272.5 -> 243.6 ms, about 11% faster) and decreased speed by a much smaller amount percentage-wise when extracting images (average 861.5 -> 888.2 ms, about 3% slower).

||Run||Average Millis||Median Millis||
|Tika 1.6 Baseline|272.5|113.0|
|Tika 1.6 Page.clear()|243.6|85.0|
|Tika 1.6 Baseline Image Extraction|861.5|120.5|
|Tika 1.6 Image Extraction w/ 3x clear()|888.2|124.0|
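For anyone who wants to check the percentage-wise comparison, the relative changes can be worked out directly from the table. This is just a small arithmetic sketch; the class and method names are made up for this example and the numbers are copied from the table above.

```java
import java.util.Locale;

public class ClearTimingDelta {

    /** Percent change from baseline to candidate; negative means faster. */
    public static double percentChange(double baseline, double candidate) {
        return (candidate - baseline) / baseline * 100.0;
    }

    private static void report(String label, double baseline, double candidate) {
        System.out.println(String.format(Locale.ROOT, "%-28s %+.1f%%",
                label, percentChange(baseline, candidate)));
    }

    public static void main(String[] args) {
        // Milliseconds, copied from the table above.
        report("no images, average:", 272.5, 243.6);   // about -10.6%
        report("no images, median:", 113.0, 85.0);     // about -24.8%
        report("with images, average:", 861.5, 888.2); // about +3.1%
        report("with images, median:", 120.5, 124.0);  // about +2.9%
    }
}
```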





> Decrease memory consumption when extracting images from PDFs
> ------------------------------------------------------------
>
>                 Key: TIKA-1375
>                 URL: https://issues.apache.org/jira/browse/TIKA-1375
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>            Priority: Minor
>             Fix For: 1.6
>
>
> This patch applies changes made in PDFBOX-2101 to decrease memory consumption during extraction of embedded images.  This also applies the recommendation by [~tilman] on the PDFBox dev [list | http://mail-archives.apache.org/mod_mbox/pdfbox-dev/201407.mbox/%3c53CFF0CE.9090507@t-online.de%3e] to clear resources after handling each page.


