Posted to dev@pdfbox.apache.org by "Neil McErlean (JIRA)" <ji...@apache.org> on 2010/11/17 14:01:17 UTC
[jira] Updated: (PDFBOX-893) Performance improvement in
PDFStreamEngine and Matrix (patch included)
[ https://issues.apache.org/jira/browse/PDFBOX-893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Neil McErlean updated PDFBOX-893:
---------------------------------
Attachment: PDFBOX_perf_patch.txt
Patch file
> Performance improvement in PDFStreamEngine and Matrix (patch included)
> ----------------------------------------------------------------------
>
> Key: PDFBOX-893
> URL: https://issues.apache.org/jira/browse/PDFBOX-893
> Project: PDFBox
> Issue Type: Improvement
> Components: Utilities
> Affects Versions: 1.3.1
> Environment: All
> Reporter: Neil McErlean
> Fix For: 1.4.0
>
> Attachments: PDFBOX_perf_patch.txt
>
>
> I've been profiling PDFBox during text extraction from some large PDF documents, e.g. 2000 pages, mostly text, 20 MB file size.
> Some of these documents can take a long time to process, e.g. 40 s or more, sometimes a lot more than that.
> (I'm using a 2.5 GHz, 4 GB, Mac OS X 10.5.8, Java(TM) SE Runtime Environment (build 1.6.0_22-b04-307-9M3263) with -Xms256m -Xmx1024m -XX:PermSize=256m)
> I've begun by profiling where the code spends its time during text extraction and I see that a lot of time is spent constructing org.apache.pdfbox.util.Matrix objects.
> Screenshot PDFReference_nopatch.tiff shows the most used methods in PDFBox during text extraction for a large document. When this screenshot was taken the percentages had stabilised, and Matrix.<init> apparently accounts for 40% of CPU time, the largest share of any method. I was surprised.
> Most of these Matrix instances are being constructed within PDFStreamEngine.processEncodedText(byte[]).
> On revision 1035639 (pre-1.4.0) this method constructs one Matrix object and then a further 7 within a loop that runs once for each character in the document. So that's a lot of Matrix objects.
> The attached patch refactors PDFStreamEngine.processEncodedText so that it now creates 5 reusable Matrix instances outside the loop and 2 within it.
> This was achieved by adding a new method to Matrix: Matrix.multiply(Matrix, Matrix) which allows you to multiply two matrices and have the result stored in a specified Matrix object. This has the effect of reducing the number of temporary Matrix objects created during multiplication within PDFStreamEngine. This should save the garbage collector some work.
> I profiled PDFBox again with this patch included and Matrix.<init> now accounts for only 30% of the CPU time.
> Unfortunately, whilst fewer temporary objects are being created, it doesn't have an appreciable effect on the time it takes to extract text from my large documents.
> The profiling continues...
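
For readers without the attachment, the multiply-into-a-result idea described above can be sketched roughly as follows. This is not the contents of PDFBOX_perf_patch.txt: SketchMatrix, MatrixReuseSketch, translation and advanceAll are hypothetical names invented for illustration, and the row-major 3x3 float layout is an assumption rather than the actual layout of org.apache.pdfbox.util.Matrix.

```java
/**
 * Hypothetical stand-in for a 3x3 affine matrix. This is NOT
 * org.apache.pdfbox.util.Matrix; the layout and names are assumptions.
 */
final class SketchMatrix {
    // Row-major 3x3 values, initialised to the identity matrix.
    final float[] m = {1, 0, 0, 0, 1, 0, 0, 0, 1};

    /** Builds a translation matrix (tx, ty stored in the third row). */
    static SketchMatrix translation(float tx, float ty) {
        SketchMatrix t = new SketchMatrix();
        t.m[6] = tx;
        t.m[7] = ty;
        return t;
    }

    /** Allocating form: returns a new matrix holding this * other. */
    SketchMatrix multiply(SketchMatrix other) {
        return multiply(other, new SketchMatrix());
    }

    /**
     * Non-allocating form: writes this * other into result and returns result.
     * The product is held in locals first, so result may safely be the same
     * object as this or other.
     */
    SketchMatrix multiply(SketchMatrix other, SketchMatrix result) {
        float[] a = this.m;
        float[] b = other.m;
        float r0 = a[0] * b[0] + a[1] * b[3] + a[2] * b[6];
        float r1 = a[0] * b[1] + a[1] * b[4] + a[2] * b[7];
        float r2 = a[0] * b[2] + a[1] * b[5] + a[2] * b[8];
        float r3 = a[3] * b[0] + a[4] * b[3] + a[5] * b[6];
        float r4 = a[3] * b[1] + a[4] * b[4] + a[5] * b[7];
        float r5 = a[3] * b[2] + a[4] * b[5] + a[5] * b[8];
        float r6 = a[6] * b[0] + a[7] * b[3] + a[8] * b[6];
        float r7 = a[6] * b[1] + a[7] * b[4] + a[8] * b[7];
        float r8 = a[6] * b[2] + a[7] * b[5] + a[8] * b[8];
        float[] out = result.m;
        out[0] = r0; out[1] = r1; out[2] = r2;
        out[3] = r3; out[4] = r4; out[5] = r5;
        out[6] = r6; out[7] = r7; out[8] = r8;
        return result;
    }
}

/** Hypothetical per-character loop that reuses matrices instead of allocating. */
final class MatrixReuseSketch {
    /** Advances a text position glyphCount times and returns the final x offset. */
    static float advanceAll(int glyphCount, float advance) {
        SketchMatrix textPos = new SketchMatrix();                // allocated once
        SketchMatrix step = SketchMatrix.translation(advance, 0); // allocated once
        for (int i = 0; i < glyphCount; i++) {
            // In-place update: the result object is reused on every iteration.
            textPos.multiply(step, textPos);
        }
        return textPos.m[6];
    }

    public static void main(String[] args) {
        System.out.println(advanceAll(4, 2.5f));
    }
}
```

The allocating overload is kept so existing callers would not need to change; only hot loops would opt in to passing a reusable result object. As the issue itself reports, reducing allocations this way may not noticeably change wall-clock time, so any such change is worth measuring rather than assuming.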
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.