Posted to dev@pdfbox.apache.org by "Neil McErlean (JIRA)" <ji...@apache.org> on 2010/11/17 14:01:17 UTC
[jira] Updated: (PDFBOX-893) Performance improvement in PDFStreamEngine and Matrix (patch included)

     [ https://issues.apache.org/jira/browse/PDFBOX-893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neil McErlean updated PDFBOX-893:
---------------------------------

    Attachment: PDFBOX_perf_patch.txt

Patch file

> Performance improvement in PDFStreamEngine and Matrix (patch included)
> ----------------------------------------------------------------------
>
>                 Key: PDFBOX-893
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-893
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Utilities
>    Affects Versions: 1.3.1
>         Environment: All
>            Reporter: Neil McErlean
>             Fix For: 1.4.0
>
>         Attachments: PDFBOX_perf_patch.txt
>
>
> I've been profiling PDFBox during text extraction from some large PDF documents, e.g. 2000 pages, mostly text, 20 MB file size.
> Some of these documents can take a long time to process, e.g. 40s or more, sometimes a lot more than that.
>     (I'm using a 2.5 GHz, 4 GB, Mac OS X 10.5.8, Java(TM) SE Runtime Environment (build 1.6.0_22-b04-307-9M3263) with -Xms256m -Xmx1024m -XX:PermSize=256m)
> I've begun by profiling where the code spends its time during text extraction and I see that a lot of time is spent constructing org.apache.pdfbox.util.Matrix objects.
> Screenshot PDFReference_nopatch.tiff shows the most used methods in PDFBox during text extraction for a large document. When this screenshot was taken the percentages had stabilised, and Matrix.<init> apparently accounted for 40% of CPU time, more than any other method. I was surprised.
> Most of these Matrix instances are being constructed within PDFStreamEngine.processEncodedText(byte[]).
> On revision 1035639 (pre-1.4.0) this method constructs one Matrix object and then a further 7 within a loop which is called for each character in the document. So that's a lot of Matrix objects.
> The attached patch refactors PDFStreamEngine.processEncodedText so that it now creates 5 reusable Matrix instances outside the loop and 2 within it.
> This was achieved by adding a new method to Matrix: Matrix.multiply(Matrix, Matrix) which allows you to multiply two matrices and have the result stored in a specified Matrix object. This has the effect of reducing the number of temporary Matrix objects created during multiplication within PDFStreamEngine. This should save the garbage collector some work.
> I profiled PDFBox again with this patch included and Matrix.<init> now accounts for only 30% of the CPU time.
> Unfortunately, whilst fewer temporary objects are being created, it doesn't have an appreciable effect on the time it takes to extract text from my large documents.
> The profiling continues...
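
The idea described above, multiplying two matrices into a caller-supplied result object so that a hot loop can reuse scratch instances, can be sketched roughly as follows. This is only an illustration under stated assumptions: SketchMatrix is a hypothetical stand-in and not the real org.apache.pdfbox.util.Matrix, and the method shape (multiply(other, result)) is assumed from the description rather than taken from the attached patch.

```java
/** Hypothetical 3x3 matrix used for illustration; NOT org.apache.pdfbox.util.Matrix. */
final class SketchMatrix {
    // Row-major 3x3 values, initialised to the identity matrix.
    final float[] v = {1, 0, 0, 0, 1, 0, 0, 0, 1};

    /** Allocating style: every call creates a fresh result object. */
    SketchMatrix multiply(SketchMatrix other) {
        return multiply(other, new SketchMatrix());
    }

    /**
     * Reusing style (assumed shape): writes this * other into result and
     * returns result. No objects are allocated here.
     */
    SketchMatrix multiply(SketchMatrix other, SketchMatrix result) {
        float[] a = this.v;
        float[] b = other.v;
        // Compute into locals first so that result may safely be the same
        // object as this or other.
        float r0 = a[0] * b[0] + a[1] * b[3] + a[2] * b[6];
        float r1 = a[0] * b[1] + a[1] * b[4] + a[2] * b[7];
        float r2 = a[0] * b[2] + a[1] * b[5] + a[2] * b[8];
        float r3 = a[3] * b[0] + a[4] * b[3] + a[5] * b[6];
        float r4 = a[3] * b[1] + a[4] * b[4] + a[5] * b[7];
        float r5 = a[3] * b[2] + a[4] * b[5] + a[5] * b[8];
        float r6 = a[6] * b[0] + a[7] * b[3] + a[8] * b[6];
        float r7 = a[6] * b[1] + a[7] * b[4] + a[8] * b[7];
        float r8 = a[6] * b[2] + a[7] * b[5] + a[8] * b[8];
        float[] out = result.v;
        out[0] = r0; out[1] = r1; out[2] = r2;
        out[3] = r3; out[4] = r4; out[5] = r5;
        out[6] = r6; out[7] = r7; out[8] = r8;
        return result;
    }
}
```

In a per-character loop, a caller would create the scratch SketchMatrix instances once before the loop and pass them as the result argument on each iteration, instead of calling the one-argument multiply and discarding its return value every time.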
