Posted to dev@pdfbox.apache.org by "Neil McErlean (JIRA)" <ji...@apache.org> on 2010/11/17 13:59:13 UTC

[jira] Created: (PDFBOX-893) Performance improvement in PDFStreamEngine and Matrix (patch included)

Performance improvement in PDFStreamEngine and Matrix (patch included)
----------------------------------------------------------------------

                 Key: PDFBOX-893
                 URL: https://issues.apache.org/jira/browse/PDFBOX-893
             Project: PDFBox
          Issue Type: Improvement
          Components: Utilities
    Affects Versions: 1.3.1
         Environment: All
            Reporter: Neil McErlean
             Fix For: 1.4.0


I've been profiling PDFBox during text extraction from some large PDF documents, e.g. 2000 pages, mostly text, 20 MB file size.
Some of these documents can take a long time to process, e.g. 40s+, sometimes a lot more than that.
    (I'm using a 2.5 GHz, 4 GB, Mac OS X 10.5.8, Java(TM) SE Runtime Environment (build 1.6.0_22-b04-307-9M3263) with -Xms256m -Xmx1024m -XX:PermSize=256m)

I started by profiling where the code spends its time during text extraction, and I see that a lot of time is spent constructing org.apache.pdfbox.util.Matrix objects.
Screenshot PDFReference_nopatch.tiff shows the most used methods in PDFBox during text extraction for a large document. When this screenshot was taken the percentages had stabilised, and Matrix.<init> apparently accounted for 40% of CPU time, the largest share of any method. I was surprised.

Most of these Matrix instances are constructed within PDFStreamEngine.processEncodedText(byte[]).
On revision 1035639 (pre-1.4.0) this method constructs one Matrix object and then a further 7 within a loop that runs once for each character in the document. So that's a lot of Matrix objects.

The attached patch refactors PDFStreamEngine.processEncodedText so that it now creates 5 reusable Matrix instances outside the loop and 2 within it.
This was achieved by adding a new method to Matrix: Matrix.multiply(Matrix, Matrix) which allows you to multiply two matrices and have the result stored in a specified Matrix object. This has the effect of reducing the number of temporary Matrix objects created during multiplication within PDFStreamEngine. This should save the garbage collector some work.
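
The idea of multiplying into a caller-supplied result object can be sketched roughly as follows. This is a minimal illustration, not the attached patch: SimpleMatrix, its row-major float[9] layout, the created counter and the MatrixReuseSketch wrapper are all hypothetical names made up for this example, and the real org.apache.pdfbox.util.Matrix may differ in signature and internals.

```java
// Hypothetical sketch; none of these names come from PDFBox itself.
class MatrixReuseSketch {

    /** Simplified 3x3 matrix, stored row-major. Stand-in for a real Matrix class. */
    static final class SimpleMatrix {
        /** Counts constructor calls so the two variants can be compared. */
        static long created = 0;

        final float[] v = {1, 0, 0, 0, 1, 0, 0, 0, 1}; // starts as identity

        SimpleMatrix() {
            created++;
        }

        /** Allocating variant: every call builds a new result object. */
        SimpleMatrix multiply(SimpleMatrix other) {
            return multiply(other, new SimpleMatrix());
        }

        /**
         * Reusing variant: writes (this * other) into result and returns it.
         * All nine products are computed into locals before anything is
         * stored, so result may safely be the same object as this or other.
         */
        SimpleMatrix multiply(SimpleMatrix other, SimpleMatrix result) {
            float[] a = this.v;
            float[] b = other.v;
            float r0 = a[0] * b[0] + a[1] * b[3] + a[2] * b[6];
            float r1 = a[0] * b[1] + a[1] * b[4] + a[2] * b[7];
            float r2 = a[0] * b[2] + a[1] * b[5] + a[2] * b[8];
            float r3 = a[3] * b[0] + a[4] * b[3] + a[5] * b[6];
            float r4 = a[3] * b[1] + a[4] * b[4] + a[5] * b[7];
            float r5 = a[3] * b[2] + a[4] * b[5] + a[5] * b[8];
            float r6 = a[6] * b[0] + a[7] * b[3] + a[8] * b[6];
            float r7 = a[6] * b[1] + a[7] * b[4] + a[8] * b[7];
            float r8 = a[6] * b[2] + a[7] * b[5] + a[8] * b[8];
            float[] r = result.v;
            r[0] = r0; r[1] = r1; r[2] = r2;
            r[3] = r3; r[4] = r4; r[5] = r5;
            r[6] = r6; r[7] = r7; r[8] = r8;
            return result;
        }
    }

    public static void main(String[] args) {
        SimpleMatrix scale = new SimpleMatrix();
        scale.v[0] = 2f;
        scale.v[4] = 2f;
        SimpleMatrix shift = new SimpleMatrix();
        shift.v[6] = 10f;
        shift.v[7] = 5f;
        int glyphs = 1000; // stands in for "each character in the document"

        // Old pattern: one temporary object per multiplication inside the loop.
        long before = SimpleMatrix.created;
        for (int i = 0; i < glyphs; i++) {
            scale.multiply(shift);
        }
        long allocating = SimpleMatrix.created - before;

        // New pattern: one scratch object created outside the loop and reused.
        SimpleMatrix scratch = new SimpleMatrix();
        before = SimpleMatrix.created;
        for (int i = 0; i < glyphs; i++) {
            scale.multiply(shift, scratch);
        }
        long reusing = SimpleMatrix.created - before;

        System.out.println("objects created: allocating=" + allocating
                + ", reusing=" + reusing);
    }
}
```

The trade-off in this sketch is that the scratch object is overwritten on every iteration, so a caller that needs to keep a result beyond the next multiplication still has to copy it or allocate a fresh instance for it.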

I profiled PDFBox again with this patch applied, and Matrix.<init> now accounts for only 30% of the CPU time.
Unfortunately, although fewer temporary objects are being created, the patch doesn't have an appreciable effect on the time it takes to extract text from my large documents.

The profiling continues...



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PDFBOX-893) Performance improvement in PDFStreamEngine and Matrix (patch included)

Posted by "Neil McErlean (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neil McErlean updated PDFBOX-893:
---------------------------------

    Priority: Minor  (was: Major)



[jira] Updated: (PDFBOX-893) Performance improvement in PDFStreamEngine and Matrix (patch included)

Posted by "Neil McErlean (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neil McErlean updated PDFBOX-893:
---------------------------------

    Attachment: PDFBOX_perf_patch.txt

Patch file



[jira] Resolved: (PDFBOX-893) Performance improvement in PDFStreamEngine and Matrix (patch included)

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler resolved PDFBOX-893.
---------------------------------------

    Resolution: Fixed
      Assignee: Andreas Lehmkühler

I added the patch in revision 1044823 as proposed by Neil McErlean. I made some minor tweaks to the PDFStreamEngine part, as some of that code had been altered in the meantime.

Thanks for the contribution!!
