Posted to dev@pdfbox.apache.org by "Timo Boehme (Created) (JIRA)" <ji...@apache.org> on 2011/11/22 15:30:40 UTC

[jira] [Created] (PDFBOX-1175) Stream parsing performance improvement + patch

Stream parsing performance improvement + patch
----------------------------------------------

                 Key: PDFBOX-1175
                 URL: https://issues.apache.org/jira/browse/PDFBOX-1175
             Project: PDFBox
          Issue Type: Improvement
          Components: Parsing
    Affects Versions: 1.7.0
            Reporter: Timo Boehme
            Priority: Minor


Stream parsing is one of the performance-critical parts of PDFBox, since most data is typically stored in streams. Although the method that copies stream data from the file to the random access buffer (BaseParser#readUntilEndStream) was already sped up some time ago, there is still room for improvement.

The problem with the current implementation is that it reads and writes the data byte by byte. I have rewritten the method to use byte arrays for I/O and reduced the number of comparisons needed to find 'endstream'/'endobj'. This makes stream parsing 7-8 times faster and parsing of a normal 10-page PDF 3-4 times faster.

See the attached file, which is a drop-in replacement for the readUntilEndStream method in BaseParser.
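
To illustrate the general idea of chunked copying with a marker search, here is a minimal sketch. It is not the attached BaseParser_readUntilEndStream.java: the class name EndMarkerCopySketch, the method copyUntilEndMarker and its bufSize parameter are hypothetical, and it works on plain java.io streams rather than PDFBox's own classes. It reads into a byte array, only attempts a full comparison at positions holding the byte 'e', and carries the last 8 bytes over to the next read so that a marker split across two reads is still found.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; not the code from the attached patch.
final class EndMarkerCopySketch {
    private static final byte[] ENDSTREAM = "endstream".getBytes(StandardCharsets.ISO_8859_1);
    private static final byte[] ENDOBJ = "endobj".getBytes(StandardCharsets.ISO_8859_1);
    // Longest incomplete marker that can sit at the end of a chunk.
    private static final int KEEP = ENDSTREAM.length - 1;

    /** Copies bytes from in to out up to (excluding) the first marker; returns the count copied. */
    static long copyUntilEndMarker(InputStream in, OutputStream out, int bufSize) throws IOException {
        if (bufSize < ENDSTREAM.length) {
            throw new IllegalArgumentException("bufSize must be at least " + ENDSTREAM.length);
        }
        byte[] buf = new byte[bufSize];
        int filled = 0;   // valid bytes currently held in buf
        long written = 0; // bytes already passed on to out
        while (true) {
            int n = in.read(buf, filled, buf.length - filled);
            if (n < 0) {
                // End of input without a marker: pass on whatever is left.
                out.write(buf, 0, filled);
                return written + filled;
            }
            filled += n;
            int idx = indexOfMarker(buf, filled);
            if (idx >= 0) {
                out.write(buf, 0, idx);
                return written + idx;
            }
            // Write everything except the tail, which may hold the start of a marker.
            int flush = Math.max(0, filled - KEEP);
            out.write(buf, 0, flush);
            written += flush;
            System.arraycopy(buf, flush, buf, 0, filled - flush);
            filled -= flush;
        }
    }

    // Returns the index of the first complete marker in buf[0..len), or -1.
    private static int indexOfMarker(byte[] buf, int len) {
        for (int i = 0; i < len; i++) {
            // Both markers start with 'e', so most positions cost one comparison.
            if (buf[i] == 'e' && (matches(buf, i, len, ENDSTREAM) || matches(buf, i, len, ENDOBJ))) {
                return i;
            }
        }
        return -1;
    }

    private static boolean matches(byte[] buf, int off, int len, byte[] marker) {
        if (off + marker.length > len) {
            return false;
        }
        for (int j = 0; j < marker.length; j++) {
            if (buf[off + j] != marker[j]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        byte[] fragment = "BT /F1 12 Tf ET\nendstream\nendobj\n".getBytes(StandardCharsets.ISO_8859_1);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long copied = copyUntilEndMarker(new ByteArrayInputStream(fragment), out, 2048);
        System.out.println("copied " + copied + " bytes");
    }
}
```

Note that this sketch stops at the first occurrence of either marker and leaves the end-of-line bytes before it in the output. How the real method treats those details is not covered here.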

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (PDFBOX-1175) Stream parsing performance improvement + patch

Posted by "Timo Boehme (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-1175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Timo Boehme updated PDFBOX-1175:
--------------------------------

    Attachment: BaseParser_readUntilEndStream.java

The optimized method (BaseParser#readUntilEndStream) for copying stream data from the file to the random access buffer.
                


        

[jira] [Updated] (PDFBOX-1175) Stream parsing performance improvement + patch

Posted by "Andreas Lehmkühler (Updated JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-1175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler updated PDFBOX-1175:
---------------------------------------

    Affects Version/s:     (was: 1.7.0)
                       1.6.0
    


       

[jira] [Resolved] (PDFBOX-1175) Stream parsing performance improvement + patch

Posted by "Andreas Lehmkühler (Resolved JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-1175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler resolved PDFBOX-1175.
----------------------------------------

       Resolution: Fixed
    Fix Version/s: 1.7.0
         Assignee: Andreas Lehmkühler

I added the improvements in revision 1209088 as proposed. I also added some string constants and reformatted some of the code.

Thanks for the contribution!
                
