
Posted to dev@pdfbox.apache.org by "Lars Torunski (JIRA)" <ji...@apache.org> on 2009/11/08 10:04:32 UTC

[jira] Created: (PDFBOX-556) Performance regression from 0.7.3 to 0.8.0

Performance regression from 0.7.3 to 0.8.0
------------------------------------------

                 Key: PDFBOX-556
                 URL: https://issues.apache.org/jira/browse/PDFBOX-556
             Project: PDFBox
          Issue Type: Improvement
          Components: Parsing
    Affects Versions: 0.8.0-incubator
            Reporter: Lars Torunski


After upgrading from version 0.7.3 to 0.8.0, our PDF indexing for Lucene takes a lot longer than expected.

E.g. a single PDF needs 1150 ms to be indexed, compared to 750 ms with version 0.7.3 (about +50%).

My first thought was that more PDFs are indexed, or indexed correctly, with 0.8.0. But that should not account for a slowdown of more than 50%.

Profiling with YourKit shows that a lot of time is spent in the method BaseParser.readUntilEndStream and its invocation of cmpCircularBuffer. Maybe somebody can find out how to improve the performance here.

The method readUntilEndStream now also handles endobj tags in the stream, which of course affects performance, but this is OK.
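
To illustrate why a per-byte marker check can show up so prominently in a profile, here is a minimal sketch of a scan that keeps the most recent bytes in a ring buffer and compares them against both markers after every byte read. This is not the PDFBox source: the class NaiveMarkerScan, its methods and the comparisons counter are hypothetical names made up for illustration, and the real readUntilEndStream and cmpCircularBuffer may work differently.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

/** Illustrative sketch only; hypothetical class, not the PDFBox implementation. */
final class NaiveMarkerScan {
    static final byte[] ENDSTREAM = "endstream".getBytes(StandardCharsets.US_ASCII);
    static final byte[] ENDOBJ = "endobj".getBytes(StandardCharsets.US_ASCII);

    /** Number of ring-buffer comparisons performed, to make the per-byte cost visible. */
    static int comparisons = 0;

    /** True if the most recently written bytes in the ring equal the marker. */
    static boolean endsWith(byte[] ring, int next, byte[] marker) {
        comparisons++;
        int n = ring.length;
        for (int i = 0; i < marker.length; i++) {
            // Walk backwards from the most recently written slot (next - 1).
            int ringIndex = ((next - 1 - i) % n + n) % n;
            if (ring[ringIndex] != marker[marker.length - 1 - i]) {
                return false;
            }
        }
        return true;
    }

    /** Returns the number of bytes that precede the first marker, or -1 if none is found. */
    static int scan(InputStream in) throws IOException {
        byte[] ring = new byte[ENDSTREAM.length]; // large enough for the longer marker
        int next = 0;                             // slot that receives the next byte
        int read = 0;
        int b;
        while ((b = in.read()) != -1) {
            ring[next] = (byte) b;
            next = (next + 1) % ring.length;
            read++;
            // Two comparisons per byte once enough bytes have been seen.
            if (read >= ENDSTREAM.length && endsWith(ring, next, ENDSTREAM)) {
                return read - ENDSTREAM.length;
            }
            if (read >= ENDOBJ.length && endsWith(ring, next, ENDOBJ)) {
                return read - ENDOBJ.length;
            }
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "abcendstream".getBytes(StandardCharsets.US_ASCII);
        System.out.println(scan(new ByteArrayInputStream(data)));
        System.out.println(comparisons + " comparisons for " + data.length + " bytes");
    }
}
```

In a scan of this shape the number of comparisons grows with the stream length, and checking for a second marker roughly doubles it.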



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PDFBOX-556) Performance regression from 0.7.3 to 0.8.0

Posted by "Lars Torunski (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776571#action_12776571 ] 

Lars Torunski commented on PDFBOX-556:
--------------------------------------

In general, the readUntilEndStream method accounts for 5-10% of the time during the indexing process. Using the tracing option of the CPU profiling in YourKit inflates that percentage, because the profiler adds overhead for counting the invocations of the many small methods called from it, e.g. methods that only return a single character.

The screenshot was taken during the first part of the parsing process. During this time, 70% of the time was spent in the readUntilEndStream method.

In common use cases, about 70% of the time is spent in PDFTextStripper.writeText and 15% in PDDocument.load; the latter method includes readUntilEndStream.

[jira] Commented: (PDFBOX-556) Performance regression from 0.7.3 to 0.8.0

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12777228#action_12777228 ] 

Andreas Lehmkühler commented on PDFBOX-556:
-------------------------------------------

Hmm, if in common cases 70% of the time is spent in writing the text, it seems obvious that readUntilEndStream can't be the reason for the performance impact of 50%. There were a lot of changes in the TextStripper part since 0.7.3, so I guess we have to look there for a possible performance loss.
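
The argument can be checked with the figures reported in this issue (750 ms with 0.7.3, 1150 ms with 0.8.0, about 15% in PDDocument.load and about 70% in PDFTextStripper.writeText). The sketch below assumes that these shares apply to the 1150 ms run of the same PDF, which the thread does not state explicitly; the class RegressionBudget and the method phaseMillis are hypothetical names used only for this calculation.

```java
/** Back-of-the-envelope check; hypothetical helper, not part of PDFBox. */
final class RegressionBudget {

    /** Time in ms attributable to a phase, given the total run time and the phase's share (0..1). */
    static double phaseMillis(double totalMillis, double share) {
        return totalMillis * share;
    }

    public static void main(String[] args) {
        double oldTotal = 750.0;   // reported for 0.7.3
        double newTotal = 1150.0;  // reported for 0.8.0
        double increase = newTotal - oldTotal;

        // Assumption: the profiled shares describe the 0.8.0 run of this PDF.
        double load = phaseMillis(newTotal, 0.15);      // includes readUntilEndStream
        double writeText = phaseMillis(newTotal, 0.70);

        System.out.println("increase   [ms]: " + increase);
        System.out.println("load       [ms]: " + load);
        System.out.println("writeText  [ms]: " + writeText);
        // Upper bound: even if all of PDDocument.load were new cost,
        // this is the largest fraction of the increase it could explain.
        System.out.println("max share of increase from load: " + (load / increase));
    }
}
```

Under that assumption, PDDocument.load takes roughly 170 ms in total, which is less than half of the 400 ms increase, while writeText alone takes longer than the whole 0.7.3 run.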

[jira] Commented: (PDFBOX-556) Performance regression from 0.7.3 to 0.8.0

Posted by "Lars Torunski (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12775035#action_12775035 ] 

Lars Torunski commented on PDFBOX-556:
--------------------------------------

I made the screenshot on Saturday, and the next day I created an Eclipse project with the 0.8.0 Java sources, resource files and different PDFs. Now I can't reproduce the high invocation counts of the read() and cmpCircularBuffer methods anymore.

The performance impact of more than 50% is still correct and reproducible, but the test cases don't spend 70% of the time in readUntilEndStream anymore.

In version 0.8.0 the readUntilEndStream method was changed and now also checks for the endobj tag.

Meanwhile, I thought that the string search for endobj and endstream could be improved by a simplified Boyer-Moore string search algorithm that uses only the bad character heuristic, e.g. the Sunday algorithm. Both heuristics could be calculated at class instantiation.

http://en.wikipedia.org/wiki/Boyer-Moore_string_search_algorithm
http://de.wikipedia.org/wiki/Boyer-Moore-Algorithmus
http://www.iti.fh-flensburg.de/lang/algorithmen/pattern/sunday.htm
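
A minimal sketch of the Sunday (quick search) variant mentioned above, written against a byte array. It is only an illustration of the technique and not a patch for BaseParser: the class SundaySearch and its methods are hypothetical names, and a parser that reads from an InputStream would additionally need to buffer the data, which is not shown here. The shift table is built once in the constructor, so one instance per marker (endstream, endobj) can be created up front.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Sunday (quick search) string search over bytes; hypothetical helper, not PDFBox code. */
final class SundaySearch {
    private final byte[] pattern;
    private final int[] shift = new int[256];

    SundaySearch(byte[] pattern) {
        this.pattern = pattern.clone();
        // A byte that does not occur in the pattern lets the window jump past it entirely.
        Arrays.fill(shift, pattern.length + 1);
        // Later positions overwrite earlier ones, so the rightmost occurrence wins.
        for (int i = 0; i < pattern.length; i++) {
            shift[pattern[i] & 0xFF] = pattern.length - i;
        }
    }

    /** Returns the index of the first match at or after 'from', or -1 if there is none. */
    int indexOf(byte[] text, int from) {
        int m = pattern.length;
        int pos = from;
        while (pos + m <= text.length) {
            int i = 0;
            while (i < m && text[pos + i] == pattern[i]) {
                i++;
            }
            if (i == m) {
                return pos;
            }
            if (pos + m >= text.length) {
                break; // no byte after the window to base the shift on
            }
            // Shift according to the byte just behind the current window.
            pos += shift[text[pos + m] & 0xFF];
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] text = "stream data endstream endobj".getBytes(StandardCharsets.US_ASCII);
        SundaySearch endstream = new SundaySearch("endstream".getBytes(StandardCharsets.US_ASCII));
        SundaySearch endobj = new SundaySearch("endobj".getBytes(StandardCharsets.US_ASCII));
        System.out.println(endstream.indexOf(text, 0));
        System.out.println(endobj.indexOf(text, 0));
    }
}
```

Because the markers are short, the possible gain is bounded by the pattern length: in the best case the window advances by 10 bytes per step for endstream and by 7 for endobj.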

[jira] Commented: (PDFBOX-556) Performance regression from 0.7.3 to 0.8.0

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774831#action_12774831 ] 

Andreas Lehmkühler commented on PDFBOX-556:
-------------------------------------------

Hmm, there are a lot of changes between these versions.

Did you compare your 0.8.0 profiling results to the old 0.7.3 version? The invocation of cmpCircularBuffer is probably a good hint, but perhaps it was already a bottleneck in the old version.
Did you try to disable or minimize the logging? Too much log output is often a performance issue.
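
One way to rule out logging before profiling is to switch it off for the whole library. The sketch below assumes that the log output ends up in java.util.logging and that the library's loggers live under the name "org.apache.pdfbox"; both are assumptions about the setup, not something confirmed in this issue. With a different backend (e.g. log4j) the equivalent setting would go into that backend's configuration. QuietLogging is a hypothetical helper class.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical helper: silences everything below one logger name in java.util.logging. */
final class QuietLogging {
    // Keep a strong reference; java.util.logging only holds loggers weakly.
    // The logger name is an assumption about the library's package prefix.
    private static final Logger PDFBOX_LOGGER = Logger.getLogger("org.apache.pdfbox");

    /** Turns off all log output for the assumed logger name and its children. */
    static void silence() {
        PDFBOX_LOGGER.setLevel(Level.OFF);
    }

    static Logger logger() {
        return PDFBOX_LOGGER;
    }

    public static void main(String[] args) {
        silence();
        // Child loggers without their own level inherit OFF from the parent.
        Logger child = Logger.getLogger("org.apache.pdfbox.pdfparser");
        System.out.println(child.isLoggable(Level.SEVERE));
    }
}
```

Calling QuietLogging.silence() once before the timed run is enough; if the timings do not change, logging can be excluded as the cause.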

[jira] Updated: (PDFBOX-556) Performance regression from 0.7.3 to 0.8.0

Posted by "Lars Torunski (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lars Torunski updated PDFBOX-556:
---------------------------------

    Attachment: screenshot-1.jpg
