Posted to dev@pdfbox.apache.org by "Jukka Zitting (JIRA)" <ji...@apache.org> on 2010/01/18 11:43:54 UTC

[jira] Resolved: (PDFBOX-604) Various text extraction performance improvements

     [ https://issues.apache.org/jira/browse/PDFBOX-604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved PDFBOX-604.
----------------------------------

       Resolution: Fixed
    Fix Version/s: 1.0.0
         Assignee: Jukka Zitting

See revisions 899802, 899804, 899806, 899807 and 899810 for the improvements I made. This covers pretty much all of the remaining immediate simple bottlenecks I could find through profiling, so I'm resolving this issue as fixed.

The biggest higher-level performance bottleneck is the way o.a.p.util.PDFStreamEngine.processEncodedText() processes each glyph separately. We would likely see major performance improvements if we refactored it so that the entire string of encoded glyphs is first decoded in a single operation, and any graphics transformations are then applied to that whole block before the individual characters are processed. That, however, is best handled as a separate issue.
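
To illustrate the difference between per-glyph and whole-string processing described above, here is a minimal, self-contained Java sketch. It is not PDFBox code: the class BatchDecodeSketch and its methods are hypothetical names, the encoding is assumed to be single-byte, and every glyph is assumed to have the same advance width. It shows only the general shape of the idea (decode once, then transform all glyph origins in one pass), not how processEncodedText() actually works.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; none of these names exist in PDFBox.
public class BatchDecodeSketch {

    // Per-glyph style: one decode call and one temporary String per encoded byte.
    public static String decodePerGlyph(byte[] encoded, Charset charset) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < encoded.length; i++) {
            out.append(new String(encoded, i, 1, charset));
        }
        return out.toString();
    }

    // Batch style: decode the whole run of encoded glyphs in a single operation.
    public static String decodeBatch(byte[] encoded, Charset charset) {
        return new String(encoded, charset);
    }

    // Apply one affine transform {a, b, c, d, e, f} to the origins of all glyphs
    // in the run. Glyph i is assumed to start at (i * advance, 0) in text space;
    // the fixed advance is a simplification made for this sketch.
    public static float[] transformOrigins(int glyphCount, float advance, float[] m) {
        float[] xy = new float[glyphCount * 2];
        for (int i = 0; i < glyphCount; i++) {
            float x = i * advance;
            xy[2 * i] = m[0] * x + m[4];      // x' = a*x + c*y + e, with y = 0
            xy[2 * i + 1] = m[1] * x + m[5];  // y' = b*x + d*y + f, with y = 0
        }
        return xy;
    }

    public static void main(String[] args) {
        byte[] encoded = "Hello".getBytes(StandardCharsets.ISO_8859_1);
        String text = decodeBatch(encoded, StandardCharsets.ISO_8859_1);
        float[] origins = transformOrigins(text.length(), 0.5f,
                new float[] {2, 0, 0, 2, 10, 20});
        System.out.println(text + " starts at " + origins[0] + "," + origins[1]);
    }
}
```

Both decode methods return the same text; the batch variant simply avoids the per-glyph call and allocation overhead. Whether a similar restructuring pays off in PDFBox would need to be confirmed by profiling.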

> Various text extraction performance improvements
> ------------------------------------------------
>
>                 Key: PDFBOX-604
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-604
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction
>            Reporter: Jukka Zitting
>            Assignee: Jukka Zitting
>             Fix For: 1.0.0
>
>
> Even after Mel's recent patches I've found a number of small performance bottlenecks that we could get rid of.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.