Posted to dev@pdfbox.apache.org by "Tilman Hausherr (JIRA)" <ji...@apache.org> on 2016/02/03 18:06:39 UTC

[jira] [Resolved] (PDFBOX-3224) Cache Font Bounding Boxes for Performance in Text Extraction

     [ https://issues.apache.org/jira/browse/PDFBOX-3224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr resolved PDFBOX-3224.
-------------------------------------
    Resolution: Fixed
      Assignee: Tilman Hausherr

> Cache Font Bounding Boxes for Performance in Text Extraction
> ------------------------------------------------------------
>
>                 Key: PDFBOX-3224
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3224
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: PDModel
>    Affects Versions: 2.0.0
>            Reporter: Tom Callahan
>            Assignee: Tilman Hausherr
>              Labels: optimization
>             Fix For: 2.0.0
>
>         Attachments: bounding-box-caching.patch, pdfont-bounding-box-caching.patch
>
>
> Hi,
> I have been using PDFBox by way of Tika for text extraction from PDFs for a while.  I recently ran a profiler and found that getBoundingBox() in the PDXXFont.java classes is called fairly frequently, in particular from PDFTextStreamEngine.showGlyph().  I've attached a patch that caches the BoundingBox object alongside the PDFont object inside PDTextState.  There are other ways to accomplish the same thing, such as caching inside the font objects themselves.
> I wrote a little test program to measure the speed difference against a few randomly selected files.  The program just uses PDFTextStripper to retrieve raw text from a PDF.
> Here's what I found:
> ===plain===
> File: BambooCheatSheet.pdf Duration: 60037555619 rate: 81.6 files/sec
> File: flu.pdf Duration: 60019978409 rate: 34.46666666666667 files/sec
> File: megacli_user_guide.pdf Duration: 60641314800 rate: 1.1833333333333333 files/sec
> File: odbc-perl.pdf Duration: 60008216404 rate: 19.466666666666665 files/sec
> File: VerticaArchitectureWhitePaper.pdf Duration: 60084726865 rate: 7.433333333333334 files/sec
> File: WritingaResume.pdf Duration: 60015267784 rate: 59.4 files/sec
> ===boundingbox caching===
> File: BambooCheatSheet.pdf Duration: 60005724588 rate: 106.1 files/sec
> File: flu.pdf Duration: 60021410660 rate: 41.916666666666664 files/sec
> File: megacli_user_guide.pdf Duration: 60107488363 rate: 1.7833333333333334 files/sec
> File: odbc-perl.pdf Duration: 60017784515 rate: 29.9 files/sec
> File: VerticaArchitectureWhitePaper.pdf Duration: 60012261509 rate: 9.05 files/sec
> File: WritingaResume.pdf Duration: 60007995996 rate: 76.5 files/sec
> Cheers
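
The caching idea described in the report can be illustrated with a minimal, self-contained sketch. It does not reproduce the attached patches and does not use the PDFBox classes: Box, SketchFont and SketchTextState are hypothetical stand-ins for BoundingBox, PDFont and PDTextState, and the assumption is simply that computing a font's bounding box is expensive while the current font changes rarely.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-ins for illustration only; these are NOT the PDFBox classes.
final class Box {
    final float llx, lly, urx, ury;

    Box(float llx, float lly, float urx, float ury) {
        this.llx = llx;
        this.lly = lly;
        this.urx = urx;
        this.ury = ury;
    }
}

interface SketchFont {
    // Assumed to be expensive, e.g. because it parses font data on every call.
    Box computeBoundingBox();
}

// Text state that keeps the bounding box next to the current font.
final class SketchTextState {
    private SketchFont font;
    private Box cachedBox; // null until first requested for the current font

    void setFont(SketchFont newFont) {
        if (newFont != font) {
            font = newFont;
            cachedBox = null; // a different font invalidates the cached box
        }
    }

    Box getFontBoundingBox() {
        if (font == null) {
            throw new IllegalStateException("no font set");
        }
        if (cachedBox == null) {
            cachedBox = font.computeBoundingBox(); // computed once per font change
        }
        return cachedBox;
    }
}

final class BoundingBoxCacheDemo {
    public static void main(String[] args) {
        AtomicInteger computations = new AtomicInteger();
        SketchFont font = () -> {
            computations.incrementAndGet();
            return new Box(0f, -200f, 1000f, 800f);
        };

        SketchTextState state = new SketchTextState();
        state.setFont(font);
        // Simulate many glyphs being shown with the same font.
        for (int glyph = 0; glyph < 10_000; glyph++) {
            state.getFontBoundingBox();
        }
        System.out.println("bounding box computed " + computations.get() + " time(s)");
    }
}
```

Run as written, the demo prints "bounding box computed 1 time(s)" even though the box is requested 10,000 times. Keeping the cache in the text state, as here, ties its lifetime to the current font selection; caching inside the font object, the alternative mentioned in the report, would instead share the value across every use of that font.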

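The test program itself is not part of the message. The reported durations all sit just above 60,000,000,000 and every rate is a whole number divided by 60, which would be consistent with a loop that runs for a nominal 60 seconds, is timed in nanoseconds, and divides the completed count by 60. That reading is an inference, not something the report states. Under that assumption, a harness of this shape might look like the sketch below; ThroughputSketch and its members are hypothetical names, and the task is a placeholder for the real text extraction.

```java
final class ThroughputSketch {

    static final class Result {
        final long iterations;
        final long elapsedNanos;
        final double ratePerSecond;

        Result(long iterations, long elapsedNanos, double ratePerSecond) {
            this.iterations = iterations;
            this.elapsedNanos = elapsedNanos;
            this.ratePerSecond = ratePerSecond;
        }
    }

    // Runs the task repeatedly until the time budget is used up.
    // The rate is computed against the nominal budget, not the measured time,
    // which is an assumption made to match the shape of the reported numbers.
    static Result measure(Runnable task, long budgetNanos) {
        long start = System.nanoTime();
        long iterations = 0;
        long elapsed;
        do {
            task.run();
            iterations++;
            elapsed = System.nanoTime() - start;
        } while (elapsed < budgetNanos);
        double nominalSeconds = budgetNanos / 1e9;
        return new Result(iterations, elapsed, iterations / nominalSeconds);
    }

    public static void main(String[] args) {
        // Placeholder workload. In a real run this would load one PDF and
        // extract its text with PDFTextStripper.
        Runnable task = () -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 1_000; i++) {
                sb.append(i);
            }
            if (sb.length() == 0) {
                throw new IllegalStateException("unexpected empty result");
            }
        };

        long budgetNanos = 1_000_000_000L; // 1 second here; the report appears to use 60
        Result r = measure(task, budgetNanos);
        System.out.println("Duration: " + r.elapsedNanos
                + " rate: " + r.ratePerSecond + " runs/sec");
    }
}
```

Because the loop only checks the clock after each run, the measured duration always overshoots the budget by up to one run, which would explain why the slowest file in the report shows the largest overshoot.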


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
