Posted to dev@pdfbox.apache.org by "Tilman Hausherr (JIRA)" <ji...@apache.org> on 2016/02/02 23:00:40 UTC
[jira] [Commented] (PDFBOX-3224) Cache Font Bounding Boxes for Performance in Text Extraction

    [ https://issues.apache.org/jira/browse/PDFBOX-3224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15129122#comment-15129122 ] 

Tilman Hausherr commented on PDFBOX-3224:
-----------------------------------------

I agree that the code behind {{getBoundingBox()}} has potential for optimization. However...
{quote}
There are a variety of other ways to accomplish the same thing – caching inside of the various font objects themselves, etc.
{quote}
I'd go with that, because {{TextState}} isn't the right place for a cache: this object mirrors the PDF specification. I'd favor caching in the font objects, as is already done in PDType3Font. Coincidentally, I did that one, because the calculations were getting too expensive.
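
To make the "cache in the font object" idea concrete, here is a minimal sketch. It is not the PDFBox code: {{Box}} and {{CachingFont}} are hypothetical stand-ins, and the expensive calculation is passed in as a {{Supplier}} purely for illustration. The sketch also assumes single-threaded use.

```java
import java.util.function.Supplier;

/** Hypothetical stand-in for a font bounding box; not the PDFBox class. */
final class Box {
    final float llx, lly, urx, ury;

    Box(float llx, float lly, float urx, float ury) {
        this.llx = llx;
        this.lly = lly;
        this.urx = urx;
        this.ury = ury;
    }
}

/** Hypothetical font-like object that computes its bounding box once and then reuses it. */
class CachingFont {
    private final Supplier<Box> generator; // the expensive calculation (assumption)
    private Box cachedBox;                 // null until the first request

    CachingFont(Supplier<Box> generator) {
        this.generator = generator;
    }

    /** Returns the cached box, computing it only on the first call. Not thread-safe. */
    Box getBoundingBox() {
        if (cachedBox == null) {
            cachedBox = generator.get();
        }
        return cachedBox;
    }
}

class CachingFontDemo {
    public static void main(String[] args) {
        int[] computations = {0};
        CachingFont font = new CachingFont(() -> {
            computations[0]++; // count how often the expensive path runs
            return new Box(0f, -200f, 1000f, 800f);
        });
        // Simulate one lookup per glyph, as text extraction would do.
        for (int i = 0; i < 10_000; i++) {
            font.getBoundingBox();
        }
        System.out.println("expensive computations: " + computations[0]); // prints 1
    }
}
```

With the cache inside the font, callers need no changes and the text state stays a plain mirror of the specification.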

> Cache Font Bounding Boxes for Performance in Text Extraction
> ------------------------------------------------------------
>
>                 Key: PDFBOX-3224
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3224
>             Project: PDFBox
>          Issue Type: Improvement
>            Reporter: Tom Callahan
>         Attachments: bounding-box-caching.patch
>
>
> Hi,
> I have been using PDFBox by way of Tika for a while for text extraction from PDFs.  I had a chance to fire up a profiler recently and found that getBoundingBox() in the PDXXFont.java classes is called fairly frequently -- in particular from PDFTextStreamEngine.showGlyph().  I've attached a patch that caches the BoundingBox object alongside the PDFont object inside of PDTextState.  There are a variety of other ways to accomplish the same thing -- caching inside of the various font objects themselves, etc.
> I wrote a little test program to measure the speed difference against a few randomly selected files.  The program just uses PDFTextStripper to retrieve raw text from a PDF.
> Here's what I found:
> ====plain====
> File: BambooCheatSheet.pdf Duration: 60037555619 rate: 81.6 files/sec
> File: flu.pdf Duration: 60019978409 rate: 34.46666666666667 files/sec
> File: megacli_user_guide.pdf Duration: 60641314800 rate: 1.1833333333333333 files/sec
> File: odbc-perl.pdf Duration: 60008216404 rate: 19.466666666666665 files/sec
> File: VerticaArchitectureWhitePaper.pdf Duration: 60084726865 rate: 7.433333333333334 files/sec
> File: WritingaResume.pdf Duration: 60015267784 rate: 59.4 files/sec
> ===boundingbox caching===
> File: BambooCheatSheet.pdf Duration: 60005724588 rate: 106.1 files/sec
> File: flu.pdf Duration: 60021410660 rate: 41.916666666666664 files/sec
> File: megacli_user_guide.pdf Duration: 60107488363 rate: 1.7833333333333334 files/sec
> File: odbc-perl.pdf Duration: 60017784515 rate: 29.9 files/sec
> File: VerticaArchitectureWhitePaper.pdf Duration: 60012261509 rate: 9.05 files/sec
> File: WritingaResume.pdf Duration: 60007995996 rate: 76.5 files/sec
> Cheers
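
For anyone who wants to repeat this kind of measurement, the following is a rough sketch of a fixed-duration throughput loop. It is not the reporter's test program: {{ThroughputTimer}} and its members are hypothetical names, and dividing by the measured elapsed time is an assumption about how the rate is computed.

```java
/** Hypothetical helper: repeats a task for a time budget and reports the rate. */
final class ThroughputTimer {

    /** Outcome of one timed run. */
    static final class Result {
        final long iterations;
        final long elapsedNanos;

        Result(long iterations, long elapsedNanos) {
            this.iterations = iterations;
            this.elapsedNanos = elapsedNanos;
        }

        /** Completed iterations per second of measured wall-clock time. */
        double ratePerSecond() {
            return iterations / (elapsedNanos / 1_000_000_000.0);
        }
    }

    /** Runs the task at least once, then keeps going until the budget is used up. */
    static Result run(Runnable task, long budgetNanos) {
        long start = System.nanoTime();
        long iterations = 0;
        long elapsed;
        do {
            task.run();
            iterations++;
            elapsed = System.nanoTime() - start;
        } while (elapsed < budgetNanos);
        return new Result(iterations, elapsed);
    }
}

class ThroughputTimerDemo {
    public static void main(String[] args) {
        // Placeholder workload; a real run would extract text from one PDF here.
        Runnable task = () -> Math.sqrt(System.nanoTime());
        ThroughputTimer.Result r = ThroughputTimer.run(task, 50_000_000L); // 50 ms budget
        System.out.println("Duration: " + r.elapsedNanos + " rate: " + r.ratePerSecond() + " runs/sec");
    }
}
```

In a real comparison the task would wrap the PDFTextStripper call for a single file, and the same loop would be run once against the unpatched build and once against the patched one.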


