Posted to dev@pdfbox.apache.org by "Michael Howard (JIRA)" <ji...@apache.org> on 2010/03/31 16:14:27 UTC

[jira] Commented: (PDFBOX-582) Ignoring text over images

    [ https://issues.apache.org/jira/browse/PDFBOX-582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851877#action_12851877 ] 

Michael Howard commented on PDFBOX-582:
---------------------------------------

Daniel is correct in that the vast majority of .pdf documents contain a mixture of images and text. 

However, I think the point Villu is making is that many, if not most, .pdf docs generated through OCR scanning are treated differently. In my experience, these docs are meant to display only the page image. The underlying OCR-generated text is there for text selection and searching, but it is not displayed.

Note that the OCR error rate is frequently quite high, but since the page image is what you view, read, and print, this is generally fine. High-error OCR text is better than nothing.

I have several .pdf docs from different scanner vendors. They display correctly in Acrobat Reader, Mac OS X Preview, and the Linux/GNOME Evince viewer; that is, only the image is rendered for display and printing.

PDFBox 1.0.0 displays these documents incorrectly: the OCR text is rendered on top of the page image, so the rendered font characters overlay the scanned character images. Because of alignment and OCR errors, the documents become unreadable in PDFBox.

I don't know much about the .pdf format, but I assume that there must be some indicator in the format which says that these font strings are not to be rendered.
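
One candidate for such an indicator is the text rendering mode defined by the PDF specification and set with the Tr operator. Mode 3 means "neither fill nor stroke", i.e. invisible text. Whether the attached file actually uses mode 3 has not been checked here, so treat that as an assumption. The sketch below is plain Java, not PDFBox API; the names TextRenderingMode and paintsGlyphs are made up for illustration.

```java
/** Text rendering modes 0-7 from the PDF specification (Tr operator). */
enum TextRenderingMode {
    FILL(0), STROKE(1), FILL_STROKE(2), INVISIBLE(3),
    FILL_CLIP(4), STROKE_CLIP(5), FILL_STROKE_CLIP(6), CLIP(7);

    private final int code;

    TextRenderingMode(int code) {
        this.code = code;
    }

    /** Maps the numeric operand of Tr to a mode. */
    static TextRenderingMode fromCode(int code) {
        for (TextRenderingMode m : values()) {
            if (m.code == code) {
                return m;
            }
        }
        throw new IllegalArgumentException("Unknown text rendering mode: " + code);
    }

    /** True if glyphs leave visible marks on the page (fill and/or stroke). */
    boolean paintsGlyphs() {
        // Mode 3 paints nothing; mode 7 only adds the glyphs to the clipping path.
        return this != INVISIBLE && this != CLIP;
    }
}
```

A renderer could consult something like paintsGlyphs() before drawing a string, while a text extractor would ignore the mode entirely.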

> Ignoring text over images
> -------------------------
>
>                 Key: PDFBOX-582
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-582
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction, Utilities
>    Affects Versions: 0.8.0-incubator
>            Reporter: Villu Ruusmann
>         Attachments: pg_0005.pdf, pg_0005.png
>
>
> Scientific publishers often publish older articles (year 2000 and earlier) in scanned form. However, sometimes they seem to have conducted OCR, and added the recovered text as an overlay in order to give the end user a "native PDF" feeling in the sense that it is possible to copy and paste text.
> PDFBox differs from other PDF viewers (tested with Adobe Acrobat Reader 7.0, Foxit Reader 3.1, iText 2.1) in that it tries to render both the image part and the textual overlay part, which may produce confusing results.
> Actually, there are two separate cases:
> *) Page rendering (class org.apache.pdfbox.pdfviewer.PageDrawer): Render the image part and ignore the text part.
> *) Text extraction (class org.apache.pdfbox.util.PDFTextStripper): Ignore the image part and work upon the text part.
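
The two cases in the issue description can be sketched with a toy content-stream scanner. This is a rough illustration only: it assumes the OCR overlay is marked with text rendering mode 3 (Tr operator), it handles only simple (string) Tj operands without escapes or nesting, it ignores q/Q graphics-state save and restore, and it does not model images at all. The class OverlayTextSketch and its method shownText are hypothetical and are not part of PDFBox.

```java
import java.util.ArrayList;
import java.util.List;

final class OverlayTextSketch {

    /**
     * Returns the strings shown by Tj operators in a simplified content stream.
     * With forRendering=true, strings shown in a non-painting mode (3 or 7) are
     * dropped; with forRendering=false (text extraction) every string is kept.
     */
    static List<String> shownText(String content, boolean forRendering) {
        List<String> out = new ArrayList<>();
        int mode = 0;               // the initial text rendering mode is 0 (fill)
        String lastOperand = null;  // most recent operand or string seen
        int i = 0;
        while (i < content.length()) {
            char c = content.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
                continue;
            }
            if (c == '(') {
                // Simplification: no escaped or nested parentheses.
                int end = content.indexOf(')', i);
                if (end < 0) {
                    throw new IllegalArgumentException("Unterminated string");
                }
                lastOperand = content.substring(i + 1, end);
                i = end + 1;
                continue;
            }
            int start = i;
            while (i < content.length()
                    && !Character.isWhitespace(content.charAt(i))
                    && content.charAt(i) != '(') {
                i++;
            }
            String token = content.substring(start, i);
            if (token.equals("Tr")) {
                mode = Integer.parseInt(lastOperand);
            } else if (token.equals("Tj")) {
                boolean visible = mode != 3 && mode != 7;
                if (!forRendering || visible) {
                    out.add(lastOperand);
                }
            } else {
                lastOperand = token;  // operands and operators we do not interpret
            }
        }
        return out;
    }
}
```

For a stream such as "BT /F1 12 Tf 0 Tr (Figure 1) Tj 3 Tr (OCR layer) Tj ET", the rendering pass would keep only the first string, while the extraction pass would keep both.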

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.