Posted to dev@pdfbox.apache.org by "Maruan Sahyoun (JIRA)" <ji...@apache.org> on 2010/03/31 22:23:27 UTC

[jira] Updated: (PDFBOX-582) Ignoring text over images

     [ https://issues.apache.org/jira/browse/PDFBOX-582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maruan Sahyoun updated PDFBOX-582:
----------------------------------

    Attachment: PageDrawer.patch

The patch adds a basic implementation of PDTextState.RENDERING_MODE_NEITHER_FILL_NOR_STROKE_TEXT to support PDFs in which text is included invisibly, for example as part of an OCR result.

A more generic approach still needs to be implemented in order to fully support the other text rendering modes.
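
For illustration only, here is a minimal sketch of what a more generic treatment of the rendering modes might look like. It does not show the attached patch or any PDFBox API: the enum TextRenderingMode, its fields, and the method paintsGlyphs() are hypothetical names invented for this example. The only thing taken as given is the PDF specification's numbering of the eight text rendering modes (0 to 7), where mode 3 is "neither fill nor stroke".

```java
/**
 * Hypothetical model of the eight PDF text rendering modes (codes 0-7).
 * Not part of PDFBox; names are made up for this sketch.
 */
enum TextRenderingMode {
    FILL(0, true, false, false),
    STROKE(1, false, true, false),
    FILL_THEN_STROKE(2, true, true, false),
    NEITHER_FILL_NOR_STROKE(3, false, false, false), // invisible text, e.g. an OCR overlay
    FILL_AND_CLIP(4, true, false, true),
    STROKE_AND_CLIP(5, false, true, true),
    FILL_THEN_STROKE_AND_CLIP(6, true, true, true),
    CLIP(7, false, false, true);

    final int code;
    final boolean fill;
    final boolean stroke;
    final boolean clip;

    TextRenderingMode(int code, boolean fill, boolean stroke, boolean clip) {
        this.code = code;
        this.fill = fill;
        this.stroke = stroke;
        this.clip = clip;
    }

    /** Looks up a mode by its numeric code; rejects anything outside 0-7. */
    static TextRenderingMode fromCode(int code) {
        for (TextRenderingMode mode : values()) {
            if (mode.code == code) {
                return mode;
            }
        }
        throw new IllegalArgumentException("Unknown text rendering mode: " + code);
    }

    /** True if a page renderer would put any ink on the page for this mode. */
    boolean paintsGlyphs() {
        return fill || stroke;
    }
}

class RenderingModeDemo {
    public static void main(String[] args) {
        // Print each mode and whether a renderer would need to draw its glyphs.
        for (TextRenderingMode mode : TextRenderingMode.values()) {
            System.out.println(mode.code + " " + mode + " paintsGlyphs=" + mode.paintsGlyphs());
        }
    }
}
```

With a table like this, a renderer could skip glyph drawing whenever paintsGlyphs() is false instead of special-casing mode 3, and could handle the clipping variants separately.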

> Ignoring text over images
> -------------------------
>
>                 Key: PDFBOX-582
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-582
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction, Utilities
>    Affects Versions: 0.8.0-incubator
>            Reporter: Villu Ruusmann
>         Attachments: PageDrawer.patch, pg_0005.pdf, pg_0005.png
>
>
> Scientific publishers often publish older articles (year 2000 and earlier) in scanned form. However, sometimes they seem to have run OCR and added the recovered text as an overlay, in order to give the end user a "native PDF" feeling in the sense that it is possible to copy and paste text.
> PDFBox differs from other PDF viewers (tested with Adobe Acrobat Reader 7.0, Foxit Reader 3.1, iText 2.1) in that it tries to render both the image part and the textual overlay part, which may produce confusing results.
> Actually, there are two separate cases:
> *) Page rendering (class org.apache.pdfbox.pdfviewer.PageDrawer): Render the image part and ignore the text part.
> *) Text extraction (class org.apache.pdfbox.util.PDFTextStripper): Ignore the image part and work on the text part.
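
The two cases in the quoted description can be sketched as two independent passes over the same page content. This is a toy model, not PDFBox code: OverlayPageSketch, Item, render() and extractText() are hypothetical names, and a real page is a content stream rather than a list of items. The sketch also assumes that the overlay text is marked invisible, which is the situation the patch targets.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a scanned page with an OCR text overlay (hypothetical, not PDFBox). */
class OverlayPageSketch {

    /** One piece of page content: either an image or a run of text. */
    static final class Item {
        final boolean isImage;
        final String text;       // null for images
        final boolean invisible; // true for text drawn with "neither fill nor stroke"

        private Item(boolean isImage, String text, boolean invisible) {
            this.isImage = isImage;
            this.text = text;
            this.invisible = invisible;
        }

        static Item image() {
            return new Item(true, null, false);
        }

        static Item text(String text, boolean invisible) {
            return new Item(false, text, invisible);
        }
    }

    /** Page rendering: draw images and visible text, skip invisible text. */
    static List<String> render(List<Item> page) {
        List<String> drawn = new ArrayList<>();
        for (Item item : page) {
            if (item.isImage) {
                drawn.add("image");
            } else if (!item.invisible) {
                drawn.add("text:" + item.text);
            }
        }
        return drawn;
    }

    /** Text extraction: ignore images, keep all text whether visible or not. */
    static String extractText(List<Item> page) {
        StringBuilder sb = new StringBuilder();
        for (Item item : page) {
            if (item.isImage) {
                continue;
            }
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(item.text);
        }
        return sb.toString();
    }
}
```

The point of keeping the two passes separate is that the visibility flag only matters to the renderer; the extractor deliberately ignores it so that the OCR text stays available for copy and paste.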

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.