
Posted to dev@pdfbox.apache.org by "Villu Ruusmann (JIRA)" <ji...@apache.org> on 2009/12/09 12:54:18 UTC

[jira] Created: (PDFBOX-582) Ignoring text over images

Ignoring text over images
-------------------------

                 Key: PDFBOX-582
                 URL: https://issues.apache.org/jira/browse/PDFBOX-582
             Project: PDFBox
          Issue Type: Improvement
          Components: Text extraction, Utilities
    Affects Versions: 0.8.0-incubator
            Reporter: Villu Ruusmann


Scientific publishers often publish older articles (year 2000 and earlier) in scanned form. Sometimes, however, they appear to have run OCR on the scans and added the recovered text as an overlay, to give the end user a "native PDF" feeling in the sense that text can be copied and pasted.

PDFBox differs from other PDF viewers (tested with Adobe Acrobat Reader 7.0, Foxit Reader 3.1, iText 2.1) in that it tries to render both the image part and the textual overlay part, which may produce confusing results.

Actually, there are two separate cases:
*) Page rendering (class org.apache.pdfbox.pdfviewer.PageDrawer): Render the image part and ignore the text part.
*) Text extraction (class org.apache.pdfbox.util.PDFTextStripper): Ignore the image part and work upon the text part.
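
As a rough illustration of the two cases above, here is a minimal sketch in plain Java. The types Item, Kind and Mode are hypothetical and exist only for this example; they are not PDFBox classes, and this is not how PageDrawer or PDFTextStripper are implemented.

```java
import java.util.ArrayList;
import java.util.List;

public class OverlayFilter {
    /** Hypothetical content kinds found on a scanned page with an OCR overlay. */
    public enum Kind { IMAGE, TEXT }

    /** The two cases from the issue description. */
    public enum Mode { RENDER, EXTRACT_TEXT }

    /** Hypothetical page content item; a stand-in, not a PDFBox type. */
    public static final class Item {
        public final Kind kind;
        public final String label;

        public Item(Kind kind, String label) {
            this.kind = kind;
            this.label = label;
        }
    }

    /**
     * Rendering keeps only the image part; text extraction keeps only
     * the text part. The input list is left unmodified.
     */
    public static List<Item> select(List<Item> items, Mode mode) {
        Kind wanted = (mode == Mode.RENDER) ? Kind.IMAGE : Kind.TEXT;
        List<Item> kept = new ArrayList<>();
        for (Item item : items) {
            if (item.kind == wanted) {
                kept.add(item);
            }
        }
        return kept;
    }
}
```

Calling select(items, Mode.RENDER) on a page holding one scan image and several OCR text lines would return only the image; Mode.EXTRACT_TEXT would return only the text lines.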

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PDFBOX-582) Ignoring text over images

Posted by "Daniel Wilson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788075#action_12788075 ] 

Daniel Wilson commented on PDFBOX-582:
--------------------------------------

If ignoring text when rendering is added, it should be an OPTION, and NOT the default.

Plenty of documents such as electronic brochures lay text over images and all of the content IS to be seen.
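
To make the "option, not default" point concrete, here is a minimal sketch in plain Java. RenderOptions and its methods are hypothetical names chosen for this example, not an existing PDFBox API.

```java
public class RenderOptions {
    // Off by default, so brochures with text laid over images render fully.
    private boolean ignoreTextOverImages = false;

    public boolean isIgnoreTextOverImages() {
        return ignoreTextOverImages;
    }

    /** Opt in to suppressing overlay text; returns this for chaining. */
    public RenderOptions setIgnoreTextOverImages(boolean value) {
        this.ignoreTextOverImages = value;
        return this;
    }

    /**
     * Text is skipped only when the caller has opted in AND the text
     * lies over an image; in every other case it is drawn.
     */
    public static boolean shouldDrawText(RenderOptions options, boolean textLiesOverImage) {
        return !(options.isIgnoreTextOverImages() && textLiesOverImage);
    }
}
```

With a fresh RenderOptions, shouldDrawText is true for all text, which is the behaviour a brochure needs. Only a caller who explicitly sets the flag gets the scanned-article behaviour.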

> Ignoring text over images
> -------------------------
>
>                 Key: PDFBOX-582
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-582
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction, Utilities
>    Affects Versions: 0.8.0-incubator
>            Reporter: Villu Ruusmann
>         Attachments: pg_0005.pdf
>
>

[jira] Commented: (PDFBOX-582) Ignoring text over images

Posted by "Adam Nichols (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788205#action_12788205 ] 

Adam Nichols commented on PDFBOX-582:
-------------------------------------

+1 for Daniel's comment


[jira] Updated: (PDFBOX-582) Ignoring text over images

Posted by "Villu Ruusmann (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Villu Ruusmann updated PDFBOX-582:
----------------------------------

    Attachment: pg_0005.png

Sure, there are many things to bear in mind.

In my case, the underlying image is located at coordinates [0:0] and is stretched over the whole page. When rendering this document with PDFBox 0.8.0-incubating, the image is not displayed (something wrong with CCITTFaxDecode?), but the text is, although it should not be.
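
One possible way to recognise this layout is to test whether the image's bounding box covers (nearly) the whole page. The sketch below is plain Java under stated assumptions: the class name, the {x, y, width, height} array convention and the coverage threshold are all hypothetical, and nothing here is taken from PDFBox.

```java
public class FullPageImageCheck {
    /**
     * Area of the overlap between two axis-aligned boxes, each given as
     * {x, y, width, height}. Returns 0 when the boxes do not overlap.
     */
    public static double overlapArea(double[] a, double[] b) {
        double width = Math.min(a[0] + a[2], b[0] + b[2]) - Math.max(a[0], b[0]);
        double height = Math.min(a[1] + a[3], b[1] + b[3]) - Math.max(a[1], b[1]);
        if (width <= 0 || height <= 0) {
            return 0.0;
        }
        return width * height;
    }

    /**
     * True when the image covers at least minCoverage (0..1) of the page.
     * minCoverage is an assumed tuning parameter, e.g. 0.95.
     */
    public static boolean coversPage(double[] image, double[] page, double minCoverage) {
        double pageArea = page[2] * page[3];
        if (pageArea <= 0) {
            return false;
        }
        return overlapArea(image, page) / pageArea >= minCoverage;
    }
}
```

For a 612 x 792 page (example values), an image at [0:0] with the same size passes the check, while a small figure embedded in the page does not. A heuristic like this cannot tell an OCR overlay from a brochure with a full-page background, which is why it would need to stay optional.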

Note that the OCRed text is full of mistakes (especially the table section); if the underlying image were visible as well, the result would look awkward.


[jira] Issue Comment Edited: (PDFBOX-582) Ignoring text over images

Posted by "Villu Ruusmann (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788049#action_12788049 ] 

Villu Ruusmann edited comment on PDFBOX-582 at 12/9/09 3:27 PM:
----------------------------------------------------------------

A typical "text over image" PDF document.

  