Posted to users@pdfbox.apache.org by Michael Howard <mi...@uforlife.com> on 2010/03/27 20:10:53 UTC

text behind scanned images should not render

Summary:

Scanned PDF documents with OCR text layer do not render the same with
pdfbox as with other pdf viewers.

Detail:

I am a pdfbox and pdf newbie working with a large set of .pdf files
from scanned documents. The documents are basically pages from books
of photos with captions.

The scanner software is running OCR on the captions and storing the
text in a layer behind the scanned pages.

When I view these .pdf files with OS X Preview or Win32 Acrobat Reader
I only see the scanned image.

When I render these .pdf files with pdfbox PDFReader or PDFToImage, the
text layer is rendered on top of the page image. Not surprisingly, in
most cases the text is staggered.

This looks like a bug to me.

For my application, I think I can work around this using
ExtractImages, ExtractTextByArea, etc.

I know nothing about the PDF format. I am wondering ...

Q: Is there a tag in the PDF format that indicates that the OCR text
layer should not be rendered?


Michael

Re: text behind scanned images should not render

Posted by Villu Ruusmann <vi...@gmail.com>.
Hello there,

On Sat, Mar 27, 2010 at 9:10 PM, Michael Howard <mi...@uforlife.com> wrote:
> Summary:
>
> Scanned PDF documents with OCR text layer do not render the same with
> pdfbox as with other pdf viewers.
>

I've reported a similar issue as PDFBOX-582:
https://issues.apache.org/jira/browse/PDFBOX-582


VR

Re: text behind scanned images should not render

Posted by Kenneth D Weinert <ke...@quarter-flash.com>.

Michael Howard wrote:

> Q: Is there a tag in the PDF format that indicates that the OCR text
> layer should not be rendered?

This can be done in one of two ways. If the text is under the image, it
won't normally be shown.

There's also a text rendering mode that draws the text invisibly. I know
the company I work for does it that way so that the text can still be
selected.
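
As a rough illustration of the second approach: in the PDF format the
text rendering mode is set with the Tr operator, and mode 3 means the
glyphs are neither filled nor stroked, so they are invisible but still
selectable. The sketch below is not pdfbox code. It is a plain-Java
check, with hypothetical class and method names, that scans an
already-decoded content stream for "3 Tr". It assumes tokens are
separated by whitespace and that no string literal contains " Tr"; a
real parser has to handle strings, comments and graphics state properly.

```java
import java.util.ArrayList;
import java.util.List;

public class InvisibleTextCheck {

    /** PDF text rendering mode 3: neither fill nor stroke (invisible). */
    static final int INVISIBLE = 3;

    /**
     * Collects the operand of every "Tr" (set text rendering mode)
     * operator in a decoded content stream.
     * Assumption: naive whitespace tokenization is good enough here.
     */
    static List<Integer> renderModes(String contentStream) {
        List<Integer> modes = new ArrayList<>();
        String[] tokens = contentStream.trim().split("\\s+");
        for (int i = 1; i < tokens.length; i++) {
            if (tokens[i].equals("Tr")) {
                try {
                    modes.add(Integer.parseInt(tokens[i - 1]));
                } catch (NumberFormatException e) {
                    // Operand was not a plain integer; skip it.
                }
            }
        }
        return modes;
    }

    /** True if the stream switches to the invisible rendering mode. */
    static boolean hasInvisibleText(String contentStream) {
        return renderModes(contentStream).contains(INVISIBLE);
    }

    public static void main(String[] args) {
        // Made-up content stream fragments, for illustration only.
        String ocrLayer = "BT /F1 12 Tf 3 Tr 72 700 Td (caption text) Tj ET";
        String normal = "BT /F1 12 Tf 72 700 Td (visible text) Tj ET";
        System.out.println(hasInvisibleText(ocrLayer)); // true
        System.out.println(hasInvisibleText(normal));   // false
    }
}
```

If a scanned page's text layer does not contain "3 Tr", the scanner
software is probably relying on the first approach instead, painting the
text first and the image over it.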

Ken