Posted to users@pdfbox.apache.org by Michael Howard <mi...@uforlife.com> on 2010/03/27 20:10:53 UTC
text behind scanned images should not render
Summary:
Scanned PDF documents with OCR text layer do not render the same with
pdfbox as with other pdf viewers.
Detail:
I am a pdfbox and pdf newbie working with a large set of .pdf files
from scanned documents. The documents are basically pages from books
of photos with captions.
The scanner software is running OCR on the captions and storing the
text in a layer behind the scanned pages.
When I view these .pdf files with OS X Preview or Win32 Acrobat Reader
I only see the scanned image.
When I render these .pdf files with pdfbox PDFReader or PDFToImage, the
text layer is rendered on top of the page image. Not surprisingly, in
most cases the text is staggered relative to the image.
This looks like a bug to me.
For my application, I think I can work around this using
ExtractImages, ExtractTextByArea, etc.
I know nothing about PDF format. I am wondering ...
Q: Is there a tag in the PDF format that indicates that the OCR text
layer should not be rendered?
Michael
Re: text behind scanned images should not render
Posted by Villu Ruusmann <vi...@gmail.com>.
Hello there,
On Sat, Mar 27, 2010 at 9:10 PM, Michael Howard <mi...@uforlife.com> wrote:
> Summary:
>
> Scanned PDF documents with OCR text layer do not render the same with
> pdfbox as with other pdf viewers.
>
I've reported a similar issue as PDFBOX-582:
https://issues.apache.org/jira/browse/PDFBOX-582
VR
Re: text behind scanned images should not render
Posted by Kenneth D Weinert <ke...@quarter-flash.com>.
Michael Howard wrote:
> Q: Is there a tag in the PDF format that indicates that the OCR text
> layer should not be rendered?
This is usually done in one of two ways. If the text is drawn under the
image, the image covers it and it won't normally be shown.
The text can also be drawn in an invisible text rendering mode. I know
the company I work for does it that way so that the text can still be
selected.
Ken
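To make the second approach concrete: in the PDF format the text rendering mode is set with the Tr operator, and mode 3 means the text is neither filled nor stroked, so it is invisible but still selectable. The sketch below is only an illustration and does not use pdfbox. It assumes you already have a page's decoded content stream as a String, and it splits on whitespace, so it would be fooled by a string literal that happens to contain " Tr". The class and method names (InvisibleTextScan, renderingModes, hasInvisibleText) are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of pdfbox: looks for "n Tr" in a decoded
// content stream and reports whether invisible text (mode 3) is used.
class InvisibleTextScan {

    // Collects the integer operand of every Tr operator, in stream order.
    static List<Integer> renderingModes(String contentStream) {
        List<Integer> modes = new ArrayList<>();
        // Simplification: real content streams need a proper tokenizer
        // (string literals, comments, inline images). Whitespace splitting
        // is enough for a quick check on simple streams.
        String[] tokens = contentStream.trim().split("\\s+");
        for (int i = 1; i < tokens.length; i++) {
            if (tokens[i].equals("Tr")) {
                try {
                    modes.add(Integer.parseInt(tokens[i - 1]));
                } catch (NumberFormatException e) {
                    // Operand is not a plain integer; ignore this occurrence.
                }
            }
        }
        return modes;
    }

    // True if any Tr operator selects mode 3 (neither fill nor stroke).
    static boolean hasInvisibleText(String contentStream) {
        return renderingModes(contentStream).contains(3);
    }

    public static void main(String[] args) {
        // Made-up content stream fragments for illustration only.
        String ocrLayer = "BT /F1 12 Tf 3 Tr 72 700 Td (caption text) Tj ET";
        String visible = "BT /F1 12 Tf 0 Tr 72 700 Td (caption text) Tj ET";
        System.out.println(hasInvisibleText(ocrLayer));
        System.out.println(hasInvisibleText(visible));
    }
}
```

If a scanned file's text layer turns out to use mode 3, a renderer has to honor that mode for the page to look the same as it does in other viewers. If the text is instead drawn in a visible mode before the image, what matters is the painting order.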