Posted to dev@pdfbox.apache.org by Hai Nguyen FUB <ha...@fu-berlin.de> on 2013/07/01 17:06:02 UTC

handwritten PDF file to Image not working

Dear Pdfbox-developers,

My name is Hai and I am a Java developer at the Freie Universität Berlin.

I am currently working on a project that converts PDF files to images. I have
looked around and found the PDFBox library to be a good PDF handling tool.

After working with this tool for a while, I got stuck on a problem: whenever I
try to convert a handwritten PDF file, that is, a handwritten document that was
scanned and exported to PDF (I do not have the original image files), I get the
following error:

16:54:08,965 ERROR [FlateFilter] FlateFilter: stop reading corrupt stream due to a DataFormatException


Could you give me a hint on how to solve it?

My code snippet is as follows:

PDDocument document = PDDocument.load(new File("src/test/resources/pdf/249scan.pdf"));

@SuppressWarnings("unchecked")
List<PDPage> pages = document.getDocumentCatalog().getAllPages();

PDPage page = pages.get(0);
BufferedImage bi = page.convertToImage();
ImageIO.write(bi, "png", new File("src/test/resources/pdf/test.png"));



Thank you in advance & Best regards

--
Hai Nguyen

Freie Universität Berlin
FB Mathematik u. Informatik
AG Intelligente Systeme und Robotik <http://inf.fu-berlin.de/groups/ag-ki/index.html>
Arnimallee 7, Raum 111
D-14195 Berlin

Tel-1: +49 / 30 838 75114 ( Arnimallee - FUB)
Tel-2: +49 / 30 838 75148 (Takustr - FUB)
Tel-3: +49 / 30 2093 6381 (Office upstairs - HUB)
Tel-4: +49 / 30 2093 6393 (Lab downstairs - HUB)
Fax:    +49 / 30 838 75059
__________________________

Re: handwritten PDF file to Image not working

Posted by Hai Nguyen FUB <ha...@fu-berlin.de>.

Dear Andreas,

Thanks for the fast reply!

--hai

On Tue, Jul 2, 2013 at 1:34 PM, Andreas Lehmkuehler <an...@lehmi.de> wrote:

> SNIP

Re: handwritten PDF file to Image not working

Posted by Andreas Lehmkuehler <an...@lehmi.de>.

Hi,

On 02.07.2013 11:54, Hai Nguyen FUB wrote:
> Dear Andreas,
>
> I have another question. For some documents, when converting them into
> images, I receive warnings like the following:
>
> <snapshot>
> ...
> 11:42:14,895 WARN  [PDSimpleFont] Changing font on <e> from <Arial> to the default font
> 11:42:14,900 WARN  [PDSimpleFont] Changing font on <u> from <Arial> to the default font
> 11:42:14,901 WARN  [PDSimpleFont] Changing font on <t> from <Arial> to the default font
> 11:42:14,901 WARN  [PDSimpleFont] Changing font on <s> from <Arial> to the default font
> 11:42:14,902 WARN  [PDSimpleFont] Changing font on <c> from <Arial> to the default font
> 11:42:14,903 WARN  [PDSimpleFont] Changing font on <h> from <Arial> to the default font
> 11:42:14,903 WARN  [PDSimpleFont] Changing font on <e> from <Arial> to the default font
> 11:42:14,906 WARN  [PDSimpleFont] Changing font on <o> from <Arial> to the default font
> 11:42:14,907 WARN  [PDSimpleFont] Changing font on <r> from <Arial> to the default font
> ...
> </snapshot>
>
> I guess those warnings could be deactivated in the logging.property file.
> The images were still created, but they display wrong characters; please
> see the comparison in the attached image file.
>
> How can I solve this? I have looked around in the documentation and googled
> a lot, but could not find any solution.
This is known behaviour of PDFBox. Because the embedded font doesn't work for
some reason, an alternative font is used. In some cases that works, but in most
cases it doesn't. There is no solution yet. Most likely the issue is related to
PDFBOX-490 [1].
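
On the side question of hiding the warnings: the following is a minimal sketch in plain Java, under the assumption that the PDFBox log output is routed to java.util.logging. The timestamp format in the thread suggests a different backend may be in use, in which case the equivalent setting belongs in that backend's configuration. The logger name is also an assumption, derived from the PDSimpleFont category shown in the log. This only hides the messages; it does not change which font is used for rendering.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Illustrative only: assumes the PDFBox log output ends up in java.util.logging.
class QuietFontWarnings {

    // Assumed logger name: the PDSimpleFont class in PDFBox's font package.
    static final String FONT_LOGGER = "org.apache.pdfbox.pdmodel.font.PDSimpleFont";

    // Keep a strong reference: java.util.logging holds loggers only weakly,
    // so a configured logger could otherwise be garbage collected.
    static final Logger fontLogger = Logger.getLogger(FONT_LOGGER);

    // Raise the threshold so WARNING messages from this logger are dropped.
    static void silenceFontWarnings() {
        fontLogger.setLevel(Level.SEVERE);
    }

    public static void main(String[] args) {
        silenceFontWarnings();
        System.out.println("WARNING loggable: " + fontLogger.isLoggable(Level.WARNING));
        System.out.println("SEVERE loggable:  " + fontLogger.isLoggable(Level.SEVERE));
    }
}
```

Calling silenceFontWarnings() once before the conversion would be enough under these assumptions; errors from the same logger would still be reported.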

> Is there a way to skip the character parsing, since my application only
> converts the file to an image and does no OCR or the like? I have used the
> loadNonSeq() method, but still get those wrong characters in the images.
No, it is needed to render the text and it has nothing to do with the parser
itself.

> thanks in advance!
>
> --hai

BR
Andreas Lehmkühler

[1] https://issues.apache.org/jira/browse/PDFBOX-490

Re: handwritten PDF file to Image not working

Posted by Andreas Lehmkuehler <an...@lehmi.de>.

Hi,

On 02.07.2013 11:55, Hai Nguyen FUB wrote:
> Sorry, I forgot to attach the image file
FTR: the attachment didn't make it through because of a restriction on the mailing list.

> thanks,
>
> --hai
>

BR
Andreas Lehmkühler

Re: handwritten PDF file to Image not working

Posted by Hai Nguyen FUB <ha...@fu-berlin.de>.

Sorry, I forgot to attach the image file

thanks,

--hai

On Tue, Jul 2, 2013 at 11:54 AM, Hai Nguyen FUB <ha...@fu-berlin.de> wrote:

> SNIP

Re: handwritten PDF file to Image not working

Posted by Hai Nguyen FUB <ha...@fu-berlin.de>.

Dear Andreas,

I have another question. For some documents, when converting them into
images, I receive warnings like the following:

<snapshot>
...
11:42:14,895 WARN  [PDSimpleFont] Changing font on <e> from <Arial> to the default font
11:42:14,900 WARN  [PDSimpleFont] Changing font on <u> from <Arial> to the default font
11:42:14,901 WARN  [PDSimpleFont] Changing font on <t> from <Arial> to the default font
11:42:14,901 WARN  [PDSimpleFont] Changing font on <s> from <Arial> to the default font
11:42:14,902 WARN  [PDSimpleFont] Changing font on <c> from <Arial> to the default font
11:42:14,903 WARN  [PDSimpleFont] Changing font on <h> from <Arial> to the default font
11:42:14,903 WARN  [PDSimpleFont] Changing font on <e> from <Arial> to the default font
11:42:14,906 WARN  [PDSimpleFont] Changing font on <o> from <Arial> to the default font
11:42:14,907 WARN  [PDSimpleFont] Changing font on <r> from <Arial> to the default font
...
</snapshot>

I guess those warnings could be deactivated in the logging.property file.
The images were still created, but they display wrong characters; please
see the comparison in the attached image file.

How can I solve this? I have looked around in the documentation and googled a
lot, but could not find any solution.

Is there a way to skip the character parsing, since my application only
converts the file to an image and does no OCR or the like? I have used the
loadNonSeq() method, but still get those wrong characters in the images.

thanks in advance!

--hai

On Mon, Jul 1, 2013 at 6:57 PM, Hai Nguyen FUB <ha...@fu-berlin.de> wrote:

> SNIP

Re: handwritten PDF file to Image not working

Posted by Hai Nguyen FUB <ha...@fu-berlin.de>.

alright, thank you very much for the fast reply!!!

--hai

On Mon, Jul 1, 2013 at 6:52 PM, Andreas Lehmkuehler <an...@lehmi.de> wrote:

> SNIP

Re: handwritten PDF file to Image not working

Posted by Andreas Lehmkuehler <an...@lehmi.de>.

On 01.07.2013 18:30, Hai Nguyen FUB wrote:
> Hi Andreas,
>
> thank you very much, it works!!!
>
> Though I still get warning notifications like the following:
>
> 18:26:54,687 WARN  [NonSequentialPDFParser] PDF file 'src\test\resources\pdf\249scan.pdf' does not allow extracting content.
>
> Does this mean that the fonts or characters within the document are not
> extractable?
It is possible to define user access permissions for a PDF, such as

- disallow/allow printing
- disallow/allow text extraction
- disallow/allow modifying the PDF
- ....

In your case, extracting the content of the PDF as text is not allowed.
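
To make the permission list above concrete: in an encrypted PDF these permissions are stored as bits of a single integer (the /P entry of the encryption dictionary). The following is a minimal sketch in plain Java, not PDFBox code. The class and method names are made up for illustration, and only three of the defined bits are shown. PDFBox wraps the same flags in its AccessPermission class.

```java
// Illustrative only: decodes the permission flags integer of a PDF by hand.
// Class and method names are hypothetical and not part of PDFBox.
class PdfPermissionBits {

    // Bit positions are 1-based, as in the PDF specification's permission table.
    static final int PRINT = 3;
    static final int MODIFY = 4;
    static final int EXTRACT = 5;

    // True if the given 1-based bit is set in the flags value.
    static boolean isAllowed(int flags, int bit) {
        return (flags & (1 << (bit - 1))) != 0;
    }

    // Human-readable summary of the three permissions shown here.
    static String describe(int flags) {
        return "print=" + isAllowed(flags, PRINT)
                + ", modify=" + isAllowed(flags, MODIFY)
                + ", extract=" + isAllowed(flags, EXTRACT);
    }

    public static void main(String[] args) {
        int everything = -1;                                 // all bits set
        int noExtract = everything & ~(1 << (EXTRACT - 1));  // clear bit 5 only
        System.out.println(describe(everything));
        System.out.println(describe(noExtract));
    }
}
```

A document like the one discussed here would correspond to the second case: printing and modifying may be allowed while the extraction bit is cleared, which is what the warning reports.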

> thanks,
>
> --hai
>SNIP

BR
Andreas Lehmkühler

Re: handwritten PDF file to Image not working

Posted by Hai Nguyen FUB <ha...@fu-berlin.de>.

Hi Andreas,

thank you very much, it works!!!

Though I still get warning notifications like the following:

18:26:54,687 WARN  [NonSequentialPDFParser] PDF file 'src\test\resources\pdf\249scan.pdf' does not allow extracting content.

Does this mean that the fonts or characters within the document are not
extractable?

thanks,

--hai

On Mon, Jul 1, 2013 at 6:15 PM, Andreas Lehmkuehler <an...@lehmi.de> wrote:

> SNIP

Re: handwritten PDF file to Image not working

Posted by Andreas Lehmkuehler <an...@lehmi.de>.

Hi,

On 01.07.2013 17:06, Hai Nguyen FUB wrote:
> Dear Pdfbox-developers,
>
> My name is Hai and I am a Java developer at the Freie Universität Berlin.
>
> I am currently working on a project that converts PDF files to images. I have
> looked around and found the PDFBox library to be a good PDF handling tool.
>
> After working with this tool for a while, I got stuck on a problem: whenever I
> try to convert a handwritten PDF file, that is, a handwritten document that was
> scanned and exported to PDF (I do not have the original image files), I get the
> following error:
>
> 16:54:08,965 ERROR [FlateFilter] FlateFilter: stop reading corrupt stream due to a DataFormatException
>
> Could you give me a hint on how to solve it?
Without having my hands on a sample PDF I'm just guessing. Try the non-sequential
parser: use loadNonSeq() instead of load() to load the PDF.
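
Some background on the error message itself: Flate (zlib/deflate) is one common compression scheme for streams inside a PDF, and DataFormatException is the exception java.util.zip raises when such data cannot be decompressed. The sketch below uses only java.util.zip, not PDFBox, to show how a damaged deflate stream surfaces as that exception. It is an assumption that this mirrors what happens inside FlateFilter; all class and method names here are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Illustrative only: plain java.util.zip, no PDFBox classes involved.
class FlateCorruptionDemo {

    // Compress bytes with zlib/deflate, the scheme behind PDF's Flate filter.
    static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[256];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Decompress; throws DataFormatException when the stream is damaged.
    static byte[] inflate(byte[] compressed) throws DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[256];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                // No progress and more input wanted: the stream was cut short.
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new DataFormatException("truncated or incomplete stream");
                }
                out.write(buffer, 0, n);
            }
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    // Returns true if inflating the given bytes fails with DataFormatException.
    static boolean isCorrupt(byte[] compressed) {
        try {
            inflate(compressed);
            return false;
        } catch (DataFormatException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        byte[] original = "scanned page data".getBytes(StandardCharsets.UTF_8);
        byte[] good = deflate(original);
        byte[] bad = good.clone();
        bad[0] ^= 0xFF; // damage the zlib header byte
        System.out.println("intact stream corrupt?  " + isCorrupt(good));
        System.out.println("damaged stream corrupt? " + isCorrupt(bad));
    }
}
```

The same exception can also appear when the bytes are intact but the reader starts or stops at the wrong offset, which may be why switching to a different parser can help even though the file itself is not damaged.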

> My code snippet is as follows:
>
> PDDocument document = PDDocument.load(new File("src/test/resources/pdf/249scan.pdf"));
>
> @SuppressWarnings("unchecked")
> List<PDPage> pages = document.getDocumentCatalog().getAllPages();
>
> PDPage page = pages.get(0);
> BufferedImage bi = page.convertToImage();
> ImageIO.write(bi, "png", new File("src/test/resources/pdf/test.png"));

BR
Andreas Lehmkühler