
Posted to users@pdfbox.apache.org by Alexander Klenner <al...@scai.fraunhofer.de> on 2013/04/08 08:52:06 UTC

errors with PDPage.convertToImage()

Hi all,

I frequently come across PDFs where the convertToImage() method is generating blank or partly blank images. One of those PDFs is attached to this mail. 

My code for processing: 

PDFParser parser;
parser = new PDFParser(new FileInputStream(f));
parser.parse();
cosDoc = parser.getDocument();

pdDoc = new PDDocument(cosDoc);
..
Iterator<PDPage> it = pdDoc.getDocumentCatalog().getAllPages().iterator();
PDPage page = it.next();
...
PDRectangle cropBox = page.findCropBox();
Dimension dimension = cropBox.createDimension();
...
BufferedImage img = page.convertToImage(BufferedImage.TYPE_INT_RGB, ImageParser.PARAM_DPI);


I am using pdfbox-app-1.8.0.jar.

So I have two questions: 

1. Is there a different way to extract the page as an image, one I am not aware of, that produces the correct image?
2. Or is it possible to detect, before or after the extraction, that a page was not extracted correctly?

At the moment I simply have no way of knowing when I am dealing with a corrupted image.
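
On the second question, one rough way to flag suspicious pages after rendering is to measure how much of the image is not white. The following is only a sketch, not part of PDFBox: the class name BlankPageCheck, the methods inkRatio and looksBlank, and the thresholds 240 and 0.001 are assumptions chosen for illustration. It can catch fully blank pages, but a page that is only missing its text may still pass.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

// Hypothetical helper, not a PDFBox class.
class BlankPageCheck {

    /** Fraction (0.0 to 1.0) of pixels with at least one RGB channel below the threshold. */
    static double inkRatio(BufferedImage img, int threshold) {
        long inked = 0;
        for (int y = 0; y < img.getHeight(); y++) {
            for (int x = 0; x < img.getWidth(); x++) {
                int rgb = img.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                if (r < threshold || g < threshold || b < threshold) {
                    inked++;
                }
            }
        }
        long total = (long) img.getWidth() * img.getHeight();
        return (double) inked / total;
    }

    /** True if fewer than minInkRatio of the pixels are darker than near-white (240 is an assumed cutoff). */
    static boolean looksBlank(BufferedImage img, double minInkRatio) {
        return inkRatio(img, 240) < minInkRatio;
    }

    public static void main(String[] args) {
        // Synthetic stand-in for a rendered page: white background, then one dark block.
        BufferedImage page = new BufferedImage(200, 100, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = page.createGraphics();
        g.setColor(Color.WHITE);
        g.fillRect(0, 0, 200, 100);
        System.out.println("white page blank? " + looksBlank(page, 0.001));
        g.setColor(Color.BLACK);
        g.fillRect(10, 10, 50, 20);
        g.dispose();
        System.out.println("page with block blank? " + looksBlank(page, 0.001));
    }
}
```

In real use, the BufferedImage returned by page.convertToImage(...) would take the place of the synthetic test image.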

Thanks a lot for any hints,

Alex

--
Dr. Alexander G. Klenner
Fraunhofer-Institute for Algorithms and Scientific Computing (SCAI)
Schloss Birlinghoven, D-53754 Sankt Augustin
Tel.: +49 - 2241 - 14 - 2736
E-mail: alexander.garvin.klenner@scai.fraunhofer.de
Internet: http://www.scai.fraunhofer.de

Re: errors with PDPage.convertToImage()

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.

Hi Alexander,

you can ignore the info messages if the result you get is in line with your expectations. The info messages mean that, although PDFBox supports a fair amount of the PDF specification, not all of the specified operators are currently supported. PDFBox handles that situation and continues processing the rest of the PDF. As long as that doesn't affect the results you are expecting, you're fine.

BR
Maruan Sahyoun

On 08.04.2013 at 10:17, Alexander Klenner <al...@scai.fraunhofer.de> wrote:

> Hi Andreas,
> 
> Sorry, I was busy uploading the PDFs and writing the mail and didn't see your mail, but I figured PDFToImage might be the correct choice here ;).
> 
> I do not get any exceptions but some info logs, which are:
> 
> Apr 8, 2013 10:16:49 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
> INFO: unsupported/disabled operation: BX
> Apr 8, 2013 10:16:50 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
> INFO: unsupported/disabled operation: BDC
> Apr 8, 2013 10:16:50 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
> INFO: unsupported/disabled operation: BMC
> Apr 8, 2013 10:16:50 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
> INFO: unsupported/disabled operation: i
> Apr 8, 2013 10:16:50 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
> INFO: unsupported/disabled operation: DP
> Apr 8, 2013 10:16:51 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
> INFO: unsupported/disabled operation: EMC
> Apr 8, 2013 10:16:52 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
> INFO: unsupported/disabled operation: EX
> 
> 
> I get those for every page in this document.
> 
> Cheers,
> 
> Alex

Re: errors with PDPage.convertToImage()

Posted by Alexander Klenner <al...@scai.fraunhofer.de>.

Hi Andreas,

Sorry, I was busy uploading the PDFs and writing the mail and didn't see your mail, but I figured PDFToImage might be the correct choice here ;).

I do not get any exceptions but some info logs, which are:

Apr 8, 2013 10:16:49 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: BX
Apr 8, 2013 10:16:50 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: BDC
Apr 8, 2013 10:16:50 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: BMC
Apr 8, 2013 10:16:50 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: i
Apr 8, 2013 10:16:50 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: DP
Apr 8, 2013 10:16:51 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: EMC
Apr 8, 2013 10:16:52 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: EX


I get those for every page in this document.
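
Since the only symptom is these INFO lines, one option is to collect them per page instead of reading the console. The sketch below assumes that PDFBox's commons-logging output ends up in java.util.logging (the log format above suggests so) and that the logger is named after the class shown in the log, org.apache.pdfbox.util.PDFStreamEngine. Both are assumptions, and UnsupportedOperatorCollector is a hypothetical helper, not a PDFBox class. The main method only simulates two log records.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

// Hypothetical helper: remembers operators reported as unsupported/disabled.
class UnsupportedOperatorCollector extends Handler {

    // Message prefix copied from the log output quoted in this thread.
    private static final String PREFIX = "unsupported/disabled operation: ";

    private final List<String> operators = new ArrayList<String>();

    @Override
    public synchronized void publish(LogRecord record) {
        String msg = record.getMessage();
        if (msg != null && msg.startsWith(PREFIX)) {
            operators.add(msg.substring(PREFIX.length()).trim());
        }
    }

    @Override
    public void flush() {
        // Nothing is buffered outside the list.
    }

    @Override
    public void close() {
        // No resources to release.
    }

    /** Returns the operators seen since the last call and clears the list. */
    synchronized List<String> drain() {
        List<String> copy = new ArrayList<String>(operators);
        operators.clear();
        return copy;
    }

    public static void main(String[] args) {
        // Assumed logger name; keep a reference so the logger is not garbage collected.
        Logger logger = Logger.getLogger("org.apache.pdfbox.util.PDFStreamEngine");
        UnsupportedOperatorCollector collector = new UnsupportedOperatorCollector();
        logger.addHandler(collector);

        // Simulated records standing in for what rendering a page would log.
        logger.info("unsupported/disabled operation: BX");
        logger.info("unsupported/disabled operation: BDC");

        System.out.println(collector.drain()); // prints [BX, BDC]
    }
}
```

Calling drain() after rendering each page would give the operators skipped on that page. As far as the PDF specification goes, BX/EX mark compatibility sections, BMC/BDC/EMC/DP are marked-content operators, and i sets the flatness tolerance, so skipping them would not by itself be expected to blank out text.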

Cheers,

Alex

--
Dr. Alexander G. Klenner
Fraunhofer-Institute for Algorithms and Scientific Computing (SCAI)
Schloss Birlinghoven, D-53754 Sankt Augustin
Tel.: +49 - 2241 - 14 - 2736
E-mail: alexander.garvin.klenner@scai.fraunhofer.de
Internet: http://www.scai.fraunhofer.de



Re: errors with PDPage.convertToImage()

Posted by Andreas Lehmkühler <an...@lehmi.de>.

Hi,

Maruan Sahyoun <sa...@fileaffairs.de> wrote on 8 April 2013 at 09:20:
> Hi,
>
> unfortunately the attachment didn't make it through.
Due to some security restrictions.

> Could you try the PDF in question using the command line app ExtractImages with
> the -nonSeq parameter, or use the following code:
I guess there is a misunderstanding. Please use PDFToImage to create one image
for each page [1]. Provide us with any possible exception or log.

> PDDocument pdDoc = PDDocument.loadNonSeq(…)
>
> The NonSequentialParser gives better results if the document has incremental
> updates.
> In addition it's not necessary to create a new PDDocument from the cosDoc as
> parser.getDocument already passes a PDDocument ….
+1, that's an old pattern and should not be used any more.

> BR from your neighborhood
I'm not that far away either ;-)

> Maruan Sahyoun

BR
Andreas Lehmkühler

[1] http://pdfbox.apache.org/commandlineutilities/PDFToImage.html

Re: errors with PDPage.convertToImage()

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.

Hi,

could you also try the PDFToImage command, as Andreas suggested (and as I actually meant)? It converts a PDF to images page by page. ExtractImages extracts the images on the page but doesn't deal with text, line art ….

I will take a quick look at the sample you provided.

BR

Maruan Sahyoun

On 08.04.2013 at 10:14, Alexander Klenner <al...@scai.fraunhofer.de> wrote:

> Hi Maruan,
> 
> thank you, I now have a first clue about what is happening. As you suggested, I ran the ExtractImages command from the command line. It produces many images, and they are exactly the ones I see on the pages created by convertToImage().
> 
> Using the ExtractText command from the command line, I get all the text from this PDF.
> So for this particular PDF, convertToImage() somehow seems to return only what "ExtractImages" returns.
> I also tried PDFToImage with the nonSeq parameter; it returns exactly the semi-empty pages that my Java code produces.
> 
> So I conclude that for some PDFs convertToImage() returns text and images, while for others it returns only images. Is this the expected behaviour?
> 
> All PDFs I process have 'real' text, which is selectable and is not covered by an image layer of text of some sort (at least I think so).
> 
> I uploaded the PDF and the output of PDFToImage to https://www.dropbox.com/sh/inkcdahx4da1kzp/13bnj-BrZt
> 
> Cheers,
> 
> Alex

Re: errors with PDPage.convertToImage()

Posted by Andreas Lehmkuehler <an...@lehmi.de>.

Hi,

On 08.04.2013 at 10:14, Alexander Klenner wrote:
> Hi Maruan,
>
> thank you, I now have a first clue about what is happening. As you suggested, I ran the ExtractImages command from the command line. It produces many images, and they are exactly the ones I see on the pages created by convertToImage().
>
> Using the ExtractText command from the command line, I get all the text from this PDF.
> So for this particular PDF, convertToImage() somehow seems to return only what "ExtractImages" returns.
> I also tried PDFToImage with the nonSeq parameter; it returns exactly the semi-empty pages that my Java code produces.
>
> So I conclude that for some PDFs convertToImage() returns text and images, while for others it returns only images. Is this the expected behaviour?
>
> All PDFs I process have 'real' text, which is selectable and is not covered by an image layer of text of some sort (at least I think so).
>
> I uploaded the PDF and the output of PDFToImage to https://www.dropbox.com/sh/inkcdahx4da1kzp/13bnj-BrZt
I ran a quick test and I can confirm the described behaviour. There aren't any
exceptions or other obvious issues. It looks like the embedded Type 1 fonts are
somehow problematic, but for now I don't have any clue why.

> Cheers,
>
> Alex

BR
Andreas Lehmkühler

Re: errors with PDPage.convertToImage()

Posted by Alexander Klenner <al...@scai.fraunhofer.de>.

Hi Maruan,

thank you, I now have a first clue about what is happening. As you suggested, I ran the ExtractImages command from the command line. It produces many images, and they are exactly the ones I see on the pages created by convertToImage().

Using the ExtractText command from the command line, I get all the text from this PDF.
So for this particular PDF, convertToImage() somehow seems to return only what "ExtractImages" returns.
I also tried PDFToImage with the nonSeq parameter; it returns exactly the semi-empty pages that my Java code produces.

So I conclude that for some PDFs convertToImage() returns text and images, while for others it returns only images. Is this the expected behaviour?

All PDFs I process have 'real' text, which is selectable and is not covered by an image layer of text of some sort (at least I think so).

I uploaded the PDF and the output of PDFToImage to https://www.dropbox.com/sh/inkcdahx4da1kzp/13bnj-BrZt

Cheers,

Alex



--
Dr. Alexander G. Klenner
Fraunhofer-Institute for Algorithms and Scientific Computing (SCAI)
Schloss Birlinghoven, D-53754 Sankt Augustin
Tel.: +49 - 2241 - 14 - 2736
E-mail: alexander.garvin.klenner@scai.fraunhofer.de
Internet: http://www.scai.fraunhofer.de



Re: errors with PDPage.convertToImage()

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.

Hi,

unfortunately the attachment didn't make it through.

Could you try the PDF in question using the command line app ExtractImages with the -nonSeq parameter, or use the following code:

PDDocument pdDoc = PDDocument.loadNonSeq(…)

The NonSequentialParser gives better results if the document has incremental updates. In addition, it's not necessary to create a new PDDocument from the cosDoc, as parser.getDocument already passes a PDDocument ….

BR from your neighborhood


Maruan Sahyoun
