Posted to users@pdfbox.apache.org by Dan Fulea <fu...@gmail.com> on 2020/10/01 05:33:00 UTC

Image extraction

Hello,
I am using PDFBox for handling PDFs, and it does its job quite well most
of the time.
However, I encounter strange behaviour when extracting images embedded in
some PDFs.
I start with the following code (I think it is taken from one of your
tutorials):
for (PDPage page : list) {
    PDResources pdResources = page.getResources();
    for (COSName c : pdResources.getXObjectNames()) {
        PDXObject o = pdResources.getXObject(c);
        if (o instanceof PDImageXObject) {
            imageCount++;
            WRITEIMAGE(o,....); // WRITING IMAGE TO DISK GOES HERE
        }
    }
}
This is clean, logical and seems natural, but it poses a problem:
we always obtain two images for each real image in the PDF. One is the
correct image, the other is some kind of "negative" of it. Moreover, the
order of the images (the image index as they appear in the PDF from top
to bottom) is scrambled.

The second approach follows this tutorial:
https://www.tutorialkart.com/pdfbox/how-to-get-location-and-size-of-images-in-pdf/

The image-writing routine is called inside the processOperator method, just
before the following line:
System.out.println("\nImage [" + objectName.getName() + "]");
With this approach, we get the correct image count (no duplicates), in the
correct order. This is exactly what I want.

Although the two approaches look similar, why does the first one behave
so strangely?
Which way do you recommend for extracting the images?
I am uncomfortable not fully understanding these issues.

Please help me understand better, thank you,
Dan Fulea

Re: Image extraction

Posted by Tilman Hausherr <TH...@t-online.de>.
Hi,

tutorialkart is not "our" website.

The order in the resources has nothing to do with the visual position.

The "negative" might be an image mask that is in the resources even
though it is not drawn directly (it may be used by an image to get a
transparency effect).

The "second" approach is the one used by the ExtractImages.java tool 
which is available in the source code download. That would be the one to 
use.

Tilman
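
To make the difference between the two approaches concrete, here is a toy
model in plain Java. It is only a sketch and does not use the PDFBox API:
the class ImageOrderDemo, its two methods, and the resource names (Im1, Im2,
Mask1) are all hypothetical, and the "content stream" is just a list of
tokens. It assumes a page whose resources hold two images plus a soft mask
that is never painted on its own.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: hypothetical names, not PDFBox classes.
public class ImageOrderDemo {

    // First approach: report every image-like entry in the resource
    // dictionary, in whatever order the dictionary stores them.
    public static List<String> fromResources(Map<String, String> resources) {
        return new ArrayList<>(resources.keySet());
    }

    // Second approach: walk the content stream and report only the names
    // that are actually painted with the "Do" operator, in paint order.
    public static List<String> fromContentStream(Map<String, String> resources,
                                                 List<String> contentStream) {
        List<String> drawn = new ArrayList<>();
        for (int i = 1; i < contentStream.size(); i++) {
            if (!"Do".equals(contentStream.get(i))) {
                continue;
            }
            String operand = contentStream.get(i - 1); // operand precedes its operator
            if (operand.startsWith("/")) {
                String name = operand.substring(1);
                if (resources.containsKey(name)) {
                    drawn.add(name);
                }
            }
        }
        return drawn;
    }

    public static void main(String[] args) {
        // Dictionary order is unrelated to the visual order on the page.
        Map<String, String> resources = new LinkedHashMap<>();
        resources.put("Mask1", "soft mask used by Im1, never painted directly");
        resources.put("Im2", "image");
        resources.put("Im1", "image");

        // The page paints Im1 first, then Im2.
        List<String> contentStream =
                Arrays.asList("q", "/Im1", "Do", "Q", "q", "/Im2", "Do", "Q");

        System.out.println(fromResources(resources));                    // [Mask1, Im2, Im1]
        System.out.println(fromContentStream(resources, contentStream)); // [Im1, Im2]
    }
}
```

In this model the resource walk returns three entries, including the mask,
in dictionary order, while the content-stream walk returns only the two
painted images in the order they are drawn. That matches the behaviour
described in the thread, under the assumptions stated above.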




