
Posted to users@pdfbox.apache.org by Damien Levasseur <d....@cykia.com> on 2019/11/13 08:33:51 UTC

Extract images and get occurrence of same image

Hello all,

When I extract images (version 2.0.17, using PDResources, COSName, 
PDXObject, PDImageXObject), I correctly get all distinct images, but 
the same image is extracted only once. In the PDF file I'm trying to work 
on, there is one image repeated 3 times, and I wanted to get that.

How can I get a list of resources instead of a Dictionary? Or get the 
number of occurrences or the position of a repeated image?

Thanks

-- 
Regards,

*Damien LEVASSEUR*

Re: Extract images and get occurrence of same image

Posted by "Brian L. Matthews" <bl...@gmail.com>.

On 11/13/19 11:24 AM, Tilman Hausherr wrote:
> Am 13.11.2019 um 09:33 schrieb Damien Levasseur:
>> Hello all,
>>
>> When I extract images (version 2.0.17, using PDResources, COSName, 
>> PDXObject, PDImageXObject), I correctly get all distinct images, but 
>> the same image is extracted only once. In the PDF file I'm trying to work 
>> on, there is one image repeated 3 times, and I wanted to get that.
>>
>> How can I get a list of resources instead of a Dictionary? Or get the 
>> number of occurrences or the position of a repeated image?
>>
>> Thanks
>>
>
> Hi,
>
> The easiest would be to take the source code of the ExtractImages 
> tool, and simply remove the duplicate check.
>
>                 if (seen.contains(xobject.getCOSObject()))
>                 {
>                     // skip duplicate image
>                     return;
>                 }
>

Ah, cool. I suppose at the bottom it's doing what my code does, but 
instead lets PDFBox do most of the work.

Brian

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org

Re: Extract images and get occurrence of same image

Posted by Tilman Hausherr <TH...@t-online.de>.

Am 13.11.2019 um 09:33 schrieb Damien Levasseur:
> Hello all,
>
> When I extract images (version 2.0.17, using PDResources, COSName, 
> PDXObject, PDImageXObject), I correctly get all distinct images, but 
> the same image is extracted only once. In the PDF file I'm trying to work 
> on, there is one image repeated 3 times, and I wanted to get that.
>
> How can I get a list of resources instead of a Dictionary? Or get the 
> number of occurrences or the position of a repeated image?
>
> Thanks
>

Hi,

The easiest would be to take the source code of the ExtractImages tool, 
and simply remove the duplicate check.

                 if (seen.contains(xobject.getCOSObject()))
                 {
                     // skip duplicate image
                     return;
                 }


Tilman



Re: Extract images and get occurrence of same image

Posted by "Brian L. Matthews" <bl...@gmail.com>.

[Oops, replied directly to Damien last time so adding back in the list 
so everyone can make derisive comments about my code :-), and maybe 
suggest a better approach.]

There may be a better way to do it, but I get a page as a stream, then 
iterate over the stream. Here's some code I ripped out of some other 
stuff I have:

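// Imports needed by this fragment (PDFBox 2.0.x):
import java.util.ArrayList;
import java.util.List;
import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdfparser.PDFStreamParser;
import org.apache.pdfbox.pdmodel.PDPage;

// doc is an already-loaded PDDocument; page indexes are zero-based, so 1 is page 2.
PDPage page = doc.getPage(1);
PDFStreamParser parser = new PDFStreamParser(page);
List<COSBase> operands = new ArrayList<COSBase>();
Object token;

while ((token = parser.parseNextToken()) != null)
{
    // Operands come before their operator, so collect them until an operator shows up.
    if (token instanceof COSBase)
    {
        operands.add((COSBase) token);
        continue;
    }

    if (!(token instanceof Operator))
        throw new IllegalArgumentException("Unknown token " + token);

    String opName = ((Operator) token).getName();

    if (opName.equals("Do")) // Draw object (its single operand is the XObject name)
        System.out.println("Invoke XObject <" + ((COSName) operands.get(0)).getName() + ">");

    // Each operator consumes its operands, so start over for the next one.
    operands.clear();
}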

If I drop that in some code to open a file and parse it as a PDF 
document, when run on your document it outputs:

Invoke XObject <Im41>
Invoke XObject <Im41>
Invoke XObject <Im43>
Invoke XObject <Im43>
Invoke XObject <Im43>
Invoke XObject <Im45>

So it draws the first image twice, the second image 3 times, and the 
third image once. If all you need is a count, just use a Map of some 
sort to count the occurrences. If you actually need to know where 
they're drawn, that's harder, and you'll basically have to parse the rest 
of the operators and track things like graphics states and transformation 
matrices. For example, the few operators before the first Do operator to 
draw Im41 are:

           q  - Save graphics state
             gs - Set graphics state to <GSa>
             cm - Concat
               [Scale_X:    0.333333, Shear_X:    0.000000, 0
                Shear_Y:    0.000000, Scale_Y:    0.333333, 0
                Offset_X: 145.000000, Offset_Y: 843.000000, 1]
               to transformation matrix
             cs - Set non-stroking color space to color space <CSp>
             scn - Set non-stroking color to <0, 0, 0>
             gs - Set graphics state to <GSa>
             cm - Concat
               [Scale_X:   12.000000, Shear_X:    0.000000, 0
                Shear_Y:    0.000000, Scale_Y:  -15.000000, 0
                Offset_X:   0.000000, Offset_Y:  15.000000, 1]
               to transformation matrix
             Do - Invoke XObject <Im41>

And that's inside other graphics state transformations.
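
To make the matrix bookkeeping a bit more concrete, here is a minimal sketch that uses only the JDK (java.awt.geom.AffineTransform) rather than PDFBox. The class name CtmTracker and its methods are hypothetical names made up for this illustration. It assumes the usual PDF rule that cm prepends its matrix to the current transformation matrix and that an image is painted into the unit square. The numbers are the two cm matrices from the trace above, and the result is only relative to whatever transformation was already in effect at the q, so it is not yet a page position.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Rectangle2D;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Locale;

// Hypothetical helper: tracks the current transformation matrix (CTM)
// across q, Q and cm operators. Not part of PDFBox.
public class CtmTracker {
    private AffineTransform ctm = new AffineTransform(); // starts as identity
    private final Deque<AffineTransform> saved = new ArrayDeque<>();

    // q: save a copy of the current CTM
    public void save() {
        saved.push(new AffineTransform(ctm));
    }

    // Q: restore the most recently saved CTM (ignored if nothing was saved)
    public void restore() {
        if (!saved.isEmpty()) {
            ctm = saved.pop();
        }
    }

    // cm: the new matrix is applied to points first, then the old CTM
    public void concat(double a, double b, double c, double d, double e, double f) {
        ctm.concatenate(new AffineTransform(a, b, c, d, e, f));
    }

    // Do on an image: the image fills the unit square mapped through the CTM
    public Rectangle2D imageBounds() {
        return ctm.createTransformedShape(new Rectangle2D.Double(0, 0, 1, 1)).getBounds2D();
    }

    public static void main(String[] args) {
        CtmTracker tracker = new CtmTracker();
        tracker.save();                                        // q
        tracker.concat(0.333333, 0, 0, 0.333333, 145, 843);    // first cm in the trace
        tracker.concat(12, 0, 0, -15, 0, 15);                  // second cm in the trace
        Rectangle2D r = tracker.imageBounds();                 // where Do <Im41> would paint
        System.out.println(String.format(Locale.ROOT, "x=%.2f y=%.2f w=%.2f h=%.2f",
                r.getX(), r.getY(), r.getWidth(), r.getHeight()));
    }
}
```

With those two matrices the box comes out at roughly 4 x 5 units with its lower-left corner near (145, 843), before any outer transformations are applied. In real code the save, restore and concat calls would be driven from the q, Q and cm operators seen in the token loop.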

As I said, this may not be the best way to do this, but it works, so 
that's one advantage. :-) It was also written against an older version 
of PDFBox so there may be things in newer versions that would help. 
Anyways, it should give you a start.
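
For the counting suggestion earlier in this message, a minimal sketch of the Map approach could look like the following. It is plain JDK code with no PDFBox dependency; XObjectCounter and countInvocations are hypothetical names for this illustration, and the input list simply stands in for the names printed by the Do branch of the loop.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: tallies how often each XObject name was invoked.
public class XObjectCounter {

    // Returns name -> number of invocations, in order of first appearance.
    public static Map<String, Integer> countInvocations(List<String> names) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String name : names) {
            counts.merge(name, 1, Integer::sum); // insert 1 or add 1 to the existing count
        }
        return counts;
    }

    public static void main(String[] args) {
        // The names from the sample output above, in the order they were invoked.
        List<String> invoked = Arrays.asList("Im41", "Im41", "Im43", "Im43", "Im43", "Im45");
        System.out.println(countInvocations(invoked)); // {Im41=2, Im43=3, Im45=1}
    }
}
```

Inside the token loop, the same counts.merge(name, 1, Integer::sum) call could replace the println so the tally is built while parsing. Note that Do also invokes form XObjects, so the names would still need to be checked against the page resources to keep only images.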

Brian

On 11/13/19 10:51 AM, Damien Levasseur wrote:
>
> Thank you for your quick answer, here is the document, and i need 
> yellow cards on page 2.
>
> How do you suggest to iterate the document? because a loop on 
> resources only provided one instance of image.
>
> This is how i use it :
>
> getImagesFromResources(document.getPage(1).getResources());
>
> void getImagesFromResources(PDResources pdResources) throws IOException {
>     String dstPath = CybeleConfig.getPath() + "/local/uploaded/tmp/";
>     int imgIndex = 1;
>     for (COSName name : pdResources.getXObjectNames()) {
>         PDXObject xObject = pdResources.getXObject(name);
>
>         if (xObject instanceof PDFormXObject) {
>             getImagesFromResources(((PDFormXObject) xObject).getResources());
>
>         } else if (xObject instanceof PDImageXObject) {
>             PDImageXObject image = (PDImageXObject)xObject;
>
>             String filename = dstPath + "extracted-image-" + imgIndex + ".png";
>             ImageIO.write(image.getImage(), "png", new File(filename));
>             imgIndex++;
>         }
>     }
> }
>
> Thank you for your help
>
> Le 13/11/2019 à 18:35, Brian L. Matthews a écrit :
>> On 11/13/19 12:33 AM, Damien Levasseur wrote:
>>> Hello all,
>>>
>>> When I extract images (version 2.0.17, using PDResources, COSName, 
>>> PDXObject, PDImageXObject), I correctly get all distinct images, but 
>>> the same image is extracted only once. In the PDF file I'm trying to 
>>> work on, there is one image repeated 3 times, and I wanted to get that.
>>>
>>> How can I get a list of resources instead of a Dictionary? Or get the 
>>> number of occurrences or the position of a repeated image?
>>>
>>> Thanks
>>>
>>
>> This is partially a guess, but I'm assuming whatever wrote the PDF 
>> did that as a size optimization, and there isn't any way to know how 
>> many times an image is referenced without iterating over the 
>> document. As far as I know, there are no "back-references" associated 
>> with a resource pointing to everywhere it's used.
>>
>> Brian
>>
> -- 
> Regards,
>
> *Damien LEVASSEUR*
> Software engineer
> Ingénieur Développeur
>