Posted to users@pdfbox.apache.org by Damien Levasseur <d....@cykia.com> on 2019/11/13 08:33:51 UTC
Extract images and get occurrence of same image
Hello all,
When I extract images (version 2.0.17, using PDResources, COSName,
PDXObject, PDImageXObject), I correctly get all distinct images, but
each image is extracted only once. In the PDF file I'm trying to work
on, one image is repeated 3 times, and I want to detect that.
How can I get a list of resources instead of a dictionary? Or get the
number of occurrences or the positions of a repeated image?
Thanks
--
Regards,
*Damien LEVASSEUR*
Re: Extract images and get occurrence of same image
Posted by "Brian L. Matthews" <bl...@gmail.com>.
On 11/13/19 11:24 AM, Tilman Hausherr wrote:
> Hi,
>
> The easiest would be to take the source code of the ExtractImages
> tool, and simply remove the duplicate check.
>
> if (seen.contains(xobject.getCOSObject()))
> {
>     // skip duplicate image
>     return;
> }
>
Ah, cool. I suppose under the hood it does what my code does, but
lets PDFBox do most of the work.
Brian
---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org
Re: Extract images and get occurrence of same image
Posted by Tilman Hausherr <TH...@t-online.de>.
Hi,
The easiest would be to take the source code of the ExtractImages tool,
and simply remove the duplicate check.
if (seen.contains(xobject.getCOSObject()))
{
    // skip duplicate image
    return;
}
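
To illustrate why removing that check changes the result, here is a minimal, library-free sketch. It is not the ExtractImages source: the class ImageCollector, its drawImage method and the string keys are hypothetical stand-ins for the tool's visitor and for the value returned by xobject.getCOSObject(). The only point is the effect of the "seen" guard.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified model (an assumption, not PDFBox code) of a "seen" set
// guarding image extraction.
public class ImageCollector {
    private final Set<String> seen = new HashSet<>();
    private final List<String> extracted = new ArrayList<>();
    private final boolean skipDuplicates;

    public ImageCollector(boolean skipDuplicates) {
        this.skipDuplicates = skipDuplicates;
    }

    // Called once per draw of an image; objectKey stands in for the COS object.
    public void drawImage(String objectKey) {
        if (skipDuplicates && seen.contains(objectKey)) {
            return; // skip duplicate image
        }
        seen.add(objectKey);
        extracted.add(objectKey);
    }

    public List<String> getExtracted() {
        return extracted;
    }

    public static void main(String[] args) {
        String[] draws = {"imgA", "imgA", "imgA", "imgB"};
        ImageCollector withCheck = new ImageCollector(true);
        ImageCollector withoutCheck = new ImageCollector(false);
        for (String key : draws) {
            withCheck.drawImage(key);
            withoutCheck.drawImage(key);
        }
        System.out.println("with check:    " + withCheck.getExtracted());
        System.out.println("without check: " + withoutCheck.getExtracted());
    }
}
```

With the guard enabled, each distinct image is recorded once; with it disabled, every draw is recorded, which is the behaviour the original question asks for.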
Tilman
Re: Extract images and get occurrence of same image
Posted by "Brian L. Matthews" <bl...@gmail.com>.
[Oops, replied directly to Damien last time so adding back in the list
so everyone can make derisive comments about my code :-), and maybe
suggest a better approach.]
There may be a better way to do it, but I get a page as a stream, then
iterate over the stream. Here's some code I ripped out of some other
stuff I have:
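// Imports for PDFBox 2.0.x
import java.util.ArrayList;
import java.util.List;
import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdfparser.PDFStreamParser;
import org.apache.pdfbox.pdmodel.PDPage;

PDPage page = doc.getPage(1); // page indexes are 0-based, so this is page 2
PDFStreamParser parser = new PDFStreamParser(page);
List<COSBase> operands = new ArrayList<COSBase>();
Object token;
while ((token = parser.parseNextToken()) != null)
{
    if (token instanceof COSBase)
    {
        // Operands precede their operator; collect them until one arrives
        operands.add((COSBase) token);
        continue;
    }
    if (!(token instanceof Operator))
        throw new IllegalArgumentException("Unknown token " + token);
    String opName = ((Operator) token).getName();
    if (opName.equals("Do")) // Draw object
        System.out.println("Invoke XObject <" + ((COSName) operands.get(0)).getName() + ">");
    operands.clear();
}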
If I drop that in some code to open a file and parse it as a PDF
document, when run on your document it outputs:
Invoke XObject <Im41>
Invoke XObject <Im41>
Invoke XObject <Im43>
Invoke XObject <Im43>
Invoke XObject <Im43>
Invoke XObject <Im45>
So it draws the first image twice, the second image 3 times, and the
third image once. If all you need is a count, just use a Map of some
sort to count the occurrences. If you actually need to know where
they're drawn, that's harder: you'll basically have to parse the rest of
the operators and track things like graphics states and transformation
matrices. For example, the operators just before the first Do operator
that draws Im41 are:
q - Save graphics state
gs - Set graphics state to <GSa>
cm - Concat
    [Scale_X: 0.333333, Shear_X: 0.000000, 0
     Shear_Y: 0.000000, Scale_Y: 0.333333, 0
     Offset_X: 145.000000, Offset_Y: 843.000000, 1]
    to transformation matrix
cs - Set non-stroking color space to color space <CSp>
scn - Set non-stroking color to <0, 0, 0>
gs - Set graphics state to <GSa>
cm - Concat
    [Scale_X: 12.000000, Shear_X: 0.000000, 0
     Shear_Y: 0.000000, Scale_Y: -15.000000, 0
     Offset_X: 0.000000, Offset_Y: 15.000000, 1]
    to transformation matrix
Do - Invoke XObject <Im41>
And that's inside other graphics state transformations.
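
As a rough illustration of what "tracking transformation matrices" involves, the sketch below keeps a current transformation matrix (CTM) with a save/restore stack and maps the image's unit square through it when a Do is reached. It uses only java.awt.geom, not PDFBox; the class name CtmTracker and its methods are made up for this example, and it assumes the usual PDF convention that each cm matrix is applied before the matrices already in effect.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Rectangle2D;
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper: tracks the CTM across q / Q / cm operators.
public class CtmTracker {
    private AffineTransform ctm = new AffineTransform(); // starts as identity
    private final Deque<AffineTransform> saved = new ArrayDeque<>();

    // q: push a copy of the current matrix
    public void save() {
        saved.push(new AffineTransform(ctm));
    }

    // Q: pop the most recently saved matrix
    public void restore() {
        ctm = saved.pop();
    }

    // cm with operands a b c d e f: the new matrix acts on points first,
    // then the matrices that were already in effect
    public void concat(double a, double b, double c, double d, double e, double f) {
        ctm.concatenate(new AffineTransform(a, b, c, d, e, f));
    }

    // Do on an image: an image occupies the unit square in its own space
    public Rectangle2D imageBounds() {
        return ctm.createTransformedShape(new Rectangle2D.Double(0, 0, 1, 1)).getBounds2D();
    }

    public static void main(String[] args) {
        CtmTracker tracker = new CtmTracker();
        tracker.save();                                        // q
        tracker.concat(0.333333, 0, 0, 0.333333, 145, 843);    // first cm
        tracker.concat(12, 0, 0, -15, 0, 15);                  // second cm
        Rectangle2D box = tracker.imageBounds();               // Do <Im41>
        System.out.println("x=" + box.getX() + " y=" + box.getY()
                + " w=" + box.getWidth() + " h=" + box.getHeight());
    }
}
```

Fed the two cm matrices listed above, this places Im41 in a box of roughly 4 by 5 units with its lower-left corner near (145, 843), still relative to whatever outer transformations are in effect. The gs, cs and scn operators do not change the matrix, so the sketch ignores them.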
As I said, this may not be the best way to do this, but it works, so
that's one advantage. :-) It was also written against an older version
of PDFBox so there may be things in newer versions that would help.
Anyways, it should give you a start.
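
For the simpler case where only a count is needed, the Map idea mentioned earlier could look like the sketch below. It is plain Java with no PDFBox dependency: XObjectCounter and countOccurrences are hypothetical names, and the input is assumed to be the XObject names in the order the Do operators were encountered (for instance collected in the loop instead of printed).

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: counts how often each XObject name is drawn.
public class XObjectCounter {

    // LinkedHashMap keeps the names in first-seen order
    public static Map<String, Integer> countOccurrences(List<String> drawnNames) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String name : drawnNames) {
            counts.merge(name, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // The names printed for the sample document above
        List<String> drawn = Arrays.asList("Im41", "Im41", "Im43", "Im43", "Im43", "Im45");
        System.out.println(countOccurrences(drawn)); // prints {Im41=2, Im43=3, Im45=1}
    }
}
```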
Brian
On 11/13/19 10:51 AM, Damien Levasseur wrote:
>
> Thank you for your quick answer. Here is the document; I need the
> yellow cards on page 2.
>
> How do you suggest iterating over the document? A loop over the
> resources only provides one instance of each image.
>
> This is how I use it:
>
> getImagesFromResources(document.getPage(1).getResources());
>
> void getImagesFromResources(PDResources pdResources) throws IOException {
>     String dstPath = CybeleConfig.getPath() + "/local/uploaded/tmp/";
>     int imgIndex = 1;
>     for (COSName name : pdResources.getXObjectNames()) {
>         PDXObject xObject = pdResources.getXObject(name);
>
>         if (xObject instanceof PDFormXObject) {
>             getImagesFromResources(((PDFormXObject) xObject).getResources());
>
>         } else if (xObject instanceof PDImageXObject) {
>             PDImageXObject image = (PDImageXObject) xObject;
>
>             String filename = dstPath + "extracted-image-" + imgIndex + ".png";
>             ImageIO.write(image.getImage(), "png", new File(filename));
>             imgIndex++;
>         }
>     }
> }
>
> Thank you for your help
>
> On 13/11/2019 at 18:35, Brian L. Matthews wrote:
>>
>> This is partially a guess, but I'm assuming whatever wrote the PDF
>> did that as a size optimization, and there isn't any way to know how
>> many times an image is referenced without iterating over the
>> document. As far as I know, there are no "back-references" associated
>> with a resource pointing to everywhere it's used.
>>
>> Brian
>>