Posted to users@pdfbox.apache.org by David Patterson <pa...@gmail.com> on 2017/04/06 19:22:21 UTC

Looking for a way to iterate over images in a PDF

I've got some PDFs to try to read. Many of them have images in them. I'd
like to be able to iterate over the images and determine their encoding
(png vs. jpeg vs. ?) and size.

I've found a sample that lets me iterate over the PDXObject entities, but
I'm missing a key piece to determine the size and format of the objects.

a) Is a PDXObject always an image, or could it be something else?

Here is the code I've got so far.

for ( PDPage aPage : pdfDocument.getPages() ) {
    PDResources pdResources = aPage.getResources();
    for ( COSName cosObject : pdResources.getXObjectNames() ) {
        PDXObject xObj = pdResources.getXObject( cosObject );
        System.out.println( "got an image maybe" );
    }
}

This is where I've gotten stumped. I've looked at lots of lists of
COS-whatever things, but it has not led me to "the answer."

Thanks for any guidance you can provide.

Dave Patterson

Re: Looking for a way to iterate over images in a PDF

Posted by David Patterson <pa...@gmail.com>.

Tilman,

Thanks. That works perfectly. Now I need to go through it in detail to
figure out how it extracts the image and metadata.

Dave Patterson

On Fri, Apr 7, 2017 at 5:32 PM, Tilman Hausherr <TH...@t-online.de>
wrote:

> On 07.04.2017 at 22:59, David Patterson wrote:
>
>> Tilman,
>>
>> The ExtractImages sample code is a 1.8 artifact (I believe). It has a lot
>> of errors when compiled with 2.0.5 libraries.
>>
>
> Please try this one:
> https://svn.apache.org/viewvc/pdfbox/trunk/tools/src/main/java/org/apache/pdfbox/tools/ExtractImages.java?view=markup
>
> Tilman

Re: Looking for a way to iterate over images in a PDF

Posted by Tilman Hausherr <TH...@t-online.de>.

On 07.04.2017 at 22:59, David Patterson wrote:
> Tilman,
>
> The ExtractImages sample code is a 1.8 artifact (I believe). It has a lot
> of errors when compiled with 2.0.5 libraries.

Please try this one:
https://svn.apache.org/viewvc/pdfbox/trunk/tools/src/main/java/org/apache/pdfbox/tools/ExtractImages.java?view=markup

Tilman


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org

Re: Looking for a way to iterate over images in a PDF

Posted by David Patterson <pa...@gmail.com>.

Tilman,

The ExtractImages sample code is a 1.8 artifact (I believe). It has a lot
of errors when compiled with 2.0.5 libraries.

1) Two imports are no longer in the 2.0.5 library:
import org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectForm;
import org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage;

2) Missing methods or methods with different signatures:
PDDocument.loadNonSeq(                        ** method not defined
PDDocument.load(                              ** load now requires a File, not a String
document.openProtection (
document.getDocumentCatalog().getAllPages()   ** getAllPages is missing from the PDDocumentCatalog
resources.getXObjects()                       ** where resources is a PDResources object
if (xobject instanceof PDXObjectImage)        ** PDXObjectImage is not defined
else if (xobject instanceof PDXObjectForm)    ** same with PDXObjectForm

Maybe a new ExtractImages2 program needs to be developed for the PDFBox 2
era.

Dave Patterson





Re: Looking for a way to iterate over images in a PDF

Posted by Tilman Hausherr <TH...@t-online.de>.

On 06.04.2017 at 21:22, David Patterson wrote:
> I've got some PDFs to try to read. Many of them have images in them. I'd
> like to be able to iterate over the images and determine their encoding
> (png vs. jpeg vs. ?) and size.
>
> I've found a sample that lets me iterate over the PDXObject entities, but
> I'm missing a key piece to determine the size and format of the objects.
>
> a) Is a PDXObject always an image, or could it be something else?

Yes, it could be something else, e.g. a form. That's why all examples
(e.g. ExtractImages.java) always check the type and then cast to the image
xobject type. The image xobject will give you the size and the filters.

Tilman
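
As a rough illustration of what "the filters" tell you: a PDF does not
store images as PNG or JPEG files as such. Each image stream carries a list
of filter names, and the format is inferred from them. DCTDecode means JPEG
data, JPXDecode means JPEG 2000, and FlateDecode is just deflate-compressed
samples that tools usually export as PNG. In PDFBox 2.x the image type is
PDImageXObject, which offers getWidth(), getHeight() and getSuffix(), and
the filter list is available through getStream().getFilters(). The sketch
below is plain Java with no PDFBox dependency. ImageFormatGuess and
guessSuffix are hypothetical names, and the mapping is only loosely modeled
on what getSuffix() does, so treat it as an assumption rather than the
library's behavior.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not a PDFBox class): guesses a file suffix for an
// embedded image from the names of the filters on its stream.
class ImageFormatGuess {

    static String guessSuffix(List<String> filterNames) {
        // No filter at all: raw samples, which tools typically export as PNG.
        if (filterNames == null || filterNames.isEmpty()) {
            return "png";
        }
        if (filterNames.contains("DCTDecode")) {
            return "jpg";   // JPEG data stored as-is
        }
        if (filterNames.contains("JPXDecode")) {
            return "jpx";   // JPEG 2000
        }
        if (filterNames.contains("CCITTFaxDecode")) {
            return "tiff";  // fax-style bilevel data
        }
        if (filterNames.contains("JBIG2Decode")) {
            return "jb2";   // JBIG2 bilevel data
        }
        if (filterNames.contains("FlateDecode")
                || filterNames.contains("LZWDecode")
                || filterNames.contains("RunLengthDecode")) {
            return "png";   // lossless generic compression
        }
        return "unknown";   // assumption: anything else is left unclassified
    }

    public static void main(String[] args) {
        // Filter names as they would appear on an image stream.
        System.out.println(guessSuffix(Arrays.asList("DCTDecode")));
        System.out.println(guessSuffix(Arrays.asList("FlateDecode")));
        System.out.println(guessSuffix(Arrays.asList("ASCII85Decode", "DCTDecode")));
    }
}
```

With PDFBox on the classpath, the strings passed in would come from calling
getName() on each COSName in the image's filter list, after an instanceof
check has confirmed that the PDXObject really is an image and not a form.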


