Posted to dev@pdfbox.apache.org by "Pontus Hulin (Created) (JIRA)" <ji...@apache.org> on 2011/11/01 15:03:32 UTC

[jira] [Created] (PDFBOX-1154) pdfbox exports 1200+ images from a pdf instead of one

pdfbox exports 1200+ images from a pdf instead of one
-----------------------------------------------------

                 Key: PDFBOX-1154
                 URL: https://issues.apache.org/jira/browse/PDFBOX-1154
             Project: PDFBox
          Issue Type: New Feature
    Affects Versions: 1.6.0
         Environment: Mac OS X 10.6
            Reporter: Pontus Hulin


I have a PDF from which I export all images. My problem is that I get 1290 images after export. If I export all images from the PDF in Acrobat Pro, I get only one. There must be some way that the PDF composes these images together, but I can't figure out how.

The PDF was probably made from an ad PDF placed in an InDesign CS4 document and exported as a PDF by InDesign Server 6.x.

I don't need to compose all the images into one; I just want to filter out the ones that "belong" together. I will attach the PDF to this issue.
Does anyone know how to do that?

Best regards
/ Pontus Hulin

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (PDFBOX-1154) pdfbox exports 1200+ images from a pdf instead of one

Posted by "Pontus Hulin (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13141260#comment-13141260 ] 

Pontus Hulin commented on PDFBOX-1154:
--------------------------------------

Hello Andreas,
I don't really need to compose them, but I need to detect them so that I can filter them out. I have not found any way to do that yet. Do you have any ideas?
I guess I could use the coordinates, but I think that would be quite expensive to calculate.
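
A minimal sketch of the coordinate idea, assuming the bounding box of each placed image on the page has already been collected by some other means (how to obtain those boxes from PDFBox is not shown here). The class name TileGrouper and the tolerance parameter are hypothetical. Rectangles that overlap or touch, directly or through a chain of neighbours, end up with the same group id, so the pieces of one tiled picture can be detected as one large group. For about 1290 rectangles the pairwise loop is roughly 830,000 comparisons, which is likely cheaper than it sounds.

```java
import java.awt.geom.Rectangle2D;
import java.util.List;

class TileGrouper {
    // Union-find root lookup with path halving.
    static int find(int[] parent, int i) {
        while (parent[i] != i) {
            parent[i] = parent[parent[i]];
            i = parent[i];
        }
        return i;
    }

    // True if the rectangles overlap or touch (edges or corners), within a tolerance.
    static boolean touches(Rectangle2D a, Rectangle2D b, double tol) {
        return a.getMinX() <= b.getMaxX() + tol && b.getMinX() <= a.getMaxX() + tol
            && a.getMinY() <= b.getMaxY() + tol && b.getMinY() <= a.getMaxY() + tol;
    }

    // Returns one group id per rectangle; connected rectangles share the same id.
    static int[] group(List<Rectangle2D> boxes, double tol) {
        int n = boxes.size();
        int[] parent = new int[n];
        for (int i = 0; i < n; i++) {
            parent[i] = i;
        }
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                if (touches(boxes.get(i), boxes.get(j), tol)) {
                    parent[find(parent, i)] = find(parent, j);
                }
            }
        }
        int[] ids = new int[n];
        for (int i = 0; i < n; i++) {
            ids[i] = find(parent, i);
        }
        return ids;
    }
}
```

A group with very many members would then be a candidate for "one picture split into tiles", while an image that sits alone in its group is more likely a real standalone image.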
                
> pdfbox exports 1200+ images from a pdf instead of one
> -----------------------------------------------------
>
>                 Key: PDFBOX-1154
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1154
>             Project: PDFBox
>          Issue Type: New Feature
>    Affects Versions: 1.6.0
>         Environment: Mac OS X 10.6
>            Reporter: Pontus Hulin
>              Labels: extract, images
>         Attachments: testfile.pdf
>
>
> I have a PDF from which I export all images. My problem is that I get 1290 images after export. If I export all images from the PDF in Acrobat Pro, I get only one. There must be some way that the PDF composes these images together, but I can't figure out how.
> The PDF was probably made from an ad PDF placed in an InDesign CS4 document and exported as a PDF by InDesign Server 6.x.
> I don't need to compose all the images into one; I just want to filter out the ones that "belong" together. I will attach the PDF to this issue.
> Does anyone know how to do that?
> Best regards
> / Pontus Hulin


        

[jira] [Updated] (PDFBOX-1154) pdfbox exports 1200+ images from a pdf instead of one

Posted by "Pontus Hulin (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pontus Hulin updated PDFBOX-1154:
---------------------------------

    Attachment: testfile.pdf

When extracting images from this file I get 1290 images. How can I filter out the ones I do not want?
                

        

[jira] [Commented] (PDFBOX-1154) pdfbox exports 1200+ images from a pdf instead of one

Posted by "Andreas Lehmkühler (Commented JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13141269#comment-13141269 ] 

Andreas Lehmkühler commented on PDFBOX-1154:
--------------------------------------------

Maybe the size of each individual piece could be an indicator?
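
A rough sketch of what a size-based filter could look like, assuming the pixel width and height of every extracted image are already known. The class name SizeFilter and the threshold values are hypothetical and would need tuning against real files; this is not an existing PDFBox feature.

```java
import java.util.ArrayList;
import java.util.List;

class SizeFilter {
    // Keeps only the {width, height} entries whose dimensions both reach the given minimum.
    static List<int[]> keepLarge(List<int[]> sizes, int minWidth, int minHeight) {
        List<int[]> kept = new ArrayList<>();
        for (int[] size : sizes) {
            if (size[0] >= minWidth && size[1] >= minHeight) {
                kept.add(size);
            }
        }
        return kept;
    }
}
```

Calling keepLarge(sizes, 64, 64), for example, would drop every piece that is narrower or shorter than 64 pixels. Note that this only helps if the tiles are clearly smaller than the images that should be kept.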
                

       

[jira] [Commented] (PDFBOX-1154) pdfbox exports 1200+ images from a pdf instead of one

Posted by "Pontus Hulin (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13152023#comment-13152023 ] 

Pontus Hulin commented on PDFBOX-1154:
--------------------------------------

I have taken a look at the PDF files that display this problem, and this is what I have found:
All of the PDF files seem to contain ImageI in the ProcSet in the page Resources.

The PDF also seems to contain an /Indexed object for each image that is part of the large image. The /Indexed object looks like a color space.

Example:
71 0 obj
[/Indexed/DeviceCMYK 33 145 0 R]

So my conclusion is that if an image references a color space that is Indexed, we should not bother with it.
Does anyone else know whether this is the case?
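
A small sketch of the check described above, assuming the color space entry is available as the raw array text, as in the example object. The class and method names are hypothetical, and the code works on a plain string rather than on PDFBox objects. Keep in mind that an Indexed color space is a general palette mechanism in PDF, so images that are not tiles may use it too; the test is a heuristic at best.

```java
class IndexedColorSpaceCheck {
    // Returns true if the text is a PDF array whose first element is the name /Indexed,
    // for example "[/Indexed/DeviceCMYK 33 145 0 R]".
    static boolean isIndexed(String colorSpaceArray) {
        String s = colorSpaceArray.trim();
        if (!s.startsWith("[")) {
            return false; // a plain name such as /DeviceRGB is not an Indexed array
        }
        s = s.substring(1).trim();
        String name = "/Indexed";
        if (!s.startsWith(name)) {
            return false;
        }
        if (s.length() == name.length()) {
            return true;
        }
        // The name must end here: the next character has to be whitespace or a PDF delimiter.
        char next = s.charAt(name.length());
        return Character.isWhitespace(next) || "/[]()<>".indexOf(next) >= 0;
    }
}
```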

I will do some more testing and post the results here.

Best regards
/ Pontus
                

        

[jira] [Commented] (PDFBOX-1154) pdfbox exports 1200+ images from a pdf instead of one

Posted by "Andreas Lehmkühler (Commented JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13141237#comment-13141237 ] 

Andreas Lehmkühler commented on PDFBOX-1154:
--------------------------------------------

First of all, I fixed a color issue with the attached file in revision 1196047. Many of those images use an ICC-based color space with 4 color components (CMYK), which has to be converted to an RGB-based color space.

I don't know if there is any "glue" which composes all the images into one. As a workaround, you may want to convert the whole page to an image and crop it to the interesting part.
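
A sketch of the cropping half of that workaround, assuming the page has already been rendered to a java.awt.image.BufferedImage at a known resolution (the rendering call itself is not shown). The class name PageCrop and its parameters are hypothetical. It also assumes an unrotated page whose lower-left corner is at (0, 0) in PDF units of 1/72 inch, so the y axis has to be flipped for the image.

```java
import java.awt.image.BufferedImage;

class PageCrop {
    // Crops a rendered page to a rectangle given in PDF points (origin at the lower left).
    static BufferedImage crop(BufferedImage page, double x, double y, double w, double h,
                              double pageHeightPt, int dpi) {
        double scale = dpi / 72.0; // pixels per PDF point
        int px = (int) Math.floor(x * scale);
        // Image rows count down from the top, PDF y counts up from the bottom.
        int py = (int) Math.floor((pageHeightPt - (y + h)) * scale);
        int pw = (int) Math.ceil(w * scale);
        int ph = (int) Math.ceil(h * scale);
        // Clamp the rectangle to the image bounds.
        px = Math.max(0, px);
        py = Math.max(0, py);
        pw = Math.min(pw, page.getWidth() - px);
        ph = Math.min(ph, page.getHeight() - py);
        return page.getSubimage(px, py, pw, ph);
    }
}
```

getSubimage returns a view that shares pixel data with the page image, so the crop itself costs very little compared with rendering the page.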

                