Posted to dev@pdfbox.apache.org by "susheel (Created) (JIRA)" <ji...@apache.org> on 2011/11/14 11:12:51 UTC

[jira] [Created] (PDFBOX-1169) Images extracted from PDF are loosing color (are shown in blackcolor)

Images extracted from PDF are loosing color (are shown in blackcolor)
---------------------------------------------------------------------

                 Key: PDFBOX-1169
                 URL: https://issues.apache.org/jira/browse/PDFBOX-1169
             Project: PDFBox
          Issue Type: Bug
          Components: Utilities
    Affects Versions: 1.6.0
         Environment: Windows
            Reporter: susheel


Using PDFBox, I tried to read the attached file, eBook-Mini.pdf.
When the images are extracted with the code given below, they do not match the ones in the PDF: they have lost their color.
Extracting the images with other tools gives correct results.
The images extracted with PDFBox are attached as well.



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (PDFBOX-1169) Images extracted from PDF are loosing color (are shown in blackcolor)

Posted by "Andreas Lehmkühler (Commented JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-1169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160791#comment-13160791 ] 

Andreas Lehmkühler commented on PDFBOX-1169:
--------------------------------------------

I found 3 different issues:

- The given PDF contains 2 images embedded in an XObjectForm that is itself embedded in another XObjectForm, so ExtractImages could not extract them. I fixed that in revision 1209017.
- PDJpeg.write2OutputStream assumed that every PDJpeg contains JPEG image data because the DCTFilter is used, but a PDJpeg may also contain CMYK-encoded image data, as in the given PDF. I fixed that in revision 1209015.
- The colors of the image are wrong, but I don't know why. I'm still investigating.
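
The first point (images inside a form that is inside another form) comes down to walking the resources recursively instead of looking only at the page level. The following is a minimal sketch of that idea, not the code from revision 1209017. The Resources class and all names in it are hypothetical stand-ins, not PDFBox types.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class NestedImageWalk {

    /** Hypothetical stand-in for a resource dictionary: named images plus named nested forms. */
    static final class Resources {
        final Map<String, String> images = new LinkedHashMap<>();   // name -> placeholder payload
        final Map<String, Resources> forms = new LinkedHashMap<>(); // name -> resources of a nested form
    }

    /** Records the images at this level, then recurses into every nested form. */
    static void collectImages(Resources resources, String prefix, List<String> found) {
        if (resources == null) {
            return;
        }
        for (String name : resources.images.keySet()) {
            found.add(prefix + name);
        }
        for (Map.Entry<String, Resources> form : resources.forms.entrySet()) {
            collectImages(form.getValue(), prefix + form.getKey() + "/", found);
        }
    }

    public static void main(String[] args) {
        // Two images inside a form that is itself inside another form.
        Resources inner = new Resources();
        inner.images.put("Im1", "...");
        inner.images.put("Im2", "...");
        Resources outer = new Resources();
        outer.forms.put("Fm1", inner);
        Resources page = new Resources();
        page.images.put("Im0", "...");
        page.forms.put("Fm0", outer);

        List<String> found = new ArrayList<>();
        collectImages(page, "", found);
        System.out.println(found); // [Im0, Fm0/Fm1/Im1, Fm0/Fm1/Im2]
    }
}
```

A walk that stops at the page level would report only Im0 here. A real document could in principle contain forms that reference each other, so production code would probably also need a guard against revisiting the same form.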
                

[jira] [Commented] (PDFBOX-1169) Images extracted from PDF are loosing color (are shown in blackcolor)

Posted by "susheel (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-1169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13149529#comment-13149529 ] 

susheel commented on PDFBOX-1169:
---------------------------------

Code used to extract the images:

import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.util.Map;
import javax.imageio.ImageIO;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage;

// imageCounter is an int field of the enclosing class (its declaration was not part of the post).
private void processImages(PDResources resources, String destinationFolder) throws IOException {
	Map<String, PDXObjectImage> images = resources.getImages();
	if (images == null) {
		return;
	}
	for (PDXObjectImage image : images.values()) {
		String suffix = image.getSuffix();
		// destinationFolder is expected to end with a path separator.
		String name = destinationFolder + "image-" + imageCounter++ + "." + suffix;

		// image.write2file(name) was tried as well, but the retrieved images were similar.
		BufferedImage bufferedImage = image.getRGBImage();
		File outputFile = new File(name);
		ImageIO.write(bufferedImage, suffix, outputFile);
		System.out.println("szaveri - using imageio to write files " + name + " suffix =" + suffix);
	}
}


Please note that of the 200-odd images in the PDF, only two were extracted correctly; all the rest come out with a black background.

I am sure I am missing some configuration or other parameter, but I am unable to find it.

Just to update, I have also added the following JAI jars to my project:
jai_codec
jai_core
mlibwrapper_jai
                

[jira] [Updated] (PDFBOX-1169) Images extracted from PDF are loosing color (are shown in blackcolor)

Posted by "susheel (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-1169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

susheel updated PDFBOX-1169:
----------------------------

    Attachment: image-2.jpg
                image-1.jpg

Images which were extracted after reading the PDF using PDFBox.
                

[jira] [Commented] (PDFBOX-1169) Images extracted from PDF are loosing color (are shown in blackcolor)

Posted by "susheel (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-1169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13162645#comment-13162645 ] 

susheel commented on PDFBOX-1169:
---------------------------------

Dear Andreas,

I hope you can crack the third issue soon.

We have taken your two fixes and run the test on our PDF. Image quality has improved considerably. I am sure that once we have the fix for the final issue from your end, we will be able to extract the images from the PDF quite easily.

If you need any data or input from our end, kindly let us know.

Thanks
Susheel Zaveri
                

[jira] [Commented] (PDFBOX-1169) Images extracted from PDF are loosing color (are shown in blackcolor)

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-1169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278703#comment-13278703 ] 

Andreas Lehmkühler commented on PDFBOX-1169:
--------------------------------------------

I guess the remaining issue is caused by a missing feature called overprint control, which is part of the extended graphics state. PDFBOX-1223 describes a similar issue.
                

[jira] [Updated] (PDFBOX-1169) Images extracted from PDF are loosing color (are shown in blackcolor)

Posted by "susheel (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-1169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

susheel updated PDFBOX-1169:
----------------------------

    Attachment: eBook-Mini.pdf

eBook-Mini.pdf is the sample PDF that we used for extracting the images.
                

[jira] [Resolved] (PDFBOX-1169) Images extracted from PDF are loosing color (are shown in blackcolor)

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-1169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler resolved PDFBOX-1169.
----------------------------------------

       Resolution: Fixed
    Fix Version/s: 1.8.0
         Assignee: Andreas Lehmkühler

My former guess was wrong. The JPEG uses a CMYK-colorspace but the image data are encoded using a YCCK colorspace.
I added a YCCK2RGB decoder in revision 1395294.
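
For readers who want to see what such a conversion involves, here is a minimal sketch. It is not the decoder added in revision 1395294. The class and method names are made up for illustration, the YCbCr step uses the usual JFIF coefficients, and the CMYK to RGB step is the naive formula without an ICC profile. It assumes that after conversion 0 means no ink and 255 means full ink; JPEGs written by Adobe tools often store the values inverted, in which case the samples would have to be flipped first. The second method reads the transform byte of the Adobe APP14 marker (0 = none, 1 = YCbCr, 2 = YCCK), which is one way to tell a YCCK-encoded JPEG from a plain CMYK one.

```java
import java.nio.charset.StandardCharsets;

final class YcckSketch {

    /** Clamps a value to the 0..255 range of an 8-bit sample. */
    static int clamp(double v) {
        return (int) Math.max(0, Math.min(255, Math.round(v)));
    }

    /** Converts one YCCK pixel (8-bit samples) to a packed 0xRRGGBB value. */
    static int ycckToRgb(int y, int cb, int cr, int k) {
        // Step 1: YCbCr -> RGB with the JFIF coefficients.
        int r0 = clamp(y + 1.402 * (cr - 128));
        int g0 = clamp(y - 0.344136 * (cb - 128) - 0.714136 * (cr - 128));
        int b0 = clamp(y + 1.772 * (cb - 128));
        // Step 2: the complement of that RGB triple is the CMY part; K is stored as-is.
        int c = 255 - r0;
        int m = 255 - g0;
        int ye = 255 - b0;
        // Step 3: naive CMYK -> RGB (assumption: 0 = no ink, 255 = full ink).
        int r = (255 - c) * (255 - k) / 255;
        int g = (255 - m) * (255 - k) / 255;
        int b = (255 - ye) * (255 - k) / 255;
        return (r << 16) | (g << 8) | b;
    }

    /** Returns the transform byte of the Adobe APP14 segment, or -1 if none is found before the scan data. */
    static int adobeTransform(byte[] jpeg) {
        int pos = 2; // skip the SOI marker (FF D8)
        while (pos + 4 <= jpeg.length) {
            if ((jpeg[pos] & 0xFF) != 0xFF) {
                return -1; // not positioned on a marker, give up
            }
            int marker = jpeg[pos + 1] & 0xFF;
            if (marker == 0xFF) {
                pos++; // fill byte before the real marker
                continue;
            }
            if (marker == 0xDA || marker == 0xD9) {
                return -1; // start of scan or end of image reached
            }
            int length = ((jpeg[pos + 2] & 0xFF) << 8) | (jpeg[pos + 3] & 0xFF);
            if (length < 2) {
                return -1; // malformed segment length
            }
            int data = pos + 4;
            if (marker == 0xEE && length >= 14 && data + 12 <= jpeg.length
                    && "Adobe".equals(new String(jpeg, data, 5, StandardCharsets.US_ASCII))) {
                // Payload: "Adobe"(5) + version(2) + flags0(2) + flags1(2) + transform(1)
                return jpeg[data + 11] & 0xFF;
            }
            pos += 2 + length; // marker bytes plus the segment (length includes its own 2 bytes)
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.printf("%06X%n", ycckToRgb(255, 128, 128, 0));   // FFFFFF: no ink at all
        System.out.printf("%06X%n", ycckToRgb(255, 128, 128, 255)); // 000000: full black ink
    }
}
```

If a JPEG reports four components and a transform value of 2, treating its samples as plain CMYK would give wrong colors, which matches the symptom described in this issue.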

Thanks for the report!
                
> Images extracted from PDF are loosing color (are shown in blackcolor)
> ---------------------------------------------------------------------
>
>                 Key: PDFBOX-1169
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1169
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Utilities
>    Affects Versions: 1.6.0
>         Environment: Windows
>            Reporter: susheel
>            Assignee: Andreas Lehmkühler
>              Labels: overprintcontrol
>             Fix For: 1.8.0
>
>         Attachments: eBook-Mini.pdf, image-1.jpg, image-2.jpg
>
>
> Using PDFBox, tried to read file (eBook-Mini.pdf, which is attached)
> When images are extracted using below mentioned code, the extracted images aren't as per the ones in PDF, they have lost color.
> Checked extracting images, using other tools and images were extracted correctly.
> Attached images extracted using PDFBox as well.
