Posted to dev@pdfbox.apache.org by "Tilman Hausherr (JIRA)" <ji...@apache.org> on 2013/08/09 19:54:50 UTC
[jira] [Comment Edited] (PDFBOX-955) Can't extract b/w images from PDF

    [ https://issues.apache.org/jira/browse/PDFBOX-955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13735061#comment-13735061 ] 

Tilman Hausherr edited comment on PDFBOX-955 at 8/9/13 5:53 PM:
----------------------------------------------------------------

PDF files with G4 (CCITT Group 4) images are blank again. This can be reproduced with the file d0000040.pdf. The reason seems to be that the pixels of the embedded TIFF files are inverted and then drawn onto a white image, so we get white on white, i.e. nothing. I "prove" my point with this change in pdfbox\pdfviewer\PageDrawer.java (this is not a fix, but it will hopefully give a hint):

    public void drawImage(Image awtImage, AffineTransform at)
    {
        graphics.setComposite(getGraphicsState().getStrokeJavaComposite());
        graphics.setClip(getGraphicsState().getCurrentClippingPath());
        
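        // Diagnostic only, not a fix: these two lines are mine. They paint a
        // black background first so that white image pixels become visible.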
        graphics.setColor(Color.BLACK);
        graphics.fillRect(0, 0, 5000, 5000);
        
        graphics.drawImage(awtImage, at, null);
    }

Now the rendered file is no longer entirely white; it is white on black. I suspect that the problem is somehow related to transparent backgrounds / pixels.
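
For anyone who wants to see the suspected effect in isolation, here is a minimal sketch that uses only java.awt and no PDFBox code. It assumes (this is only my suspicion, not something confirmed) that the decoded image ends up with white "ink" on transparent pixels. The class and method names are made up for illustration.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

final class WhiteOnWhiteDemo {

    /** Hypothetical stand-in for the decoded image: an opaque white diagonal on transparent pixels. */
    static BufferedImage makeWhiteOnTransparent(int size) {
        // A new TYPE_INT_ARGB image starts out fully transparent.
        BufferedImage img = new BufferedImage(size, size, BufferedImage.TYPE_INT_ARGB);
        for (int i = 0; i < size; i++) {
            img.setRGB(i, i, 0xFFFFFFFF); // opaque white
        }
        return img;
    }

    /** Fills a page with the given background color, then draws the image on top of it. */
    static BufferedImage renderOn(Color background, BufferedImage image) {
        BufferedImage page = new BufferedImage(
                image.getWidth(), image.getHeight(), BufferedImage.TYPE_INT_RGB);
        Graphics2D g = page.createGraphics();
        try {
            g.setColor(background);
            g.fillRect(0, 0, page.getWidth(), page.getHeight());
            g.drawImage(image, 0, 0, null);
        } finally {
            g.dispose();
        }
        return page;
    }

    /** Counts the pixels whose ARGB value equals the given one. */
    static int countPixels(BufferedImage img, int argb) {
        int count = 0;
        for (int y = 0; y < img.getHeight(); y++) {
            for (int x = 0; x < img.getWidth(); x++) {
                if (img.getRGB(x, y) == argb) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        BufferedImage glyph = makeWhiteOnTransparent(8);
        // On a white page the white diagonal cannot be told apart from the background.
        System.out.println(countPixels(renderOn(Color.WHITE, glyph), 0xFFFFFFFF)); // 64
        // On a black page only the 8 diagonal pixels are white.
        System.out.println(countPixels(renderOn(Color.BLACK, glyph), 0xFFFFFFFF)); // 8
    }
}
```

If the real decoded image behaves like this stand-in, it would explain why the page looks blank on the default white background and shows white on black with my two extra lines.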
                
> Can't extract b/w images from PDF
> ---------------------------------
>
>                 Key: PDFBOX-955
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-955
>             Project: PDFBox
>          Issue Type: Improvement
>    Affects Versions: 1.4.0
>         Environment: Windows XP prof, Java 1.6.0_22, Netbeans 6.9.1
>            Reporter: Tilman Hausherr
>            Assignee: Andreas Lehmkühler
>            Priority: Minor
>              Labels: extract
>             Fix For: 1.6.0
>
>         Attachments: d0000040-01.png, d0000040.pdf, ExtractImages.java, PDFBOX955-d00000401.png, PDFBOX955-photo1.png, photo.jpg, photo.pdf
>
>
> I wrote a test application using org.apache.pdfbox.ExtractImages to... extract images as PNG. (This is the start of something bigger, which involves compiling statistics about the content of over a million pages within PDF files.) However, every image I get is either all black or all white when I test on our own PDF files. I did get correct images from a file that had color images. To extract, I tried page.convertToImage() followed by ImageIO.write(), and I also tried PDFImageWriter; neither worked for b/w images.
> The sample PDF is not confidential; it does give a warning "getRGBImage returned NULL", but other PDFs that don't give the warning (but are confidential) also fail.
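
When running over many pages, the "all black or all white" symptom described above can be detected automatically rather than by eye. The following is a small sketch using only the JDK; the class and method names are hypothetical, and how the BufferedImage is obtained (for example from page.convertToImage()) is left out.

```java
import java.awt.image.BufferedImage;

final class BlankImageCheck {

    /** Returns true when every pixel has the same ARGB value as the top-left pixel. */
    static boolean isSingleColor(BufferedImage img) {
        int first = img.getRGB(0, 0);
        for (int y = 0; y < img.getHeight(); y++) {
            for (int x = 0; x < img.getWidth(); x++) {
                if (img.getRGB(x, y) != first) {
                    return false; // found a differing pixel, so the image is not blank
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // A fresh 1-bit image is entirely black.
        BufferedImage img = new BufferedImage(4, 4, BufferedImage.TYPE_BYTE_BINARY);
        System.out.println(isSingleColor(img)); // true
        img.setRGB(1, 1, 0xFFFFFFFF);           // set one white pixel
        System.out.println(isSingleColor(img)); // false
    }
}
```

A check like this only flags suspicious output; it says nothing about why the image came out uniform.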
