Posted to dev@pdfbox.apache.org by "Matt England (JIRA)" <ji...@apache.org> on 2011/03/15 20:04:29 UTC

[jira] Created: (PDFBOX-981) PDColorspaceFactory does not recognize colorspace DeviceGray (patch included herein)

PDColorspaceFactory does not recognize colorspace DeviceGray (patch included herein)
------------------------------------------------------------------------------------

                 Key: PDFBOX-981
                 URL: https://issues.apache.org/jira/browse/PDFBOX-981
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 1.5.0
            Reporter: Matt England


I was trying to use PDFTextStripper to extract text from a large corpus of PDF files. In some of them, the method:

org.apache.pdfbox.pdmodel.graphics.color.PDColorSpaceFactory.createColorSpace( COSBase colorSpace, Map colorSpaces )

does not handle the case where the colorSpace argument is a COSArray whose first element is COSName.DEVICEGRAY. With that case added, the files that failed with the stock pdfbox-1.5.0 parse successfully. Below is a diff of my patched PDColorSpaceFactory, which handles the case where the colorspace name is DeviceGray. Incidentally, it occurs to me that another (possibly better) approach is to call through to createColorSpace(String) when no other case matches.

% diff PDColorSpaceFactory.java.orig PDColorSpaceFactory.java
94a95,97
> else if ( type.getName().equals( PDDeviceGray.NAME) ) {
>     retval = new PDDeviceGray();
> }
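
The alternative mentioned above (calling through to the name-based lookup when no array case matches) could look roughly like the sketch below. It is a stand-alone illustration, not PDFBox code: plain strings stand in for COSName and PDColorSpace objects, and the class and method names (ColorSpaceDispatchSketch, createFromName, createFromArray) are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Stand-alone sketch: plain strings stand in for COSName / PDColorSpace objects.
class ColorSpaceDispatchSketch {

    // Hypothetical stand-in for createColorSpace(String): resolves simple device names.
    static String createFromName(String name) {
        switch (name) {
            case "DeviceGray":
            case "DeviceRGB":
            case "DeviceCMYK":
                return name;
            default:
                throw new IllegalArgumentException("Unknown color space: " + name);
        }
    }

    // Hypothetical stand-in for the COSArray branch of createColorSpace(COSBase, Map).
    static String createFromArray(List<String> array) {
        String type = array.get(0);
        if (type.equals("ICCBased")) {
            // Color spaces that only exist in array form keep their dedicated branches.
            return "ICCBased";
        } else {
            // Fallback: a one-element array such as [DeviceGray] is resolved by name.
            return createFromName(type);
        }
    }

    public static void main(String[] args) {
        System.out.println(createFromArray(Arrays.asList("DeviceGray")));
    }
}
```

With a fallback of this kind, a device color space wrapped in an array would resolve without a dedicated branch for each name, while unknown names would still fail loudly.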

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Updated: (PDFBOX-981) PDColorspaceFactory does not recognize colorspace DeviceGray (patch included herein)

Posted by "Matt England (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matt England updated PDFBOX-981:
--------------------------------

    Attachment: PDColorSpaceFactory.java.diff

Patch for PDColorSpaceFactory

> PDColorspaceFactory does not recognize colorspace DeviceGray (patch included herein)
> ------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-981
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-981
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.5.0
>            Reporter: Matt England
>              Labels: pdfbox
>         Attachments: PDColorSpaceFactory.java.diff, example.pdf

[jira] Updated: (PDFBOX-981) PDColorspaceFactory does not recognize colorspace DeviceGray (patch included herein)

Posted by "Matt England (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matt England updated PDFBOX-981:
--------------------------------

    Attachment: example.pdf

Example PDF file that fails with the stock 1.5.0 but passes with the included patch, when using PDFTextStripper like so:

(new PDFTextStripper()).getText(PDDocument.load(new FileInputStream("example.pdf")))

[jira] Resolved: (PDFBOX-981) PDColorspaceFactory does not recognize colorspace DeviceGray (patch included herein)

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler resolved PDFBOX-981.
---------------------------------------

       Resolution: Fixed
    Fix Version/s: 1.6.0
         Assignee: Andreas Lehmkühler

I added the proposed patch in revision 1083488. 

Thanks for the contribution!
