Posted to dev@pdfbox.apache.org by "Ian Kaplan (JIRA)" <ji...@apache.org> on 2009/12/03 00:27:20 UTC
[jira] Updated: (PDFBOX-574) PDFBox image extraction fails with an ArrayOutOfBoundsException in PDPixelMap.getRGBImage()

     [ https://issues.apache.org/jira/browse/PDFBOX-574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ian Kaplan updated PDFBOX-574:
------------------------------

    Description: 

The project that I'm working on has been using PDFBox for both text extraction and image extraction from PDF documents.  We wrote a class, PDFImageStripper, which extends PDFStreamEngine:

{code}
public class PDFImageStripper extends PDFStreamEngine 
{code}

{code:Java}
    public List<ExtractedImage> getImages(PDDocument document, String documentFilename, File targetDirectory) throws IOException {
        resetEngine();

        this.document = document;
        this.documentFilename = documentFilename;
        this.targetDirectory = targetDirectory;

        currentImageNumber = 1;

        images.clear();
        writeImages();
        return images;
    }
{code}

{code:Java}
    private void writeImages() throws IOException {
        List<PDPage> pages = (List<PDPage>) document.getDocumentCatalog().getAllPages();
        for (PDPage page : pages) {
            if (page != null) {
                processStream(page, page.findResources(), page.getContents().getStream());
            }
        }
    }
{code}

The call chain is shown below:

{noformat}
None.decode(byte[], byte[]) line: 57	
PDPixelMap.getRGBImage() line: 182	
PDPixelMap.write2OutputStream(OutputStream) line: 209	
PDPixelMap(PDXObjectImage).write2file(File) line: 142	
PDFImageStripper.saveImage(PDXObjectImage, String, File) line: 208	
PDFImageStripper.processOperator(PDFOperator, List) line: 155	
PDFImageStripper(PDFStreamEngine).processSubStream(PDPage, PDResources, COSStream) line: 229	
PDFImageStripper(PDFStreamEngine).processStream(PDPage, PDResources, COSStream) line: 188	
PDFImageStripper.writeImages() line: 113	
{noformat}

An ArrayIndexOutOfBoundsException is thrown in the decode method.  The decode method is nothing more than a wrapper for a call to System.arraycopy():

{code:Java}
    public void decode(byte[] src, byte[] dest)
    {
        System.arraycopy(src,0,dest,0,src.length);
    }
{code}

The problem is that the source array is larger than the destination array.  This is shown (from the Eclipse debugger) below:

{noformat}
src	byte[455112]  (id=171)	
dest	byte[435456]  (id=175)	
{noformat}
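
To make the mismatch concrete: with the sizes above, the source is 19656 bytes longer than the destination. The following is a minimal standalone sketch, not PDFBox code, that calls a copy of the decode wrapper with arrays of those sizes. The class name ArrayCopyOverflowDemo and the helper copyOverflows are made up for illustration.

```java
public class ArrayCopyOverflowDemo {

    // Mirrors the decode() wrapper quoted above: copies all of src into dest,
    // without checking that dest is large enough.
    static void decode(byte[] src, byte[] dest) {
        System.arraycopy(src, 0, dest, 0, src.length);
    }

    // Hypothetical helper: returns true if the unguarded copy fails
    // for arrays of the given lengths.
    static boolean copyOverflows(int srcLength, int destLength) {
        try {
            decode(new byte[srcLength], new byte[destLength]);
            return false;
        } catch (IndexOutOfBoundsException e) {
            // System.arraycopy rejects a copy that would run past the end of dest.
            return true;
        }
    }

    public static void main(String[] args) {
        int srcLength = 455112;  // length of src reported by the debugger
        int destLength = 435456; // length of dest reported by the debugger
        System.out.println("excess bytes: " + (srcLength - destLength));
        System.out.println("copy overflows: " + copyOverflows(srcLength, destLength));
    }
}
```

Any source array longer than the destination should fail the same way, so the exception does not depend on the image contents, only on the two lengths.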

The code that seems to be causing the problem is shown below.  The branch that this bug shows up on is the LZW_DECODE branch.  Note that the other branch limits the copy to the shorter of the two array lengths, so it has no size problem.

{code:Java}
            if( predictor < 10 ||
                filters == null || !(filters.contains( COSName.LZW_DECODE.getName()) ||
                         filters.contains( COSName.FLATE_DECODE.getName()) ) )
            {
                PredictorAlgorithm filter = PredictorAlgorithm.getFilter(predictor);
                filter.setWidth(width);
                filter.setHeight(height);
                filter.setBpp((bpc * 3) / 8);
                filter.decode(array, bufferData);
            }
            else
            {
                System.arraycopy( array, 0,bufferData, 0, 
                        (array.length<bufferData.length?array.length: bufferData.length) );
            }
{code}

One fix may be to simply change the code as follows (again, recall that the "decode" method is nothing but a wrapper for System.arraycopy()):

{code:Java}
            if( predictor < 10 ||
                filters == null || !(filters.contains( COSName.LZW_DECODE.getName()) ||
                         filters.contains( COSName.FLATE_DECODE.getName()) ) )
            {
                PredictorAlgorithm filter = PredictorAlgorithm.getFilter(predictor);
                filter.setWidth(width);
                filter.setHeight(height);
                filter.setBpp((bpc * 3) / 8);
            }
            System.arraycopy( array, 0,bufferData, 0, 
                    (array.length<bufferData.length?array.length: bufferData.length) );
{code}
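
The length-limited copy used in the proposal can also be written as a small helper. This is only a sketch under my own assumptions: the class BoundedCopyDemo and the method copyBounded are hypothetical names and do not exist in PDFBox. Math.min expresses the same limit as the ternary expression above.

```java
public class BoundedCopyDemo {

    // Hypothetical helper: copies as many bytes as fit into dest and
    // returns the number of bytes actually copied.
    static int copyBounded(byte[] src, byte[] dest) {
        int count = Math.min(src.length, dest.length);
        System.arraycopy(src, 0, dest, 0, count);
        return count;
    }

    public static void main(String[] args) {
        byte[] src = new byte[455112];  // source length from the debugger output
        byte[] dest = new byte[435456]; // destination length from the debugger output
        int copied = copyBounded(src, dest);
        System.out.println("copied " + copied + " of " + src.length + " bytes");
    }
}
```

Limiting the length avoids the exception, but it silently drops the trailing bytes of the source. Whether the extracted image is still correct probably depends on why the decoded data is longer than the buffer in the first place, which I have not determined.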

If JIRA allows me to attach a file that reproduces this problem, I will do so.





> PDFBox image extraction fails with an ArrayOutOfBoundsException in PDPixelMap.getRGBImage()
> -------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-574
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-574
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Utilities
>    Affects Versions: 0.8.0-incubator
>         Environment: Java
>            Reporter: Ian Kaplan
>
