Posted to dev@pdfbox.apache.org by "Ian Kaplan (JIRA)" <ji...@apache.org> on 2009/12/03 00:27:20 UTC
[jira] Updated: (PDFBOX-574) PDFBox image extraction fails with an
ArrayOutOfBoundsException in PDPixelMap.getRGBImage()
[ https://issues.apache.org/jira/browse/PDFBOX-574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ian Kaplan updated PDFBOX-574:
------------------------------
Description:
The project that I'm working on has been using PDFBox for both text extraction and image extraction from PDF documents. We wrote a class, PDFImageStripper, which extends PDFStreamEngine:
{code}
public class PDFImageStripper extends PDFStreamEngine
{code}
{code:Java}
public List<ExtractedImage> getImages(PDDocument document, String documentFilename, File targetDirectory) throws IOException {
    resetEngine();

    this.document = document;
    this.documentFilename = documentFilename;
    this.targetDirectory = targetDirectory;

    currentImageNumber = 1;

    images.clear();
    writeImages();
    return images;
}
{code}
{code:Java}
private void writeImages() throws IOException {
    List<PDPage> pages = (List<PDPage>) document.getDocumentCatalog().getAllPages();
    for (PDPage page : pages) {
        if (page != null) {
            processStream(page, page.findResources(), page.getContents().getStream());
        }
    }
}
{code}
The call chain is shown below:
{noformat}
None.decode(byte[], byte[]) line: 57
PDPixelMap.getRGBImage() line: 182
PDPixelMap.write2OutputStream(OutputStream) line: 209
PDPixelMap(PDXObjectImage).write2file(File) line: 142
PDFImageStripper.saveImage(PDXObjectImage, String, File) line: 208
PDFImageStripper.processOperator(PDFOperator, List) line: 155
PDFImageStripper(PDFStreamEngine).processSubStream(PDPage, PDResources, COSStream) line: 229
PDFImageStripper(PDFStreamEngine).processStream(PDPage, PDResources, COSStream) line: 188
PDFImageStripper.writeImages() line: 113
{noformat}
An ArrayIndexOutOfBoundsException is thrown in the decode method. The decode method is nothing more than a wrapper for a call to System.arraycopy():
{code:Java}
public void decode(byte[] src, byte[] dest)
{
    System.arraycopy(src,0,dest,0,src.length);
}
{code}
The problem is that the source array is larger than the destination array. This is shown (from the Eclipse debugger) below:
{noformat}
src byte[455112] (id=171)
dest byte[435456] (id=175)
{noformat}
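The mismatch can be reproduced outside PDFBox. The following is a minimal, self-contained sketch, not PDFBox code: the class name ArrayCopyOverrun and the helper overruns() are made up for illustration. It reuses the body of decode() quoted above with arrays of the sizes reported by the debugger.

```java
public class ArrayCopyOverrun {

    // Same body as the decode() wrapper quoted above: always copies all of src.
    static void decode(byte[] src, byte[] dest) {
        System.arraycopy(src, 0, dest, 0, src.length);
    }

    // Hypothetical helper: returns true if decode() throws for these array sizes.
    static boolean overruns(int srcLength, int destLength) {
        try {
            decode(new byte[srcLength], new byte[destLength]);
            return false;
        } catch (IndexOutOfBoundsException e) {
            // System.arraycopy is documented to throw IndexOutOfBoundsException;
            // ArrayIndexOutOfBoundsException is a subclass of it.
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(overruns(455112, 435456)); // sizes from the debugger
        System.out.println(overruns(435456, 435456)); // equal sizes
    }
}
```

With the debugger's sizes the copy fails because src.length exceeds dest.length by 19656 bytes; with equal sizes it succeeds.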
The code that seems to be causing the problem is shown below. The bug shows up with an LZW_DECODE filter, on the branch that calls filter.decode(). Note that the other branch limits the copy to the length of the shorter array, so it has no size problem.
{code:Java}
if( predictor < 10 ||
    filters == null || !(filters.contains( COSName.LZW_DECODE.getName()) ||
        filters.contains( COSName.FLATE_DECODE.getName()) ) )
{
    PredictorAlgorithm filter = PredictorAlgorithm.getFilter(predictor);
    filter.setWidth(width);
    filter.setHeight(height);
    filter.setBpp((bpc * 3) / 8);
    filter.decode(array, bufferData);
}
else
{
    System.arraycopy( array, 0,bufferData, 0,
        (array.length<bufferData.length?array.length: bufferData.length) );
}
{code}
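To make it easier to see which inputs reach the unguarded filter.decode() call, the condition can be isolated as a small predicate. This is only a sketch: the class and method names are hypothetical, and it assumes that COSName.LZW_DECODE.getName() and COSName.FLATE_DECODE.getName() return the PDF filter names "LZWDecode" and "FlateDecode".

```java
import java.util.Arrays;
import java.util.List;

public class BranchCheck {

    // Mirrors the quoted condition; true means the filter.decode() branch runs.
    // The string literals are assumed values of the COSName constants.
    static boolean takesDecodeBranch(int predictor, List<String> filters) {
        return predictor < 10
                || filters == null
                || !(filters.contains("LZWDecode") || filters.contains("FlateDecode"));
    }

    public static void main(String[] args) {
        List<String> lzw = Arrays.asList("LZWDecode");
        System.out.println(takesDecodeBranch(1, lzw));   // predictor below 10
        System.out.println(takesDecodeBranch(15, lzw));  // predictor 10 or above with LZW
        System.out.println(takesDecodeBranch(15, null)); // no filter list
    }
}
```

Under these assumptions, only a predictor of 10 or more combined with an LZW or Flate filter reaches the length-checked copy; every other combination goes through filter.decode().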
One fix may be to simply change the code as follows (again, recall that the "decode" method is nothing but a wrapper for System.arraycopy()):
{code:Java}
if( predictor < 10 ||
    filters == null || !(filters.contains( COSName.LZW_DECODE.getName()) ||
        filters.contains( COSName.FLATE_DECODE.getName()) ) )
{
    PredictorAlgorithm filter = PredictorAlgorithm.getFilter(predictor);
    filter.setWidth(width);
    filter.setHeight(height);
    filter.setBpp((bpc * 3) / 8);
}
System.arraycopy( array, 0,bufferData, 0,
    (array.length<bufferData.length?array.length: bufferData.length) );
{code}
{code}
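The length check in that copy can also be written with Math.min. The sketch below is a hypothetical, self-contained helper (copyClamped is not a PDFBox method) that shows the effect with the array sizes from the debugger. Note that clamping silently drops the trailing source bytes, and that in the proposed change above the configured filter is no longer invoked, so this only stands in for a predictor that does nothing but copy.

```java
public class ClampedCopy {

    // Hypothetical helper: copies at most dest.length bytes and returns the count.
    static int copyClamped(byte[] src, byte[] dest) {
        int length = Math.min(src.length, dest.length);
        System.arraycopy(src, 0, dest, 0, length);
        return length;
    }

    public static void main(String[] args) {
        byte[] src = new byte[455112];   // source size from the debugger
        byte[] dest = new byte[435456];  // destination size from the debugger
        int copied = copyClamped(src, dest);
        System.out.println(copied + " bytes copied, " + (src.length - copied) + " dropped");
    }
}
```

For these sizes, 435456 bytes are copied and the remaining 19656 source bytes are ignored. Whether ignoring them yields a correct image depends on why the decoded stream is longer than the pixel buffer, which this sketch does not address.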
If Jira allows me to attach a file that causes this problem, I will do so.
> PDFBox image extraction fails with an ArrayOutOfBoundsException in PDPixelMap.getRGBImage()
> -------------------------------------------------------------------------------------------
>
> Key: PDFBOX-574
> URL: https://issues.apache.org/jira/browse/PDFBOX-574
> Project: PDFBox
> Issue Type: Bug
> Components: Utilities
> Affects Versions: 0.8.0-incubator
> Environment: Java
> Reporter: Ian Kaplan
>
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.