You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pdfbox.apache.org by "Andreas Lehmkühler (Closed JIRA)" <ji...@apache.org> on 2012/02/05 19:49:53 UTC
[jira] [Closed] (PDFBOX-1202) org.apache.pdfbox.filter.FlateFilter decode SEVERE: Stop reading corrupt stream

     [ https://issues.apache.org/jira/browse/PDFBOX-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler closed PDFBOX-1202.
--------------------------------------

    Resolution: Not A Problem
      Assignee: Andreas Lehmkühler
    
> org.apache.pdfbox.filter.FlateFilter decode SEVERE: Stop reading corrupt stream
> -------------------------------------------------------------------------------
>
>                 Key: PDFBOX-1202
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1202
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.6.0
>         Environment: Mac OS X 10.7.2
>            Reporter: Ilija Pavlic
>            Assignee: Andreas Lehmkühler
>            Priority: Critical
>         Attachments: IATAUnitedStates.pdf
>
>
> Error "org.apache.pdfbox.filter.FlateFilter decode SEVERE: Stop reading corrupt stream" thrown when extracting text.
> The error is thrown at:
> - page 397 if the page loop starts at zero -- for (int i = 0; i < allPages.size(); i++)
> - page 790 if the loop starts at 395 (that would make it approx. 397 pages from the beggining of the loop)
> - page 848 if the loop starts at 450 (that would make it aprox. 397 pages from the beggining of the loop)
> The error is not thrown if:
> - the loop starts at page 452 or later
> - the loop starts at 0 and ends before 396
> - the loop starts at 200 and ends before 595
> Therefore I suspect that a loop spanning more than 396 pages will throw an error. Is that an indication of a memory leak of some sort?
> Full code is below. Note that the result is the same when instantiating a single PDFTextStripperByArea outside the page loop and invoking resetEngine() on the stripper inside the page loop.
> import java.awt.geom.Rectangle2D;
> import java.io.IOException;
> import java.util.List;
> import org.apache.pdfbox.exceptions.COSVisitorException;
> import org.apache.pdfbox.exceptions.CryptographyException;
> import org.apache.pdfbox.exceptions.InvalidPasswordException;
> import org.apache.pdfbox.pdmodel.PDDocument;
> import org.apache.pdfbox.pdmodel.PDPage;
> import org.apache.pdfbox.util.PDFTextStripperByArea;
> public class Main {
> 	public static void main(String[] args) throws IOException,
> 			COSVisitorException, CryptographyException {
> 		
> 		PDDocument document = null;
> 		try {
> 			document = PDDocument.load("/Users/ilijapavlic/Desktop/IATAUnitedStates.pdf");
> 			if (document.isEncrypted()) {
> 				try {
> 					document.decrypt("");
> 				} catch (InvalidPasswordException e) {
> 					System.err.println("Error: Document is encrypted with a password.");
> 					System.exit(1);
> 				}
> 			}
> 			float x = 55f;
> 			float y = 40f;
> 			float width = 168.5f;
> 			float height = 689f;
> 			float evenOffset = -10f;
> 			List allPages = document.getDocumentCatalog().getAllPages();
> 			for (int i = 0; i < allPages.size(); i++) {
> 				System.out.println("Page " + i);
> 				PDPage page = (PDPage) allPages.get(i);
> 				PDFTextStripperByArea stripper = new PDFTextStripperByArea();
> 				stripper.setSortByPosition(true);
> 				for (int j = 0; j < 3; j++)
> 				{
> 					if (i % 2 == 0) {
> 						Rectangle2D.Float region = new Rectangle2D.Float(x, y, width*3, height);
> 						stripper.addRegion("region", region);
> 					}
> 					else {
> 						Rectangle2D.Float region = new Rectangle2D.Float(x + evenOffset, y, width*3, height);
> 						stripper.addRegion("region", region);
> 					}
> 				}
> 				stripper.extractRegions(page);
> 				for (String regionName : stripper.getRegions())
> 				{
> 					stripper.getTextForRegion(regionName);
> 				}
> 			}
> 		}
> 		
> 		catch(Exception e) {
> 			e.printStackTrace();
> 		}
> 		finally {
> 			if (document != null) {
> 				document.close();
> 			}
> 		}
> 	}
> }

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira