Posted to dev@pdfbox.apache.org by "Ilija Pavlic (Created) (JIRA)" <ji...@apache.org> on 2012/01/05 02:12:40 UTC

[jira] [Created] (PDFBOX-1202) org.apache.pdfbox.filter.FlateFilter decode SEVERE: Stop reading corrupt stream

org.apache.pdfbox.filter.FlateFilter decode SEVERE: Stop reading corrupt stream
-------------------------------------------------------------------------------

                 Key: PDFBOX-1202
                 URL: https://issues.apache.org/jira/browse/PDFBOX-1202
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 1.6.0
            Reporter: Ilija Pavlic
            Priority: Critical


Error "org.apache.pdfbox.filter.FlateFilter decode SEVERE: Stop reading corrupt stream" thrown when extracting text.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (PDFBOX-1202) org.apache.pdfbox.filter.FlateFilter decode SEVERE: Stop reading corrupt stream

Posted by "Timo Boehme (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13182431#comment-13182431 ] 

Timo Boehme commented on PDFBOX-1202:
-------------------------------------

I could run your program with the provided file without a problem (all 851 pages were processed). I also think that you are being hit by an OutOfMemoryError. Can you compile your own PDFBox from source and modify org.apache.pdfbox.filter.FlateFilter to print a stack trace of the exception?
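
As a rough illustration of the kind of change requested here, the sketch below catches an error and prints its full stack trace instead of only a one-line log message. It does not reproduce the actual FlateFilter source; the class name StackTraceSketch and the helper describe() are hypothetical, and the OutOfMemoryError is thrown by hand as a stand-in for the decode step.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

public class StackTraceSketch {

    // Hypothetical helper: renders the full stack trace of a Throwable as text.
    static String describe(Throwable t) {
        StringWriter buffer = new StringWriter();
        t.printStackTrace(new PrintWriter(buffer, true));
        return buffer.toString();
    }

    public static void main(String[] args) {
        try {
            // Stand-in for the decode step. The error is thrown directly,
            // so no memory is actually exhausted here.
            throw new OutOfMemoryError("Java heap space");
        } catch (OutOfMemoryError e) {
            // Print the whole trace so the real cause is visible in the output.
            System.err.println("Decode failed: " + describe(e));
        }
    }
}
```

With the trace printed, the output shows which call was allocating memory when the heap ran out, rather than only reporting a corrupt stream.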
                
> org.apache.pdfbox.filter.FlateFilter decode SEVERE: Stop reading corrupt stream
> -------------------------------------------------------------------------------
>
>                 Key: PDFBOX-1202
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1202
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.6.0
>         Environment: Mac OS X 10.7.2
>            Reporter: Ilija Pavlic
>            Priority: Critical
>         Attachments: IATAUnitedStates.pdf
>
>
> Error "org.apache.pdfbox.filter.FlateFilter decode SEVERE: Stop reading corrupt stream" thrown when extracting text.
> The error is thrown at:
> - page 397 if the page loop starts at zero -- for (int i = 0; i < allPages.size(); i++)
> - page 790 if the loop starts at 395 (that would make it approx. 397 pages from the beginning of the loop)
> - page 848 if the loop starts at 450 (that would make it approx. 397 pages from the beginning of the loop)
> The error is not thrown if:
> - the loop starts at page 452 or later
> - the loop starts at 0 and ends before 396
> - the loop starts at 200 and ends before 595
> Therefore I suspect that a loop spanning more than 396 pages will throw an error. Is that an indication of a memory leak of some sort?
> Full code is below. Note that the result is the same when instantiating a single PDFTextStripperByArea outside the page loop and invoking resetEngine() on the stripper inside the page loop.
> import java.awt.geom.Rectangle2D;
> import java.io.IOException;
> import java.util.List;
> import org.apache.pdfbox.exceptions.COSVisitorException;
> import org.apache.pdfbox.exceptions.CryptographyException;
> import org.apache.pdfbox.exceptions.InvalidPasswordException;
> import org.apache.pdfbox.pdmodel.PDDocument;
> import org.apache.pdfbox.pdmodel.PDPage;
> import org.apache.pdfbox.util.PDFTextStripperByArea;
> public class Main {
> 	public static void main(String[] args) throws IOException,
> 			COSVisitorException, CryptographyException {
> 		
> 		PDDocument document = null;
> 		try {
> 			document = PDDocument.load("/Users/ilijapavlic/Desktop/IATAUnitedStates.pdf");
> 			if (document.isEncrypted()) {
> 				try {
> 					document.decrypt("");
> 				} catch (InvalidPasswordException e) {
> 					System.err.println("Error: Document is encrypted with a password.");
> 					System.exit(1);
> 				}
> 			}
> 			float x = 55f;
> 			float y = 40f;
> 			float width = 168.5f;
> 			float height = 689f;
> 			float evenOffset = -10f;
> 			List allPages = document.getDocumentCatalog().getAllPages();
> 			for (int i = 0; i < allPages.size(); i++) {
> 				System.out.println("Page " + i);
> 				PDPage page = (PDPage) allPages.get(i);
> 				PDFTextStripperByArea stripper = new PDFTextStripperByArea();
> 				stripper.setSortByPosition(true);
> 				for (int j = 0; j < 3; j++)
> 				{
> 					if (i % 2 == 0) {
> 						Rectangle2D.Float region = new Rectangle2D.Float(x, y, width*3, height);
> 						stripper.addRegion("region", region);
> 					}
> 					else {
> 						Rectangle2D.Float region = new Rectangle2D.Float(x + evenOffset, y, width*3, height);
> 						stripper.addRegion("region", region);
> 					}
> 				}
> 				stripper.extractRegions(page);
> 				for (String regionName : stripper.getRegions())
> 				{
> 					stripper.getTextForRegion(regionName);
> 				}
> 			}
> 		}
> 		
> 		catch(Exception e) {
> 			e.printStackTrace();
> 		}
> 		finally {
> 			if (document != null) {
> 				document.close();
> 			}
> 		}
> 	}
> }


        

[jira] [Commented] (PDFBOX-1202) org.apache.pdfbox.filter.FlateFilter decode SEVERE: Stop reading corrupt stream

Posted by "Ilija Pavlic (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13180286#comment-13180286 ] 

Ilija Pavlic commented on PDFBOX-1202:
--------------------------------------

I have edited the issue to include information about the encryption. I have used the inserted snippet to decrypt the document. Either way, I was able to extract text from pages 1-396 without problems. Therefore I do not think the issue should be closed.
                
> org.apache.pdfbox.filter.FlateFilter decode SEVERE: Stop reading corrupt stream
> -------------------------------------------------------------------------------
>
>                 Key: PDFBOX-1202
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1202
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.6.0
>            Reporter: Ilija Pavlic
>            Priority: Critical
>         Attachments: IATAUnitedStates.pdf
>
>
> Error "org.apache.pdfbox.filter.FlateFilter decode SEVERE: Stop reading corrupt stream" thrown when extracting text.
> The document was loaded with the following snippet:
> document = PDDocument.load("C:/Users/ilija.pavlic/Downloads/TestInput.pdf");
> if (document.isEncrypted()) {
>     try {
>         document.decrypt("");
>     } catch (InvalidPasswordException e) {
>         System.err.println("Error: Document is encrypted with a password.");
>         System.exit(1);
>     }
> }


        

[jira] [Updated] (PDFBOX-1202) org.apache.pdfbox.filter.FlateFilter decode SEVERE: Stop reading corrupt stream

Posted by "Ilija Pavlic (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ilija Pavlic updated PDFBOX-1202:
---------------------------------

    Description: 
Error "org.apache.pdfbox.filter.FlateFilter decode SEVERE: Stop reading corrupt stream" thrown when extracting text.

The error is thrown at:
- page 397 if the page loop starts at zero -- for (int i = 0; i < allPages.size(); i++)
- page 790 if the loop starts at 395 (that would make it approx. 397 pages from the beginning of the loop)
- page 848 if the loop starts at 450 (that would make it approx. 397 pages from the beginning of the loop)

The error is not thrown if:
- the loop starts at page 452 or later
- the loop starts at 0 and ends before 396
- the loop starts at 200 and ends before 595

Therefore I suspect that a loop spanning more than 396 pages will throw an error. Is that an indication of a memory leak of some sort?
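
The arithmetic behind the "approx. 397 pages" observation can be checked with a small sketch. It assumes that the reported page numbers are the loop index values printed by the program; the class name PageSpan and the method pagesBeforeFailure() are hypothetical and exist only for this illustration.

```java
public class PageSpan {

    // Number of pages processed before the failing page, given the loop
    // start index and the index at which the error appeared.
    static int pagesBeforeFailure(int startIndex, int failingIndex) {
        return failingIndex - startIndex;
    }

    public static void main(String[] args) {
        // {loop start, page where the error was reported}, taken from the list above.
        int[][] runs = { {0, 397}, {395, 790}, {450, 848} };
        for (int[] run : runs) {
            System.out.println("start " + run[0] + ", failed at " + run[1] + ": "
                    + pagesBeforeFailure(run[0], run[1]) + " pages processed first");
        }
    }
}
```

Under that assumption the three runs fail after 397, 395 and 398 pages, which fits a limit tied to the amount of work done rather than to one particular page.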

Here is the full code:
import java.awt.geom.Rectangle2D;
import java.io.IOException;
import java.util.List;

import org.apache.pdfbox.exceptions.COSVisitorException;
import org.apache.pdfbox.exceptions.CryptographyException;
import org.apache.pdfbox.exceptions.InvalidPasswordException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.util.PDFTextStripperByArea;

public class Main {

	public static void main(String[] args) throws IOException,
			COSVisitorException, CryptographyException {
		
		PDDocument document = null;
		try {
			document = PDDocument.load("/Users/ilijapavlic/Desktop/IATAUnitedStates.pdf");
			if (document.isEncrypted()) {
				try {
					document.decrypt("");
				} catch (InvalidPasswordException e) {
					System.err.println("Error: Document is encrypted with a password.");
					System.exit(1);
				}
			}

			float x = 55f;
			float y = 40f;
			float width = 168.5f;
			float height = 689f;
			float evenOffset = -10f;

			List allPages = document.getDocumentCatalog().getAllPages();

			for (int i = 0; i < allPages.size(); i++) {
				System.out.println("Page " + i);

				PDPage page = (PDPage) allPages.get(i);
				PDFTextStripperByArea stripper = new PDFTextStripperByArea();
				stripper.setSortByPosition(true);

				for (int j = 0; j < 3; j++)
				{
					if (i % 2 == 0) {
						Rectangle2D.Float region = new Rectangle2D.Float(x, y, width*3, height);
						stripper.addRegion("region", region);

					}
					else {
						Rectangle2D.Float region = new Rectangle2D.Float(x + evenOffset, y, width*3, height);
						stripper.addRegion("region", region);
					}
				}

				stripper.extractRegions(page);

				for (String regionName : stripper.getRegions())
				{
					stripper.getTextForRegion(regionName);
				}
			}
		}
		
		catch(Exception e) {
			e.printStackTrace();
		}

		finally {
			if (document != null) {
				document.close();
			}
		}
	}
}


  was:
Error "org.apache.pdfbox.filter.FlateFilter decode SEVERE: Stop reading corrupt stream" thrown when extracting text.

The error is thrown at:
- page 397 if the page loop starts at zero -- for (int i = 0; i < allPages.size(); i++)
- page 790 if the loop starts at 395 (that would make it approx. 397 pages from the beginning of the loop)
- page 848 if the loop starts at 450 (that would make it approx. 397 pages from the beginning of the loop)

The error is not thrown if:
- the loop starts at page 452 or later
- the loop starts at 0 and ends before 396
- the loop starts at 200 and ends before 595

Therefore I suspect that a loop spanning more than 396 pages will throw an error. Is that an indication of a memory leak of some sort?

Here is the full code:
package transhotel.pdf.iata;

import java.awt.geom.Rectangle2D;
import java.io.IOException;
import java.util.List;

import org.apache.pdfbox.exceptions.COSVisitorException;
import org.apache.pdfbox.exceptions.CryptographyException;
import org.apache.pdfbox.exceptions.InvalidPasswordException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.util.PDFTextStripperByArea;

public class Main {

	public static void main(String[] args) throws IOException,
			COSVisitorException, CryptographyException {
		
		PDDocument document = null;
		try {
			document = PDDocument.load("/Users/ilijapavlic/Desktop/IATAUnitedStates.pdf");
			if (document.isEncrypted()) {
				try {
					document.decrypt("");
				} catch (InvalidPasswordException e) {
					System.err.println("Error: Document is encrypted with a password.");
					System.exit(1);
				}
			}

			float x = 55f;
			float y = 40f;
			float width = 168.5f;
			float height = 689f;
			float evenOffset = -10f;

			List allPages = document.getDocumentCatalog().getAllPages();

			for (int i = 0; i < allPages.size(); i++) {
				System.out.println("Page " + i);

				PDPage page = (PDPage) allPages.get(i);
				PDFTextStripperByArea stripper = new PDFTextStripperByArea();
				stripper.setSortByPosition(true);

				for (int j = 0; j < 3; j++)
				{
					if (i % 2 == 0) {
						Rectangle2D.Float region = new Rectangle2D.Float(x, y, width*3, height);
						stripper.addRegion("region", region);

					}
					else {
						Rectangle2D.Float region = new Rectangle2D.Float(x + evenOffset, y, width*3, height);
						stripper.addRegion("region", region);
					}
				}

				stripper.extractRegions(page);

				for (String regionName : stripper.getRegions())
				{
					stripper.getTextForRegion(regionName);
				}
			}
		}
		
		catch(Exception e) {
			e.printStackTrace();
		}

		finally {
			if (document != null) {
				document.close();
			}
		}
	}
}


    

        

[jira] [Commented] (PDFBOX-1202) org.apache.pdfbox.filter.FlateFilter decode SEVERE: Stop reading corrupt stream

Posted by "Ilija Pavlic (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183733#comment-13183733 ] 

Ilija Pavlic commented on PDFBOX-1202:
--------------------------------------

Here's the stack trace from the latest pdfbox built from svn.

11.01.2012. 01:10:42 org.apache.pdfbox.filter.FlateFilter decode
SEVERE: FlateFilter: stop reading corrupt stream due to an OutOfMemoryError
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
	at org.apache.pdfbox.io.RandomAccessBuffer.expandBuffer(RandomAccessBuffer.java:151)
	at org.apache.pdfbox.io.RandomAccessBuffer.write(RandomAccessBuffer.java:131)
	at org.apache.pdfbox.io.RandomAccessFileOutputStream.write(RandomAccessFileOutputStream.java:108)
	at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:117)
	at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:279)
	at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:221)
	at org.apache.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:156)
	at org.apache.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:105)
	at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:262)
	at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:251)
	at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
	at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:448)
	at org.apache.pdfbox.util.PDFTextStripperByArea.extractRegions(PDFTextStripperByArea.java:153)
	at pdf.test.Main.main(Main.java:61)

You were right about the java.lang.OutOfMemoryError. What does that mean? Somewhat amusingly, a larger document of a similar type (947 pages long) can be read without the exception being thrown.
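
"Java heap space" in the trace means the JVM could not allocate more memory within its configured maximum heap. A minimal sketch for inspecting that limit is shown below; the class name HeapInfo and the helper toMegabytes() are hypothetical, while the Runtime methods are part of the standard Java API.

```java
public class HeapInfo {

    // Converts a byte count to whole megabytes.
    static long toMegabytes(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        Runtime runtime = Runtime.getRuntime();
        // maxMemory(): upper bound the JVM will try to use for the heap.
        System.out.println("max heap (MB):   " + toMegabytes(runtime.maxMemory()));
        // totalMemory(): heap currently reserved by the JVM.
        System.out.println("total heap (MB): " + toMegabytes(runtime.totalMemory()));
        // freeMemory(): unused part of the currently reserved heap.
        System.out.println("free heap (MB):  " + toMegabytes(runtime.freeMemory()));
    }
}
```

The maximum can be raised with the JVM option -Xmx, for example java -Xmx512m HeapInfo; the value 512m is only an arbitrary example, not a recommendation taken from this thread.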
                

        

[jira] [Commented] (PDFBOX-1202) org.apache.pdfbox.filter.FlateFilter decode SEVERE: Stop reading corrupt stream

Posted by "Ilija Pavlic (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13181683#comment-13181683 ] 

Ilija Pavlic commented on PDFBOX-1202:
--------------------------------------

I tried it now: the error is thrown in the same way when calling resetEngine() in the loop on a PDFTextStripperByArea instantiated outside the loop.

My original expectation was that the class would be garbage collected, and that there would be no harm in a separate instantiation for each page, as the PDFTextStripperByArea's extractRegions is called on a single page, as in "stripper.extractRegions(page)". In either case, the error is not prevented by using a single instance of PDFTextStripperByArea and resetting the stripper by invoking resetEngine() inside the loop.
                

        

[jira] [Closed] (PDFBOX-1202) org.apache.pdfbox.filter.FlateFilter decode SEVERE: Stop reading corrupt stream

Posted by "Andreas Lehmkühler (Closed JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler closed PDFBOX-1202.
--------------------------------------

    Resolution: Not A Problem
      Assignee: Andreas Lehmkühler
    

       

[jira] [Updated] (PDFBOX-1202) org.apache.pdfbox.filter.FlateFilter decode SEVERE: Stop reading corrupt stream

Posted by "Ilija Pavlic (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ilija Pavlic updated PDFBOX-1202:
---------------------------------

       Priority: Critical  (was: Minor)
    Description: 
Error "org.apache.pdfbox.filter.FlateFilter decode SEVERE: Stop reading corrupt stream" thrown when extracting text.

The error is thrown at:
- page 397 if the page loop starts at zero -- for (int i = 0; i < allPages.size(); i++)
- page 790 if the loop starts at 395 (that would make it approx. 397 pages from the beginning of the loop)
- page 848 if the loop starts at 450 (that would make it approx. 397 pages from the beginning of the loop)

The error is not thrown if:
- the loop starts at page 452 or later
- the loop starts at 0 and ends before 396
- the loop starts at 200 and ends before 595

Therefore I suspect that a loop spanning more than 396 pages will throw an error. Is that an indication of a memory leak of some sort?
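
The "approx. 397 pages" figure can be checked with a few lines of arithmetic on the three failing runs listed above. This is only a sketch: the class and method names are illustrative and are not part of PDFBox or of the reporter's program.

```java
// Worked arithmetic for the three failing runs reported in this issue.
final class FailureSpan {

    // Number of pages between the loop start and the page where the error appeared.
    static int span(int startPage, int failingPage) {
        return failingPage - startPage;
    }

    public static void main(String[] args) {
        // {loop start, page at which the error was reported}
        int[][] runs = { {0, 397}, {395, 790}, {450, 848} };
        for (int[] run : runs) {
            System.out.println("start " + run[0] + ", failed at " + run[1]
                    + ", span " + span(run[0], run[1]));
        }
    }
}
```

The three spans come out as 397, 395 and 398 pages, which is why the failure appears tied to the number of pages processed rather than to one specific page.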

Here is the full code:
package transhotel.pdf.iata;

import java.awt.geom.Rectangle2D;
import java.io.IOException;
import java.util.List;

import org.apache.pdfbox.exceptions.COSVisitorException;
import org.apache.pdfbox.exceptions.CryptographyException;
import org.apache.pdfbox.exceptions.InvalidPasswordException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.util.PDFTextStripperByArea;

public class Main {

	public static void main(String[] args) throws IOException,
			COSVisitorException, CryptographyException {
		
		PDDocument document = null;
		try {
			document = PDDocument.load("/Users/ilijapavlic/Desktop/IATAUnitedStates.pdf");
			if (document.isEncrypted()) {
				try {
					document.decrypt("");
				} catch (InvalidPasswordException e) {
					System.err.println("Error: Document is encrypted with a password.");
					System.exit(1);
				}
			}

			float x = 55f;
			float y = 40f;
			float width = 168.5f;
			float height = 689f;
			float evenOffset = -10f;

			List allPages = document.getDocumentCatalog().getAllPages();

			for (int i = 0; i < allPages.size(); i++) {
				System.out.println("Page " + i);

				PDPage page = (PDPage) allPages.get(i);
				PDFTextStripperByArea stripper = new PDFTextStripperByArea();
				stripper.setSortByPosition(true);

				for (int j = 0; j < 3; j++)
				{
					if (i % 2 == 0) {
						Rectangle2D.Float region = new Rectangle2D.Float(x, y, width*3, height);
						stripper.addRegion("region", region);

					}
					else {
						Rectangle2D.Float region = new Rectangle2D.Float(x + evenOffset, y, width*3, height);
						stripper.addRegion("region", region);
					}
				}

				stripper.extractRegions(page);

				for (String regionName : stripper.getRegions())
				{
					stripper.getTextForRegion(regionName);
				}
			}
		}
		
		catch(Exception e) {
			e.printStackTrace();
		}

		finally {
			if (document != null) {
				document.close();
			}
		}
	}
}


  was:
Error "org.apache.pdfbox.filter.FlateFilter decode SEVERE: Stop reading corrupt stream" thrown when extracting text.

The document was loaded with the following snippet:

document = PDDocument.load("C:/Users/ilija.pavlic/Downloads/TestInput.pdf");
    if (document.isEncrypted()) {
        try {
	    document.decrypt("");
	} catch (InvalidPasswordException e) {
	    System.err.println("Error: Document is encrypted with a password.");
	    System.exit(1);
	}
}

    Environment: Mac OS X 10.7.2

Updated to include newly obtained information. Full code included.
                


        

[jira] [Updated] (PDFBOX-1202) org.apache.pdfbox.filter.FlateFilter decode SEVERE: Stop reading corrupt stream

Posted by "Ilija Pavlic (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ilija Pavlic updated PDFBOX-1202:
---------------------------------

    Description: 
Error "org.apache.pdfbox.filter.FlateFilter decode SEVERE: Stop reading corrupt stream" thrown when extracting text.

The document was loaded with the following snippet:

document = PDDocument.load("C:/Users/ilija.pavlic/Downloads/TestInput.pdf");
    if (document.isEncrypted()) {
        try {
	    document.decrypt("");
	} catch (InvalidPasswordException e) {
	    System.err.println("Error: Document is encrypted with a password.");
	    System.exit(1);
	}
}

  was:
Error "org.apache.pdfbox.filter.FlateFilter decode SEVERE: Stop reading corrupt stream" thrown when extracting text.

The document was loaded with the following snipper:

document = PDDocument.load("C:/Users/ilija.pavlic/Downloads/TestInput.pdf");
    if (document.isEncrypted()) {
        try {
	    document.decrypt("");
	} catch (InvalidPasswordException e) {
	    System.err.println("Error: Document is encrypted with a password.");
	    System.exit(1);
	}
}

    


        

[jira] [Commented] (PDFBOX-1202) org.apache.pdfbox.filter.FlateFilter decode SEVERE: Stop reading corrupt stream

Posted by "Andreas Lehmkühler (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13181494#comment-13181494 ] 

Andreas Lehmkühler commented on PDFBOX-1202:
--------------------------------------------

Sounds like an OutOfMemoryError. Try reusing the PDFTextStripperByArea instance: instead of creating a new one for every page, call resetEngine().
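
One way to test the out-of-memory hypothesis without rebuilding PDFBox is to log heap usage on every iteration of the page loop. The following is a minimal sketch using only java.lang.Runtime; HeapProbe and its methods are illustrative names, not PDFBox API, and the three-iteration loop is a stand-in for the reporter's page loop.

```java
// Logs JVM heap usage so that growth across loop iterations becomes visible.
final class HeapProbe {

    // Heap currently in use, in bytes (allocated heap minus free heap).
    static long usedBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    // One log line for a given page index.
    static String report(int pageIndex) {
        long mb = 1024L * 1024L;
        long maxMb = Runtime.getRuntime().maxMemory() / mb;
        return "Page " + pageIndex + ": used " + (usedBytes() / mb)
                + " MB of max " + maxMb + " MB";
    }

    public static void main(String[] args) {
        // Stand-in for the page loop; in the reporter's program the call
        // would sit next to System.out.println("Page " + i).
        for (int i = 0; i < 3; i++) {
            System.out.println(report(i));
        }
    }
}
```

If the used figure climbs steadily toward the maximum as the loop approaches the failing page, that would support the memory explanation; rerunning with a larger heap (the JVM's -Xmx option) should then move the failing page.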
                


       

[jira] [Updated] (PDFBOX-1202) org.apache.pdfbox.filter.FlateFilter decode SEVERE: Stop reading corrupt stream

Posted by "Ilija Pavlic (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ilija Pavlic updated PDFBOX-1202:
---------------------------------

    Comment: was deleted

(was: I am also able to extract the text from the entire document without removing security, by just invoking decrypt in a minimal working example.

The difference between the MWE and the buggy code is that I have been appending to the PDPageContentStream, so perhaps that was the reason for the error. I don't have the code at hand right now; I'll look into it later in the day.)
    


        

[jira] [Commented] (PDFBOX-1202) org.apache.pdfbox.filter.FlateFilter decode SEVERE: Stop reading corrupt stream

Posted by "Timo Boehme (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13180310#comment-13180310 ] 

Timo Boehme commented on PDFBOX-1202:
-------------------------------------

Reading (displaying) vs. text extraction:
For an encrypted document, the creator can set flags that define what may be done with the document besides presenting it to the user; one of them is "Copying or otherwise extracting text and graphics from the document..." (PDF spec. 1.7, p. 121). There is no mechanism that prevents a program from ignoring these settings once it has decrypted the document, but the spec. states: "It is up to the implementors of PDF consumer applications to respect the intent of the document creator by restricting user access to an encrypted PDF file according to the permissions contained in the file."
Within PDFBox you can find this check, e.g., in o.a.p.ExtractText. If the check is removed, the text content is extracted without errors.
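
To illustrate how such a flag is typically tested: the permissions are stored as a 32-bit integer (the /P entry of the encryption dictionary), and in the PDF 1.7 table of user access permissions, bit 5 (counted from 1 at the low-order end) is the copy/extract flag. The sketch below checks that bit with plain integer arithmetic. The class name is illustrative, and the two sample values are common examples, not values read from the attached file.

```java
// Tests the "copy or otherwise extract text and graphics" permission bit.
final class ExtractPermission {

    // Bit 5, 1-based from the low-order bit, corresponds to mask 1 << 4 (16).
    static final int EXTRACT_MASK = 1 << 4;

    // True if the permission integer allows copying or extracting content.
    static boolean mayExtract(int permissions) {
        return (permissions & EXTRACT_MASK) != 0;
    }

    public static void main(String[] args) {
        // Sample permission values (assumptions for illustration only).
        int[] samples = { -3904, -44 };
        for (int p : samples) {
            System.out.println("P = " + p + " -> extraction permitted: " + mayExtract(p));
        }
    }
}
```

A consumer that honours the creator's intent would refuse extraction when the bit is clear, unless the document was opened with the owner password.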

In your case it seems that not all objects were decrypted, and therefore the stream parsing failed. Since ExtractText works, you are probably using a different (or your own) extraction method that accesses other objects. So it might be an error in PDFBox, which either fails to decrypt a specific stream or does not use the decrypted version of it, or it might depend on the kind of access in your routine.
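
Why an undecrypted stream would surface as a "corrupt stream" message from a Flate decoder can be shown with the JDK's own zlib classes. The sketch below compresses a small content-stream-like string, then scrambles the compressed bytes and tries to inflate both versions. Everything here is an assumption for illustration: the class is not PDFBox code, and the XOR scrambling merely stands in for bytes that are still encrypted (real PDF encryption uses RC4 or AES).

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Shows that bytes which are not valid zlib data are rejected by the inflater.
final class CorruptFlateDemo {

    // Compress input into zlib format.
    static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[256];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Stand-in for "still encrypted" data: flip every bit of every byte.
    static byte[] scramble(byte[] data) {
        byte[] copy = data.clone();
        for (int i = 0; i < copy.length; i++) {
            copy[i] ^= (byte) 0xFF;
        }
        return copy;
    }

    // True if the data inflates cleanly to the end of the stream.
    static boolean canInflate(byte[] data) {
        Inflater inflater = new Inflater();
        inflater.setInput(data);
        byte[] buffer = new byte[256];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && !inflater.finished()
                        && (inflater.needsInput() || inflater.needsDictionary())) {
                    return false; // truncated or otherwise unusable input
                }
            }
            return true;
        } catch (DataFormatException e) {
            return false; // invalid zlib header or corrupt compressed data
        } finally {
            inflater.end();
        }
    }

    public static void main(String[] args) {
        byte[] plain = "BT /F1 12 Tf (Hello) Tj ET".getBytes(StandardCharsets.US_ASCII);
        byte[] compressed = deflate(plain);
        System.out.println("valid stream inflates: " + canInflate(compressed));
        System.out.println("scrambled stream inflates: " + canInflate(scramble(compressed)));
    }
}
```

The same reasoning is why the object number of the failing stream matters: it would show whether that particular stream was handed to the decoder before or after decryption.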

For further investigation it would be necessary to know the object number of the problematic stream or to have the complete routine you use for reading the document.

                


        

[jira] [Updated] (PDFBOX-1202) org.apache.pdfbox.filter.FlateFilter decode SEVERE: Stop reading corrupt stream

Posted by "Ilija Pavlic (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ilija Pavlic updated PDFBOX-1202:
---------------------------------

    Priority: Minor  (was: Critical)
    


        

[jira] [Commented] (PDFBOX-1202) org.apache.pdfbox.filter.FlateFilter decode SEVERE: Stop reading corrupt stream

Posted by "Timo Boehme (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13180272#comment-13180272 ] 

Timo Boehme commented on PDFBOX-1202:
-------------------------------------

Since the document is encrypted, it must be decrypted first; otherwise its streams cannot be processed. Normally decryption is done automatically by the PDDocument.load methods. While I could verify that reading (and displaying) text from the decrypted document does not produce the mentioned error, please note that the security settings of this document do not permit extracting text.

I think that this report should be closed as 'not a bug'.
                


        

[jira] [Updated] (PDFBOX-1202) org.apache.pdfbox.filter.FlateFilter decode SEVERE: Stop reading corrupt stream

Posted by "Ilija Pavlic (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ilija Pavlic updated PDFBOX-1202:
---------------------------------

    Comment: was deleted

(was: I am able to extract text from the entire document by calling `decrypt`, setting `document.setAllSecurityToBeRemoved(true);`, saving the document and reopening the saved document.)
    


        

[jira] [Updated] (PDFBOX-1202) org.apache.pdfbox.filter.FlateFilter decode SEVERE: Stop reading corrupt stream

Posted by "Ilija Pavlic (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ilija Pavlic updated PDFBOX-1202:
---------------------------------

    Description: 
Error "org.apache.pdfbox.filter.FlateFilter decode SEVERE: Stop reading corrupt stream" thrown when extracting text.

The error is thrown at:
- page 397 if the page loop starts at zero -- for (int i = 0; i < allPages.size(); i++)
- page 790 if the loop starts at 395 (that would make it approx. 397 pages from the beginning of the loop)
- page 848 if the loop starts at 450 (that would make it approx. 397 pages from the beginning of the loop)

The error is not thrown if:
- the loop starts at page 452 or later
- the loop starts at 0 and ends before 396
- the loop starts at 200 and ends before 595

Therefore I suspect that a loop spanning more than 396 pages will throw an error. Is that an indication of a memory leak of some sort?

Full code is below. Note that the result is the same when instantiating a single PDFTextStripperByArea outside the page loop and invoking resetEngine() on the stripper inside the page loop.

import java.awt.geom.Rectangle2D;
import java.io.IOException;
import java.util.List;

import org.apache.pdfbox.exceptions.COSVisitorException;
import org.apache.pdfbox.exceptions.CryptographyException;
import org.apache.pdfbox.exceptions.InvalidPasswordException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.util.PDFTextStripperByArea;

public class Main {

	public static void main(String[] args) throws IOException,
			COSVisitorException, CryptographyException {
		
		PDDocument document = null;
		try {
			document = PDDocument.load("/Users/ilijapavlic/Desktop/IATAUnitedStates.pdf");
			if (document.isEncrypted()) {
				try {
					document.decrypt("");
				} catch (InvalidPasswordException e) {
					System.err.println("Error: Document is encrypted with a password.");
					System.exit(1);
				}
			}

			float x = 55f;
			float y = 40f;
			float width = 168.5f;
			float height = 689f;
			float evenOffset = -10f;

			List allPages = document.getDocumentCatalog().getAllPages();

			for (int i = 0; i < allPages.size(); i++) {
				System.out.println("Page " + i);

				PDPage page = (PDPage) allPages.get(i);
				PDFTextStripperByArea stripper = new PDFTextStripperByArea();
				stripper.setSortByPosition(true);

				for (int j = 0; j < 3; j++)
				{
					if (i % 2 == 0) {
						Rectangle2D.Float region = new Rectangle2D.Float(x, y, width*3, height);
						stripper.addRegion("region", region);

					}
					else {
						Rectangle2D.Float region = new Rectangle2D.Float(x + evenOffset, y, width*3, height);
						stripper.addRegion("region", region);
					}
				}

				stripper.extractRegions(page);

				for (String regionName : stripper.getRegions())
				{
					stripper.getTextForRegion(regionName);
				}
			}
		}
		
		catch(Exception e) {
			e.printStackTrace();
		}

		finally {
			if (document != null) {
				document.close();
			}
		}
	}
}
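
One way to check the memory suspicion raised above is to print heap usage on every iteration of the page loop. The following is a minimal sketch using only the JDK. The class name HeapProbe and its helper methods are hypothetical, and the loop body only stands in for the extraction code; it does not call PDFBox.

```java
public class HeapProbe {

    /** Returns the heap currently in use by this JVM, in bytes. */
    public static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Builds a one-line report for the given page index (hypothetical helper). */
    public static String describe(int pageIndex) {
        long mb = 1024L * 1024L;
        long usedMb = usedHeapBytes() / mb;
        long maxMb = Runtime.getRuntime().maxMemory() / mb;
        return "Page " + pageIndex + ": " + usedMb + " MB used of " + maxMb + " MB max";
    }

    public static void main(String[] args) {
        // Placeholder loop: in the real program the region extraction
        // for page i would run before the report is printed.
        for (int i = 0; i < 3; i++) {
            System.out.println(describe(i));
        }
    }
}
```

If the used figure climbs steadily towards the maximum as the page index grows, that would point to memory exhaustion rather than to a corrupt stream on one particular page.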


    
> org.apache.pdfbox.filter.FlateFilter decode SEVERE: Stop reading corrupt stream
> -------------------------------------------------------------------------------
>
>                 Key: PDFBOX-1202
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1202
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.6.0
>         Environment: Mac OS X 10.7.2
>            Reporter: Ilija Pavlic
>            Priority: Critical
>         Attachments: IATAUnitedStates.pdf
>
>


        

[jira] [Commented] (PDFBOX-1202) org.apache.pdfbox.filter.FlateFilter decode SEVERE: Stop reading corrupt stream

Posted by "Andreas Lehmkühler (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13182437#comment-13182437 ] 

Andreas Lehmkühler commented on PDFBOX-1202:
--------------------------------------------

The recent version of org.apache.pdfbox.filter.FlateFilter (rev. 1227993) already provides a more specific error message, see PDFBOX-847 for further details.
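
To illustrate why a more specific message helps here: a handler that logs only a fixed text hides whether the underlying problem was bad data or something else entirely. The sketch below is not the FlateFilter code. LogCause, decode() and describe() are hypothetical names, and the example only shows the general pattern of putting the exception type into the message and passing the throwable to the logger so that its stack trace is printed.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class LogCause {

    private static final Logger LOG = Logger.getLogger(LogCause.class.getName());

    /** Hypothetical stand-in for a decode step that fails. */
    static void decode() {
        throw new IllegalStateException("simulated failure");
    }

    /** Builds a message that names the underlying exception type and text. */
    public static String describe(Throwable t) {
        return "decode failed due to " + t.getClass().getSimpleName()
                + ": " + t.getMessage();
    }

    public static void main(String[] args) {
        try {
            decode();
        } catch (RuntimeException e) {
            // Passing the exception as the last argument makes
            // java.util.logging print the stack trace as well.
            LOG.log(Level.SEVERE, describe(e), e);
        }
    }
}
```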
                


       

[jira] [Issue Comment Edited] (PDFBOX-1202) org.apache.pdfbox.filter.FlateFilter decode SEVERE: Stop reading corrupt stream

Posted by "Ilija Pavlic (Issue Comment Edited) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13180286#comment-13180286 ] 

Ilija Pavlic edited comment on PDFBOX-1202 at 1/5/12 11:24 AM:
---------------------------------------------------------------

I have edited the issue to include information about the encryption. I have used the inserted snippet to decrypt the document. Either way, I was able to extract text from pages 1-396 without problems. Therefore I do not think the issue should be closed. Can you please clarify the difference between reading and extracting? Thank you!
                
                  
> org.apache.pdfbox.filter.FlateFilter decode SEVERE: Stop reading corrupt stream
> -------------------------------------------------------------------------------
>
>                 Key: PDFBOX-1202
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1202
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.6.0
>            Reporter: Ilija Pavlic
>            Priority: Critical
>         Attachments: IATAUnitedStates.pdf
>
>
> Error "org.apache.pdfbox.filter.FlateFilter decode SEVERE: Stop reading corrupt stream" thrown when extracting text.
> The document was loaded with the following snippet:
> document = PDDocument.load("C:/Users/ilija.pavlic/Downloads/TestInput.pdf");
> if (document.isEncrypted()) {
>     try {
>         document.decrypt("");
>     } catch (InvalidPasswordException e) {
>         System.err.println("Error: Document is encrypted with a password.");
>         System.exit(1);
>     }
> }


        

[jira] [Commented] (PDFBOX-1202) org.apache.pdfbox.filter.FlateFilter decode SEVERE: Stop reading corrupt stream

Posted by "Ilija Pavlic (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13180300#comment-13180300 ] 

Ilija Pavlic commented on PDFBOX-1202:
--------------------------------------

I am able to extract text from the entire document by calling `decrypt`, setting `document.setAllSecurityToBeRemoved(true);`, saving the document and reopening the saved document.
                


        

[jira] [Updated] (PDFBOX-1202) org.apache.pdfbox.filter.FlateFilter decode SEVERE: Stop reading corrupt stream

Posted by "Ilija Pavlic (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ilija Pavlic updated PDFBOX-1202:
---------------------------------

    Attachment: IATAUnitedStates.pdf

The errors are thrown starting from page 397 in the attached document.
                


        

[jira] [Commented] (PDFBOX-1202) org.apache.pdfbox.filter.FlateFilter decode SEVERE: Stop reading corrupt stream

Posted by "Ilija Pavlic (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13180344#comment-13180344 ] 

Ilija Pavlic commented on PDFBOX-1202:
--------------------------------------

I am also able to extract the text from the entire document without removing security, by just invoking decrypt in a minimal working example.

The difference between the MWE and the buggy code is that I have been appending to the PDPageContentStream, so perhaps that was the reason for the error. I don't have the code at hand now; I'll look into it later in the day.
                
> org.apache.pdfbox.filter.FlateFilter decode SEVERE: Stop reading corrupt stream
> -------------------------------------------------------------------------------
>
>                 Key: PDFBOX-1202
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1202
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.6.0
>            Reporter: Ilija Pavlic
>            Priority: Critical
>         Attachments: IATAUnitedStates.pdf
>
>


        

[jira] [Commented] (PDFBOX-1202) org.apache.pdfbox.filter.FlateFilter decode SEVERE: Stop reading corrupt stream

Posted by "Ilija Pavlic (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183740#comment-13183740 ] 

Ilija Pavlic commented on PDFBOX-1202:
--------------------------------------

Java SE 6 on Eclipse for Mac has only 128 MB of heap available by default. Adding -Xms256M -Xmx512M to the default VM arguments prevents the exception from occurring.
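
To confirm which heap limit a given launch configuration actually applies, the JVM can be asked directly. This is a minimal sketch using only the JDK; the class name MaxHeap and the helper maxHeapMegabytes() are hypothetical.

```java
public class MaxHeap {

    /** Returns the maximum heap the JVM will attempt to use, in megabytes. */
    public static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMegabytes() + " MB");
    }
}
```

Running it once with the default settings and once with -Xmx512M in the VM arguments should show whether the new limit is being picked up. The reported figure can be slightly lower than the -Xmx value.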
                


        

[jira] [Updated] (PDFBOX-1202) org.apache.pdfbox.filter.FlateFilter decode SEVERE: Stop reading corrupt stream

Posted by "Ilija Pavlic (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ilija Pavlic updated PDFBOX-1202:
---------------------------------

    Description: 
Error "org.apache.pdfbox.filter.FlateFilter decode SEVERE: Stop reading corrupt stream" thrown when extracting text.

The document was loaded with the following snippet:

document = PDDocument.load("C:/Users/ilija.pavlic/Downloads/TestInput.pdf");
if (document.isEncrypted()) {
    try {
        document.decrypt("");
    } catch (InvalidPasswordException e) {
        System.err.println("Error: Document is encrypted with a password.");
        System.exit(1);
    }
}

  was:Error "org.apache.pdfbox.filter.FlateFilter decode SEVERE: Stop reading corrupt stream" thrown when extracting text.

    
