Posted to dev@pdfbox.apache.org by "Andreas Lehmkühler (JIRA)" <ji...@apache.org> on 2009/10/26 19:52:59 UTC

[jira] Updated: (PDFBOX-548) IOException in extracting text using PDFBox

     [ https://issues.apache.org/jira/browse/PDFBOX-548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler updated PDFBOX-548:
--------------------------------------

    Fix Version/s:     (was: 0.7.0)

> IOException in extracting text using PDFBox
> -------------------------------------------
>
>                 Key: PDFBOX-548
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-548
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 0.7.0
>            Reporter: Jignesh Sh
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> Hi All,
> I am getting an IOException when extracting text with PDFBox. The PDF file I am trying to read is NOT password protected.
> The program throws the IOException at this line:
> pdfText = stripper.getText(pdDoc);
> I am actually using PDFBox-0.6.7a.jar.
> Here is my code:
> public String getPDFContent(ZipEntry pdfEntry)
> {
>     String pdfText = null;
>     logger.debug("Processing : " + pdfEntry.getName());
>     COSDocument cosDoc = null;
>     PDDocument pdDoc = null;
>     try
>     {
>         cosDoc = parseDocument(zipFile.getInputStream(pdfEntry));
>         // skip the PDF document if it is encrypted
>         if (cosDoc.isEncrypted()) {
>             logger.warn("Cannot decrypt PDF document w/o password, skipping: " + pdfEntry.getName());
>             return pdfText;
>         }
>         // extract the PDF document's textual content
>         pdDoc = new PDDocument(cosDoc);
>         PDFTextStripper stripper = new PDFTextStripper();
>         pdfText = stripper.getText(pdDoc); // THIS LINE THROWS IOException
>     }
>     catch (IOException e) {
>         pdfText = null;
>         // pass the exception itself so the full stack trace is logged
>         logger.error("IOException in parsing PDF document: " + pdfEntry.getName(), e);
>     }
>     finally {
>         closeCOSDocument(cosDoc);
>         closePDDocument(pdDoc);
>     }
>     return pdfText;
> }
>
> private static COSDocument parseDocument(InputStream is) throws IOException {
>     PDFParser parser = new PDFParser(is);
>     parser.parse();
>     return parser.getDocument();
> }
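
Since the code above reads each PDF out of a zip archive, one thing that may be worth ruling out is a zip entry that is not really a PDF. The sketch below uses only the Java standard library, not PDFBox. The class PdfHeaderCheck and its methods are hypothetical names made up for this illustration. It checks whether a stream starts with the "%PDF-" header and pushes the bytes back so the stream can still be handed to a parser afterwards. The check is strict (header at offset 0); some readers tolerate a few bytes before the header, so a false result is a hint, not proof of a bad file.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of PDFBox.
class PdfHeaderCheck {
    // Every PDF file header begins with these five ASCII bytes.
    private static final byte[] MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // Wrap a stream so that the peeked header bytes can be pushed back.
    static PushbackInputStream wrap(InputStream is) {
        return new PushbackInputStream(is, MAGIC.length);
    }

    // Returns true if the stream starts with "%PDF-".
    // All bytes read are pushed back, so the stream position is unchanged.
    static boolean looksLikePdf(PushbackInputStream in) throws IOException {
        byte[] head = new byte[MAGIC.length];
        int total = 0;
        while (total < head.length) {
            int n = in.read(head, total, head.length - total);
            if (n < 0) {
                break; // stream ended before a full header was read
            }
            total += n;
        }
        if (total > 0) {
            in.unread(head, 0, total);
        }
        if (total < MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (head[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        byte[] sample = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        PushbackInputStream in = wrap(new ByteArrayInputStream(sample));
        System.out.println("looks like PDF: " + looksLikePdf(in));
    }
}
```

In the reporter's method this would be applied to zipFile.getInputStream(pdfEntry) before calling parseDocument, logging and skipping the entry when the check fails. Whether that explains the IOException in this issue is not known from the report; the logged stack trace would say more.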

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.