Posted to dev@pdfbox.apache.org by "Jignesh Sh (JIRA)" <ji...@apache.org> on 2009/11/16 13:07:39 UTC
[jira] Commented: (PDFBOX-547) problem in extracting text using PDFBox

    [ https://issues.apache.org/jira/browse/PDFBOX-547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778306#action_12778306 ] 

Jignesh Sh commented on PDFBOX-547:
-----------------------------------

This issue was resolved after I started using the latest two PDFBox jar files:
pdfbox-0.8.0-incubating.jar
fontbox-0.8.0-incubating.jar

Thanks,
Jignesh

> problem in extracting text using PDFBox
> ---------------------------------------
>
>                 Key: PDFBOX-547
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-547
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 0.7.0
>            Reporter: Jignesh Sh
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> Hi All,
> I am facing a problem extracting text using PDFBox.
> The program hangs at the line pdfText = stripper.getText(pdDoc); and returns nothing.
> I am using PDFBox version PDFBox-0.6.7a.jar.
> Here is my code:
> public String getPDFContent(ZipEntry pdfEntry)
> {
>     boolean status = false;
>     String pdfText = null;
>     ZipIssueFactory issueFactory = null;
>     logger.debug("Processing : " + pdfEntry.getName());
>     COSDocument cosDoc = null;
>     PDDocument pdDoc = null;
>     try
>     {
>         // Load InputStream into memory
>         cosDoc = parseDocument(zipFile.getInputStream(pdfEntry));
>
>         // skipping the PDF document, if it is encrypted
>         if (cosDoc.isEncrypted()) {
>             logger.warn("Can not decrypt PDF document w/o password, skipping:" + pdfEntry.getName());
>             return pdfText;
>         }
>         // extract PDF document's textual content
>         pdDoc = new PDDocument(cosDoc);
>         PDFTextStripper stripper = new PDFTextStripper();
>         pdfText = stripper.getText(pdDoc);
>     }
>     catch (IOException e) {
>         pdfText = null;
>         logger.error("IOException in parsing PDF document: " + e);
>     }
>     finally {
>         closeCOSDocument(cosDoc);
>         closePDDocument(pdDoc);
>     }
>     return pdfText;
> }
>
> private static COSDocument parseDocument(InputStream is) throws IOException {
>     PDFParser parser = new PDFParser(is);
>     parser.parse();
>     return parser.getDocument();
> }
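
For anyone who cannot upgrade right away, one possible way to keep a hang in stripper.getText(pdDoc) from blocking the caller is to run the extraction on a worker thread with a time limit. The following is only a minimal sketch built on java.util.concurrent; TimedExtraction and runWithTimeout are hypothetical names, not part of PDFBox, and this is a workaround rather than a fix for whatever makes the parser loop.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper (not a PDFBox class): bounds how long an extraction call may run. */
final class TimedExtraction {

    private TimedExtraction() {}

    /**
     * Runs the task on a worker thread and waits at most the given time.
     * Returns the task's result, or null if the task timed out or threw.
     */
    static String runWithTimeout(Callable<String> task, long timeout, TimeUnit unit) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "pdf-extract");
            t.setDaemon(true); // a stuck worker must not keep the JVM alive
            return t;
        });
        Future<String> future = executor.submit(task);
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            // Interrupts the worker; code that never checks the interrupt flag may keep running.
            future.cancel(true);
            return null;
        } catch (ExecutionException e) {
            return null; // the task itself threw, e.g. an IOException
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            future.cancel(true);
            return null;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

In getPDFContent, the stripper.getText(pdDoc) call could be wrapped in a Callable<String> and passed to runWithTimeout with whatever limit suits the documents being processed. Note that a timed-out worker may still be reading the document, so closing it in the finally block right afterwards is an assumption that would need checking against the PDFBox version in use.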

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.