Posted to dev@pdfbox.apache.org by "Andreas Lehmkühler (JIRA)" <ji...@apache.org> on 2009/10/26 19:52:59 UTC
[jira] Updated: (PDFBOX-548) IOException in extracting text using PDFBox
[ https://issues.apache.org/jira/browse/PDFBOX-548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andreas Lehmkühler updated PDFBOX-548:
--------------------------------------
Fix Version/s: (was: 0.7.0)
> IOException in extracting text using PDFBox
> -------------------------------------------
>
> Key: PDFBOX-548
> URL: https://issues.apache.org/jira/browse/PDFBOX-548
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 0.7.0
> Reporter: Jignesh Sh
> Original Estimate: 96h
> Remaining Estimate: 96h
>
> Hi all,
> I am getting an IOException when extracting text with PDFBox. The PDF file I am trying to read is NOT password protected.
> The program throws the IOException at this line:
> pdfText = stripper.getText(pdDoc);
> Note that I am actually using PDFBox version 0.6.7a (PDFBox-0.6.7a.jar).
> Here is my code:
> public String getPDFContent(ZipEntry pdfEntry)
> {
>     boolean status = false;
>     String pdfText = null;
>     ZipIssueFactory issueFactory = null;
>     logger.debug("Processing : " + pdfEntry.getName());
>     COSDocument cosDoc = null;
>     PDDocument pdDoc = null;
>     try
>     {
>         cosDoc = parseDocument(zipFile.getInputStream(pdfEntry));
>         // skipping the PDF document, if it is encrypted
>         if (cosDoc.isEncrypted()) {
>             logger.warn("Can not decrypt PDF document w/o password, skipping:" + pdfEntry.getName());
>             return pdfText;
>         }
>         // extract PDF document's textual content
>         pdDoc = new PDDocument(cosDoc);
>         PDFTextStripper stripper = new PDFTextStripper();
>         pdfText = stripper.getText(pdDoc); // THIS LINE THROWS IOException
>     }
>     catch (IOException e) {
>         pdfText = null;
>         logger.error("IOException in parsing PDF document: " + e);
>     }
>     finally {
>         closeCOSDocument(cosDoc);
>         closePDDocument(pdDoc);
>     }
>     return pdfText;
> }
>
> private static COSDocument parseDocument(InputStream is) throws IOException {
>     PDFParser parser = new PDFParser(is);
>     parser.parse();
>     return parser.getDocument();
> }
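
The catch block above logs only "IOException in parsing PDF document: " + e, which records the exception's class and message but not the stack trace or any nested cause. Since the report does not include the trace, a small helper along the following lines might make the failure easier to diagnose. This is only a sketch using the standard Java library; the class name ExceptionDetails and the method name stackTraceOf are hypothetical and are not part of PDFBox or of the reporter's code.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

// Hypothetical helper (not part of PDFBox): turns a Throwable into its full
// stack trace text, including any "Caused by:" chain, so it can be logged.
final class ExceptionDetails {

    private ExceptionDetails() {
        // static helper only, no instances
    }

    /** Returns the text that t.printStackTrace() would print, as a String. */
    static String stackTraceOf(Throwable t) {
        StringWriter buffer = new StringWriter();
        PrintWriter writer = new PrintWriter(buffer);
        t.printStackTrace(writer); // writes the message, frames, and nested causes
        writer.flush();
        return buffer.toString();
    }
}
```

In the catch block, the existing call could then become logger.error("IOException in parsing PDF document: " + ExceptionDetails.stackTraceOf(e)). If the logger in use has an overload that accepts a Throwable as a second argument, passing e there would likely achieve the same result without a helper.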
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.