
Posted to dev@pdfbox.apache.org by "Jignesh Sh (JIRA)" <ji...@apache.org> on 2009/10/26 14:00:59 UTC

[jira] Created: (PDFBOX-547) problem in extracting text using PDFBox

problem in extracting text using PDFBox
---------------------------------------

                 Key: PDFBOX-547
                 URL: https://issues.apache.org/jira/browse/PDFBOX-547
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 0.7.0
            Reporter: Jignesh Sh


Hi All,
I am facing a problem extracting text using PDFBox.
The program hangs at the line pdfText = stripper.getText(pdDoc); and returns nothing.
I am using PDFBox version PDFBox-0.6.7a.jar.
Here is my code:

public String getPDFContent(ZipEntry pdfEntry)
{
    boolean status = false;
    String pdfText = null;
    ZipIssueFactory issueFactory = null;
    logger.debug("Processing : " + pdfEntry.getName());
    COSDocument cosDoc = null;
    PDDocument pdDoc = null;
    try
    {
        // Load InputStream into memory
        cosDoc = parseDocument(zipFile.getInputStream(pdfEntry));

        // skipping the PDF document, if it is encrypted
        if (cosDoc.isEncrypted()) {
            logger.warn("Can not decrypt PDF document w/o password, skipping:" + pdfEntry.getName());
            return pdfText;
        }
        // extract PDF document's textual content
        pdDoc = new PDDocument(cosDoc);
        PDFTextStripper stripper = new PDFTextStripper();
        pdfText = stripper.getText(pdDoc);
    }
    catch (IOException e) {
        pdfText = null;
        logger.error("IOException in parsing PDF document: " + e);
    }
    finally {
        closeCOSDocument(cosDoc);
        closePDDocument(pdDoc);
    }
    return pdfText;
}

private static COSDocument parseDocument(InputStream is) throws IOException {
    PDFParser parser = new PDFParser(is);
    parser.parse();
    return parser.getDocument();
}
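
Because the call hangs instead of throwing, the IOException handler in the code above never gets a chance to run. One possible way to keep a batch run over many ZIP entries from stalling on a single document is to run the extraction on a worker thread and give up after a time limit. The following is only a sketch using the standard java.util.concurrent classes; TimedExtraction and extractWithTimeout are hypothetical names, not part of PDFBox, and the task passed in is assumed to wrap the stripper.getText(pdDoc) call. It does not address the cause of the hang.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimedExtraction {

    /**
     * Runs the given extraction task on a worker thread and waits at most
     * timeoutMillis for its result. Returns null if the task times out or
     * throws, matching the null-on-error convention of getPDFContent.
     */
    public static String extractWithTimeout(Callable<String> task, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "pdf-extract");
            t.setDaemon(true); // a stuck worker must not keep the JVM alive
            return t;
        });
        Future<String> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Interrupts the worker; code that never checks for
            // interruption may keep running in the background.
            future.cancel(true);
            return null;
        } catch (ExecutionException e) {
            return null; // the task itself threw, e.g. an IOException
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            future.cancel(true);
            return null;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Fast task: finishes well inside the limit.
        System.out.println(extractWithTimeout(() -> "some text", 1000));
        // Slow task standing in for a hanging getText call.
        System.out.println(extractWithTimeout(() -> {
            Thread.sleep(5000);
            return "never returned";
        }, 200));
    }
}
```

In getPDFContent, the call could then look like extractWithTimeout(() -> stripper.getText(pdDoc), limit), with a limit chosen for the expected document sizes. A timeout only lets the caller move on; a worker that ignores interruption keeps consuming CPU until the JVM exits.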

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Resolved: (PDFBOX-547) problem in extracting text using PDFBox

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler resolved PDFBOX-547.
---------------------------------------

    Resolution: Duplicate

This issue duplicates PDFBOX-548

> problem in extracting text using PDFBox
> ---------------------------------------
>
>                 Key: PDFBOX-547
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-547
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 0.7.0
>            Reporter: Jignesh Sh
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>


[jira] Commented: (PDFBOX-547) problem in extracting text using PDFBox

Posted by "Jignesh Sh (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778306#action_12778306 ] 

Jignesh Sh commented on PDFBOX-547:
-----------------------------------

This issue was resolved after I switched to the following two latest PDFBox jar files:
pdfbox-0.8.0-incubating.jar
fontbox-0.8.0-incubating.jar

Thanks,
Jignesh



[jira] Closed: (PDFBOX-547) problem in extracting text using PDFBox

Posted by "Jignesh Sh (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jignesh Sh closed PDFBOX-547.
-----------------------------


Closing this issue

