Posted to dev@pdfbox.apache.org by "Jignesh Sh (JIRA)" <ji...@apache.org> on 2009/10/26 14:00:59 UTC
[jira] Created: (PDFBOX-547) problem in extracting text using PDFBox
problem in extracting text using PDFBox
---------------------------------------
Key: PDFBOX-547
URL: https://issues.apache.org/jira/browse/PDFBOX-547
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 0.7.0
Reporter: Jignesh Sh
Hi All,
I am having a problem extracting text with PDFBox.
The program hangs at the line pdfText = stripper.getText(pdDoc); and returns nothing.
I am actually using PDFBox version 0.6.7a (PDFBox-0.6.7a.jar).
Here is my code:
public String getPDFContent(ZipEntry pdfEntry)
{
    boolean status = false;
    String pdfText = null;
    ZipIssueFactory issueFactory = null;
    logger.debug("Processing : " + pdfEntry.getName());
    COSDocument cosDoc = null;
    PDDocument pdDoc = null;
    try
    {
        cosDoc = parseDocument(zipFile.getInputStream(pdfEntry)); // Load InputStream into memory
        // skipping the PDF document, if it is encrypted
        if (cosDoc.isEncrypted()) {
            logger.warn("Can not decrypt PDF document w/o password, skipping:"+ pdfEntry.getName());
            return pdfText;
        }
        // extract PDF document's textual content
        pdDoc = new PDDocument(cosDoc);
        PDFTextStripper stripper = new PDFTextStripper();
        pdfText = stripper.getText(pdDoc);
    }
    catch (IOException e) {
        pdfText = null;
        logger.error("IOException in parsing PDF document: " + e);
    }
    finally {
        closeCOSDocument(cosDoc);
        closePDDocument(pdDoc);
    }
    return pdfText;
}

private static COSDocument parseDocument(InputStream is) throws IOException {
    PDFParser parser = new PDFParser(is);
    parser.parse();
    return parser.getDocument();
}
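
While the cause of a hang like this is being tracked down, the caller can at least bound how long it waits for the extraction. The following is only a sketch and is not part of PDFBox: it uses nothing but java.util.concurrent, and the class name TimedExtraction, the method runWithTimeout and the two stand-in tasks are hypothetical. In the code above, the Callable would wrap the stripper.getText(pdDoc) call.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimedExtraction {

    /**
     * Runs the task on a worker thread and waits at most timeoutMillis for it.
     * Returns the task's result, or null if it timed out, failed, or the
     * calling thread was interrupted.
     */
    public static String runWithTimeout(Callable<String> task, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "pdf-extract");
            t.setDaemon(true); // a stuck worker must not keep the JVM alive
            return t;
        });
        Future<String> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // cancel(true) only interrupts the worker; code that ignores
            // interrupts keeps running in the daemon thread
            future.cancel(true);
            return null;
        } catch (ExecutionException e) {
            return null; // the task itself threw an exception
        } catch (InterruptedException e) {
            future.cancel(true);
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return null;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Stand-in for an extraction that finishes normally
        String fast = runWithTimeout(() -> "extracted text", 1000);
        System.out.println(fast);

        // Stand-in for an extraction that hangs
        String stuck = runWithTimeout(() -> {
            Thread.sleep(60_000);
            return "never returned";
        }, 200);
        System.out.println(stuck);
    }
}
```

This does not fix the hang itself; it only lets the caller log the offending ZIP entry and move on to the next one. Because a timed-out worker may still be running, the document should not be closed or reused from the calling thread on the assumption that the worker has stopped.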
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Resolved: (PDFBOX-547) problem in extracting text using PDFBox
Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PDFBOX-547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andreas Lehmkühler resolved PDFBOX-547.
---------------------------------------
Resolution: Duplicate
This issue duplicates PDFBOX-548
> problem in extracting text using PDFBox
> ---------------------------------------
>
> Key: PDFBOX-547
> URL: https://issues.apache.org/jira/browse/PDFBOX-547
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 0.7.0
> Reporter: Jignesh Sh
> Original Estimate: 96h
> Remaining Estimate: 96h
[jira] Commented: (PDFBOX-547) problem in extracting text using PDFBox
Posted by "Jignesh Sh (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PDFBOX-547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778306#action_12778306 ]
Jignesh Sh commented on PDFBOX-547:
-----------------------------------
This issue was resolved after I switched to the following two latest PDFBox jar files:
pdfbox-0.8.0-incubating.jar
fontbox-0.8.0-incubating.jar
Thanks,
Jignesh
[jira] Closed: (PDFBOX-547) problem in extracting text using PDFBox
Posted by "Jignesh Sh (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PDFBOX-547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jignesh Sh closed PDFBOX-547.
-----------------------------
Closing this issue