Posted to dev@pdfbox.apache.org by "Vicente (JIRA)" <ji...@apache.org> on 2014/03/02 11:19:19 UTC
[jira] [Created] (PDFBOX-1956) Wrong character on conversion PDF to TXT
Vicente created PDFBOX-1956:
-------------------------------
Summary: Wrong character on conversion PDF to TXT
Key: PDFBOX-1956
URL: https://issues.apache.org/jira/browse/PDFBOX-1956
Project: PDFBox
Issue Type: Task
Components: Parsing
Affects Versions: 1.8.4
Environment: Windows
Reporter: Vicente
I am trying to convert PDF files to TXT. For some PDFs, the String returned after conversion contains wrong characters. Could this be a Unicode problem? Can somebody help me?
The code:
// Imports assume the PDFBox 1.8.x package layout.
import java.io.File;
import java.io.FileInputStream;

import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentInformation;
import org.apache.pdfbox.util.PDFTextStripper;

public class PDFTextParser {

    PDFParser parser;
    String parsedText;
    PDFTextStripper pdfStripper;
    PDDocument pdDoc;
    COSDocument cosDoc;
    PDDocumentInformation pdDocInfo;

    // PDFTextParser constructor
    public PDFTextParser() {
    }

    // Extract text from a PDF document; returns null on any failure
    public String pdftoText(String fileName) {
        System.out.println("Parsing text from PDF file " + fileName + "....");
        File f = new File(fileName);
        if (!f.isFile()) {
            System.out.println("File " + fileName + " does not exist.");
            return null;
        }

        // Open the parser on the input file
        try {
            parser = new PDFParser(new FileInputStream(f));
        } catch (Exception e) {
            System.out.println("Unable to open PDF Parser.");
            return null;
        }

        // Parse the document and strip its text
        try {
            parser.parse();
            cosDoc = parser.getDocument();
            pdfStripper = new PDFTextStripper();
            pdDoc = new PDDocument(cosDoc);
            parsedText = pdfStripper.getText(pdDoc);
        } catch (Exception e) {
            System.out.println("An exception occurred in parsing the PDF Document.");
            e.printStackTrace();
            try {
                if (cosDoc != null) cosDoc.close();
                if (pdDoc != null) pdDoc.close();
            } catch (Exception e1) {
                // Report the failure to close, not the earlier parse exception
                e1.printStackTrace();
            }
            return null;
        }
        System.out.println("Done.");
        return parsedText;
    }
}
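
One way to narrow down whether this is a Unicode problem is to print the code points of the returned String instead of the characters themselves. If the code points are already wrong, the problem is in the extraction. If they are correct, the console or output file encoding is the more likely cause. The following is only a minimal sketch: the class CodePointDump and its dump method are hypothetical helpers written for illustration, not part of PDFBox, and the sample string stands in for the real output of pdftoText.

```java
public class CodePointDump {

    // Hypothetical helper: renders every code point of the text as U+XXXX,
    // separated by spaces, so the result does not depend on console encoding.
    static String dump(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            // Advance by 2 chars for supplementary code points, 1 otherwise
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the String returned by pdftoText (assumption)
        String sample = "caf\u00e9";
        System.out.println(dump(sample));
    }
}
```

A replacement character (U+FFFD) or unexpected private-use code points in the dump would suggest that the text was already damaged before it was printed.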