Posted to dev@pdfbox.apache.org by "Vicente (JIRA)" <ji...@apache.org> on 2014/03/03 10:53:20 UTC
[jira] [Comment Edited] (PDFBOX-1956) Wrong character on conversion PDF to TXT
[ https://issues.apache.org/jira/browse/PDFBOX-1956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13917885#comment-13917885 ]
Vicente edited comment on PDFBOX-1956 at 3/3/14 9:51 AM:
---------------------------------------------------------
Both files have the same content. File A was created by Word and file B was created by PDFCreator.
When I convert file A to text the result is OK, but when I convert file B it is not. For example, the original text (Object) is converted to the wrong characters (2EMHFWV). Could it be an encoding problem?
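A quick check on the sample string, offered as a sketch rather than a diagnosis: adding 29 to each character code of 2EMHFWV yields Objects. A constant shift like this is often seen when extraction returns glyph codes from an embedded font that has no usable ToUnicode mapping. Whether that applies to file B is an assumption until the font dictionaries in the PDF are inspected. The class and method names below are hypothetical and use only the Java standard library, not PDFBox.

```java
public class GlyphOffsetDemo {

    // Adds a fixed offset to every character code of the input string.
    static String shift(String text, int offset) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            sb.append((char) (text.charAt(i) + offset));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String garbled = "2EMHFWV"; // string reported for file B
        // The offset 29 is an observation about this one sample,
        // not a PDFBox constant or setting.
        System.out.println(shift(garbled, 29)); // prints "Objects"
    }
}
```

If the shift holds for other strings from file B as well, that would point to the font encoding in the PDF rather than to the extraction code.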
was (Author: vicente):
Both files have the same content. File A was created by Word and file B was created by PDFCreator.
> Wrong character on conversion PDF to TXT
> ----------------------------------------
>
> Key: PDFBOX-1956
> URL: https://issues.apache.org/jira/browse/PDFBOX-1956
> Project: PDFBox
> Issue Type: Task
> Components: Parsing
> Affects Versions: 1.8.4
> Environment: Windows
> Reporter: Vicente
> Labels: parser
> Attachments: example a.pdf, example b.pdf
>
>
> I am trying to convert PDF to TXT, and for some PDFs the resulting String contains wrong characters. Could it be a Unicode problem? Can somebody help me?
> I observed that the problem occurs when converting a PDF created by PDFCreator to text: the characters are wrong. Any suggestions?
> The code:
>
> import java.io.File;
> import java.io.FileInputStream;
>
> import org.apache.pdfbox.cos.COSDocument;
> import org.apache.pdfbox.pdfparser.PDFParser;
> import org.apache.pdfbox.pdmodel.PDDocument;
> import org.apache.pdfbox.pdmodel.PDDocumentInformation;
> import org.apache.pdfbox.util.PDFTextStripper;
>
> public class PDFTextParser {
>
>     PDFParser parser;
>     String parsedText;
>     PDFTextStripper pdfStripper;
>     PDDocument pdDoc;
>     COSDocument cosDoc;
>     PDDocumentInformation pdDocInfo;
>
>     // PDFTextParser Constructor
>     public PDFTextParser() {
>     }
>
>     // Extract text from PDF Document; returns null on any failure
>     public String pdftoText(String fileName) {
>
>         System.out.println("Parsing text from PDF file " + fileName + "....");
>         File f = new File(fileName);
>
>         if (!f.isFile()) {
>             System.out.println("File " + fileName + " does not exist.");
>             return null;
>         }
>
>         try {
>             parser = new PDFParser(new FileInputStream(f));
>         } catch (Exception e) {
>             System.out.println("Unable to open PDF Parser.");
>             return null;
>         }
>
>         try {
>             parser.parse();
>             cosDoc = parser.getDocument();
>             pdfStripper = new PDFTextStripper();
>             pdDoc = new PDDocument(cosDoc);
>             parsedText = pdfStripper.getText(pdDoc);
>         } catch (Exception e) {
>             System.out.println("An exception occurred in parsing the PDF Document.");
>             e.printStackTrace();
>             try {
>                 if (cosDoc != null) cosDoc.close();
>                 if (pdDoc != null) pdDoc.close();
>             } catch (Exception e1) {
>                 // Report the failure from close(), not the original exception again
>                 e1.printStackTrace();
>             }
>             return null;
>         }
>         System.out.println("Done.");
>         return parsedText;
>     }
> }
>
--
This message was sent by Atlassian JIRA
(v6.2#6252)