Posted to dev@pdfbox.apache.org by "John Hewson (JIRA)" <ji...@apache.org> on 2014/10/11 01:59:48 UTC
[jira] [Updated] (PDFBOX-1706) Reading PDF documents that contain special characters (e.g. €) cause warning and invalid parse result
[ https://issues.apache.org/jira/browse/PDFBOX-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
John Hewson updated PDFBOX-1706:
--------------------------------
Fix Version/s: 2.0.0
> Reading PDF documents that contain special characters (e.g. €) cause warning and invalid parse result
> -----------------------------------------------------------------------------------------------------
>
> Key: PDFBOX-1706
> URL: https://issues.apache.org/jira/browse/PDFBOX-1706
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.8.2, 2.0.0
> Environment: Windows
> Reporter: Robert Neumann
> Labels: patch
> Fix For: 2.0.0
>
>
> When trying to call stripper.getText on the PDF file http://www.edi-energy.de/files2/EDI@Energy%20UTILMD%205.1_20130401.pdf, PDFBox 1.8.2 emits the following warning:
> 08:48:20,222 WARN PDFStreamEngine:567 - java.io.IOException: Error: Could not find font(COSName{F7}) in map={F1=org.apache.pdfbox.pdmodel.font.PDTrueTypeFont@676825b5, F2=org.apache.pdfbox.pdmodel.font.PDTrueTypeFont@547e97d8}
> java.io.IOException: Error: Could not find font(COSName{F7}) in map={F1=org.apache.pdfbox.pdmodel.font.PDTrueTypeFont@676825b5, F2=org.apache.pdfbox.pdmodel.font.PDTrueTypeFont@547e97d8}
> at org.apache.pdfbox.util.operator.SetTextFont.process(SetTextFont.java:57)
> at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:554)
> at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
> at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
> at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
> at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:455)
> at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:379)
> at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:335)
> at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:254)
> Interestingly, PDFBox 2.0 emits a different warning that calls out the problem more precisely:
> Aug 27, 2013 9:35:30 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont extractToUnicodeEncoding
> SEVERE: Error: Could not load embedded ToUnicode CMap
> Aug 27, 2013 9:35:30 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont getSpaceWidth
> SEVERE: Can't determine the width of the space character using 250 as default
> java.lang.NullPointerException
> at org.apache.pdfbox.pdmodel.font.PDSimpleFont.getSpaceWidth(PDSimpleFont.java:406)
> at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:343)
> at org.apache.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:62)
> at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:529)
> at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:258)
> at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:225)
> at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:205)
> at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:455)
> at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:379)
> at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:335)
> at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:254)
> We traced the problem to pages that contain special characters (e.g. €). In the referenced PDF document, pages without such characters do not trigger the warnings shown above. The text on the affected pages is not parsed correctly: the extracted result contains garbled bytes instead of the expected characters.
> Adobe Reader displays the entire document correctly.
> The following snippet should serve as a repro:
> package com.regiocom.bpo.mig;
>
> import java.io.File;
> import java.io.FileInputStream;
> import java.io.FileNotFoundException;
> import java.io.IOException;
> import java.util.List;
>
> import org.apache.pdfbox.pdfparser.PDFParser;
> import org.apache.pdfbox.pdmodel.PDDocument;
> import org.apache.pdfbox.util.PDFTextStripper;
> import org.apache.pdfbox.util.Splitter;
>
> public class Repro {
>
>     public Repro() {
>
>         try {
>             stripper = new PDFTextStripper();
>         } catch (IOException e) {
>             e.printStackTrace();
>         }
>     }
>
>     // use this PDF as input: http://www.edi-energy.de/files2/EDI@Energy%20UTILMD%205.1_20130401.pdf
>     public void run(String pdfFile) {
>
>         PDDocument[] documents = loadAndSplitFile(pdfFile, 1);
>
>         for (PDDocument document : documents) {
>             parse(document);
>         }
>     }
>
>     private PDDocument[] loadAndSplitFile(String pdfFile, int splitPage) {
>
>         List<PDDocument> documents;
>         Splitter splitter = new Splitter();
>         PDFParser parser;
>
>         try {
>             parser = new PDFParser(new FileInputStream(new File(pdfFile)));
>             parser.parse();
>
>             PDDocument doc = parser.getPDDocument();
>
>             splitter.setSplitAtPage(splitPage);
>
>             documents = splitter.split(doc);
>
>             doc.close();
>
>             return documents.toArray(new PDDocument[]{});
>         } catch (FileNotFoundException e) {
>             e.printStackTrace();
>
>         } catch (IOException e) {
>             e.printStackTrace();
>         }
>
>         return null;
>     }
>
>     private void parse(PDDocument pdfFile) {
>         try {
>             stripper.getText(pdfFile);
>         } catch (IOException e) {
>             e.printStackTrace();
>         }
>     }
>
>     private PDFTextStripper stripper;
> }
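
The repro above discards the string returned by stripper.getText, so the garbled output is only visible indirectly through the warnings. As a rough aid for narrowing down which of the split pages are affected, the returned text could be passed through a heuristic like the sketch below. It is not part of PDFBox: the class name GarbledTextCheck, its methods, and the 10% threshold are all assumptions made for illustration, and the check only counts characters that rarely occur in correctly extracted text (control characters, U+FFFD, private-use, unassigned and lone surrogate code points). A legitimate € sign is not counted as suspicious.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of PDFBox: flags extracted page text that
// looks like garbled bytes rather than readable characters.
final class GarbledTextCheck {

    // Assumed threshold: flag a page when more than 10% of its code points are suspicious.
    static final double SUSPICIOUS_RATIO = 0.10;

    // A code point is treated as suspicious if it is a control character
    // (other than common whitespace), the replacement character U+FFFD,
    // a private-use code point, a lone surrogate, or unassigned.
    static boolean isSuspicious(int codePoint) {
        if (codePoint == '\n' || codePoint == '\r' || codePoint == '\t') {
            return false;
        }
        int type = Character.getType(codePoint);
        return Character.isISOControl(codePoint)
                || codePoint == 0xFFFD
                || type == Character.PRIVATE_USE
                || type == Character.SURROGATE
                || !Character.isDefined(codePoint);
    }

    // Fraction of suspicious code points in the text; 0.0 for empty text.
    static double suspiciousRatio(String text) {
        int total = 0;
        int suspicious = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            total++;
            if (isSuspicious(cp)) {
                suspicious++;
            }
        }
        return total == 0 ? 0.0 : (double) suspicious / total;
    }

    static boolean looksGarbled(String text) {
        return suspiciousRatio(text) > SUSPICIOUS_RATIO;
    }

    // Returns the 1-based indexes of the pages whose text looks garbled.
    static List<Integer> garbledPages(List<String> pageTexts) {
        List<Integer> pages = new ArrayList<Integer>();
        for (int i = 0; i < pageTexts.size(); i++) {
            if (looksGarbled(pageTexts.get(i))) {
                pages.add(i + 1);
            }
        }
        return pages;
    }

    public static void main(String[] args) {
        // Stand-in strings; in the repro these would be the results of stripper.getText(document).
        List<String> pageTexts = Arrays.asList("Preis: 5 \u20AC", "\u0001\u0002\u0003abc");
        System.out.println("Pages that look garbled: " + garbledPages(pageTexts));
    }
}
```

In the repro, parse(document) could collect the result of stripper.getText(pdfFile) into a list and hand that list to garbledPages. The threshold is arbitrary and may need tuning; the heuristic can also miss pages where the wrong but printable characters were extracted.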