Posted to dev@pdfbox.apache.org by "Stefan Postema (JIRA)" <ji...@apache.org> on 2014/12/05 09:54:12 UTC
[jira] [Created] (PDFBOX-2545) ExtractText extracts filename and date
Stefan Postema created PDFBOX-2545:
--------------------------------------
Summary: ExtractText extracts filename and date
Key: PDFBOX-2545
URL: https://issues.apache.org/jira/browse/PDFBOX-2545
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 1.8.7
Reporter: Stefan Postema
With PDFBox 1.8 (and also with a snapshot of 2.0.0), the text produced by the ExtractText command also contains the original Adobe InDesign filename, together with a date and the images used.
Command line example:
java -jar pdfbox-app-2.0.0-SNAPSHOT.jar ExtractText 07-ALS-Onvoldoende-eten.pdf test.txt
The first lines of this test.txt file are:
VSN_Briefpapier_ontwerp_V03.indd 1 06-04-12 11:02
Wat kan ik doen als het niet lukt om voldoende te eten? ALS en voeding
Drinkvoeding
The expected output does not contain the line with the filename and date.
When the text is copied and pasted from Adobe Reader, the InDesign filename does not show up. The command-line tool 'pdftotext' does not output the line with the filename either.
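As a stopgap until the extraction itself is looked at, the unwanted line could be dropped in a post-processing step. The following is only a minimal sketch, not part of PDFBox: the class name SlugLineFilter and the method stripSlugLines are hypothetical, and the pattern "<name>.indd <page> <dd-mm-yy> <hh:mm>" is an assumption inferred from the single sample line above, so it may not match other documents.

```java
import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper, not part of PDFBox.
class SlugLineFilter {

    // Assumed shape of the unwanted line, inferred from one sample only:
    // "<name>.indd <page number> <dd-mm-yy> <hh:mm>"
    private static final Pattern SLUG = Pattern.compile(
            "^\\S+\\.indd\\s+\\d+\\s+\\d{2}-\\d{2}-\\d{2}\\s+\\d{1,2}:\\d{2}\\s*$");

    // Returns the text with every line matching the assumed pattern removed.
    // Remaining lines are joined with "\n".
    static String stripSlugLines(String text) {
        return Arrays.stream(text.split("\\R"))
                .filter(line -> !SLUG.matcher(line).matches())
                .collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        // First lines of test.txt as quoted in this report.
        String extracted =
                "VSN_Briefpapier_ontwerp_V03.indd 1 06-04-12 11:02\n"
              + "Wat kan ik doen als het niet lukt om voldoende te eten? ALS en voeding\n"
              + "Drinkvoeding";
        System.out.println(stripSlugLines(extracted));
    }
}
```

This only hides the symptom in the output of ExtractText; it does not explain why PDFBox picks up the line while Adobe Reader and pdftotext do not.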
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)