Posted to dev@pdfbox.apache.org by "Maruan Sahyoun (JIRA)" <ji...@apache.org> on 2014/12/05 18:44:15 UTC

[jira] [Commented] (PDFBOX-2545) ExtractText extracts filename and date

    [ https://issues.apache.org/jira/browse/PDFBOX-2545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14235780#comment-14235780 ] 

Maruan Sahyoun commented on PDFBOX-2545:
----------------------------------------

The text is part of the PDF content stream:

{code}
BT
/CS2 cs 1  scn
/GS1 gs
/TT0 1 Tf
4.4875 0 0 4.4875 -178.0243 187.425 Tm
(VSN_Briefpapier_ontwerp_V03.indd   1)Tj
ET
{code}
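
One possible reason a viewer does not show this text, even though it is in the content stream, is that it is positioned outside the visible page area. The following is only a rough sketch of how that could be checked from the operands above, not a statement of what PDFBox does. It reads the six Tm operands and tests whether the text origin (the last two operands) lies inside a page box. The class and method names are hypothetical, the 595 x 842 box (A4) is an assumption rather than a value taken from the attached file, and the sketch assumes no other transformation (cm) is in effect, which the excerpt does not show.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox.
public class TmOriginCheck {

    // Six numeric operands followed by the Tm operator.
    private static final Pattern TM =
        Pattern.compile("((?:[-+]?\\d*\\.?\\d+\\s+){6})Tm\\b");

    // Returns the operands a b c d e f of the first Tm operator found.
    static double[] parseTm(String content) {
        Matcher m = TM.matcher(content);
        if (!m.find()) {
            throw new IllegalArgumentException("no Tm operator found");
        }
        String[] parts = m.group(1).trim().split("\\s+");
        double[] tm = new double[6];
        for (int i = 0; i < 6; i++) {
            tm[i] = Double.parseDouble(parts[i]);
        }
        return tm;
    }

    // True if the translation part (e, f) of the matrix lies inside the box.
    // Assumes the current transformation matrix is the identity.
    static boolean originInsideBox(double[] tm, double llx, double lly,
                                   double urx, double ury) {
        double x = tm[4];
        double y = tm[5];
        return x >= llx && x <= urx && y >= lly && y <= ury;
    }

    public static void main(String[] args) {
        String content = "BT\n/CS2 cs 1  scn\n/GS1 gs\n/TT0 1 Tf\n"
            + "4.4875 0 0 4.4875 -178.0243 187.425 Tm\n"
            + "(VSN_Briefpapier_ontwerp_V03.indd   1)Tj\nET";
        double[] tm = parseTm(content);
        // Assumed page box: A4 in default user space units.
        boolean inside = originInsideBox(tm, 0, 0, 595, 842);
        System.out.println("origin inside assumed box: " + inside);
    }
}
```

With the assumed box the origin x of -178.0243 falls to the left of the page, which would be consistent with Adobe Reader and pdftotext not showing the line. The real page boxes and any surrounding cm operators of the attached PDF would have to be checked before drawing that conclusion.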

> ExtractText extracts filename and date
> --------------------------------------
>
>                 Key: PDFBOX-2545
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2545
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.7
>            Reporter: Stefan Postema
>         Attachments: 07-ALS-Onvoldoende-eten.pdf
>
>
> With PDFBox 1.8 (and also a 2.0.0 snapshot), the ExtractText command produces text that also contains the original Adobe InDesign filename (as well as the date and the images used).
> Command line example:
> java -jar pdfbox-app-2.0.0-SNAPSHOT.jar ExtractText 07-ALS-Onvoldoende-eten.pdf test.txt
> The first lines of this test.txt file are:
> VSN_Briefpapier_ontwerp_V03.indd   1 06-04-12   11:02
> Wat kan ik doen als het niet lukt om voldoende te eten? ALS en voeding
> Drinkvoeding
> The output should not contain the filename and date.
> When copying and pasting the text from Adobe Reader, the InDesign filename does not show up. The CLI tool 'pdftotext' also does not output the line with the filename.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)