Posted to dev@pdfbox.apache.org by "Andreas Meier (JIRA)" <ji...@apache.org> on 2018/03/06 08:14:00 UTC
[jira] [Comment Edited] (PDFBOX-4141) Suppress control characters?

    [ https://issues.apache.org/jira/browse/PDFBOX-4141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16387439#comment-16387439 ] 

Andreas Meier edited comment on PDFBOX-4141 at 3/6/18 8:13 AM:
---------------------------------------------------------------

Thanks for the info, Tilman.

Overriding the characters in writeCharacters will not be a problem.
The main question is whether it should be possible to turn this on/off via a switch implemented in PDFBox master, since Adobe themselves do some replacements in their text extraction methods.
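
To illustrate what such a switch could look like, here is a minimal sketch in plain Java with no PDFBox dependency. The class name ControlCharFilter and the flag suppressControlChars are hypothetical and are not an existing PDFBox option; keeping tab, line feed and carriage return is also just an assumption. In a PDFTextStripper subclass, something like this could presumably be applied to the text inside an overridden write method before it reaches the output.

```java
// Hypothetical helper, not part of PDFBox: drops control characters when enabled.
final class ControlCharFilter {

    // Hypothetical on/off switch; when false the text passes through unchanged.
    private final boolean suppressControlChars;

    ControlCharFilter(boolean suppressControlChars) {
        this.suppressControlChars = suppressControlChars;
    }

    String filter(String text) {
        if (!suppressControlChars) {
            return text;
        }
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // Character.isISOControl covers C0 (U+0000..U+001F),
            // DEL (U+007F) and C1 (U+0080..U+009F).
            boolean control = Character.isISOControl(c);
            // Assumption: tab, LF and CR are layout characters worth keeping.
            boolean layout = c == '\t' || c == '\n' || c == '\r';
            if (!control || layout) {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

With the switch on, a string containing MESSAGE WAITING (U+0095) comes back without it; with the switch off, the extracted text is returned exactly as it is today.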


Regardless of whether replacing/overriding C0/C1 control codes gets implemented, I checked what Adobe Reader outputs for the C0 and C1 control codes and ended up with the attached list.


Note that for some reason U+0007 is converted to U+0009 and U+000B to U+000A; this is easy to overlook. (I double-checked this, because at first I thought it was a mistake...)

The file is separated into C0 and C1 codes, plus two further codes for space (U+0020) and DEL (U+007F).
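
For anyone who wants to apply such a mapping in their own code, here is a rough sketch. It contains only the replacements mentioned in this issue (U+0007 to U+0009, U+000B to U+000A, and the dot that Adobe Reader wrote in place of MW according to the issue description); the remaining entries would have to be taken from the attached CSV. The class and method names are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of PDFBox.
final class AdobeLikeMapping {

    // Only the replacements mentioned in this issue; the attached list has more.
    private static final Map<Character, Character> OBSERVED = new HashMap<>();
    static {
        OBSERVED.put((char) 0x0007, (char) 0x0009); // BEL -> TAB
        OBSERVED.put((char) 0x000B, (char) 0x000A); // VT  -> LF
        OBSERVED.put((char) 0x0095, '.');           // MW  -> "." (per issue description)
    }

    // Mirrors the grouping of the attached file: C0, C1, space and DEL.
    static String categoryOf(char c) {
        if (c <= 0x1F) return "C0";
        if (c == 0x20) return "SPACE";
        if (c == 0x7F) return "DEL";
        if (c >= 0x80 && c <= 0x9F) return "C1";
        return "OTHER";
    }

    // Replaces mapped characters; everything else passes through unchanged.
    static String apply(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            Character replacement = OBSERVED.get(c);
            out.append(replacement != null ? replacement.charValue() : c);
        }
        return out.toString();
    }
}
```

Characters without an entry are deliberately left alone here, so the sketch never guesses at mappings that were not actually observed.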

Even if a feature like this is not implemented by default, the mapping list might help some people out there.



> Suppress control characters?
> ----------------------------
>
>                 Key: PDFBOX-4141
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4141
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Parsing
>            Reporter: Andreas Meier
>            Priority: Minor
>         Attachments: Mapping_default_to_adobe.csv, Test_with_MW.pdf, Test_with_MW.txt, Test_with_MW_AdobeReader_export.txt, Test_with_MW_linux.jpg, Test_without_MW.txt
>
>
> At the moment PDFBox extracts all types of characters.
> Therefore any control characters that occur are extracted as well.
> Unfortunately some of these control characters can deform the text,
> for example 'MESSAGE WAITING' (U+0095) [MW].
> I attached some files and a screenshot showing how text is printed when MESSAGE WAITING is present.
> Should PDFBox handle this type of character? Maybe suppress them in PDFTextStripper?
> I know that PDFBox works correctly in this case, but a feature to turn off or suppress special characters might produce better output than the default setting, unless some control characters are needed for further processing.
> Feedback appreciated.
> What other programs do:
> a) ignore control characters (Okular PDF Viewer - KDE)
> b) replace them (Adobe Reader wrote a dot "." in place of MW)
> Regards
> Andreas



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
