Posted to dev@tika.apache.org by "Tilman Hausherr (Jira)" <ji...@apache.org> on 2021/03/08 10:22:00 UTC

[jira] [Created] (TIKA-3314) Treat soft hyphens like hyphens

Tilman Hausherr created TIKA-3314:
-------------------------------------

             Summary: Treat soft hyphens like hyphens
                 Key: TIKA-3314
                 URL: https://issues.apache.org/jira/browse/TIKA-3314
             Project: Tika
          Issue Type: Improvement
          Components: tika-eval
    Affects Versions: 1.25
            Reporter: Tilman Hausherr
             Fix For: 2.0.0, 1.26


The next PDFBox version identifies soft hyphens (U+00AD) and returns them as such. tika-eval swallows them and therefore reports differences. This can be seen with the file attached to PDFBOX-5115, at the string "Max-Planck-Institut".

Proposed change: add the mapping

"\u00AD" => "-"

to lucene-char-mapping.txt.
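To illustrate the intended effect of that mapping, here is a minimal sketch in plain Java. It does not use Lucene or tika-eval; the class name SoftHyphenMappingSketch, the MAPPINGS table and the applyMappings helper are hypothetical stand-ins for what a character-mapping step would do with the proposed entry. The sample strings are assumptions based on the "Max-Planck-Institut" example above, not actual extractor output.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SoftHyphenMappingSketch {
    // Hypothetical stand-in for the proposed entry in lucene-char-mapping.txt:
    //   "\u00AD" => "-"
    static final Map<String, String> MAPPINGS = new LinkedHashMap<>();
    static {
        MAPPINGS.put("\u00AD", "-");
    }

    // Applies every mapping in insertion order, replacing each source
    // sequence with its target. This only imitates the effect of a
    // char-mapping filter; it is not how Lucene implements it.
    static String applyMappings(String text) {
        String result = text;
        for (Map.Entry<String, String> entry : MAPPINGS.entrySet()) {
            result = result.replace(entry.getKey(), entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed sample: text containing soft hyphens (U+00AD).
        String withSoftHyphens = "Max\u00ADPlanck\u00ADInstitut";
        // Assumed sample: the same text with ordinary hyphens.
        String withHyphens = "Max-Planck-Institut";

        // After mapping, both variants compare as equal.
        System.out.println(applyMappings(withSoftHyphens).equals(withHyphens)); // prints "true"
    }
}
```

With the mapping in place, text containing soft hyphens and text containing ordinary hyphens normalize to the same string, so a comparison would no longer flag them as different.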



--
This message was sent by Atlassian Jira
(v8.3.4#803005)