Posted to dev@tika.apache.org by "Tilman Hausherr (Jira)" <ji...@apache.org> on 2021/03/08 10:22:00 UTC
[jira] [Created] (TIKA-3314) Treat soft hyphens like hyphens
Tilman Hausherr created TIKA-3314:
-------------------------------------
Summary: Treat soft hyphens like hyphens
Key: TIKA-3314
URL: https://issues.apache.org/jira/browse/TIKA-3314
Project: Tika
Issue Type: Improvement
Components: tika-eval
Affects Versions: 1.25
Reporter: Tilman Hausherr
Fix For: 2.0.0, 1.26
The next PDFBox version identifies soft hyphens (U+00AD) and returns them as such. Tika-eval swallows them and therefore reports differences. This can be reproduced with the file attached to PDFBOX-5115, at the text "Max-Planck-Institut".
Proposed change: add
"\u00AD" => "-"
to lucene-char-mapping.txt
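To illustrate the intended effect of the mapping, here is a minimal sketch in plain Java. It does not use Tika-eval or Lucene; the class name SoftHyphenDemo, both helper methods, and the sample string are hypothetical, and the "drop" variant is only an assumption about what swallowing the character amounts to, not the actual analyzer chain.

```java
public class SoftHyphenDemo {

    // Mirrors the proposed mapping: every soft hyphen (U+00AD) becomes "-".
    static String mapSoftHyphens(String text) {
        return text.replace('\u00AD', '-');
    }

    // Assumed model of the current behaviour: the soft hyphen is simply removed.
    static String dropSoftHyphens(String text) {
        return text.replace("\u00AD", "");
    }

    public static void main(String[] args) {
        // Hypothetical extraction result containing soft hyphens.
        String extracted = "Max\u00ADPlanck\u00ADInstitut";

        System.out.println(mapSoftHyphens(extracted));   // Max-Planck-Institut
        System.out.println(dropSoftHyphens(extracted));  // MaxPlanckInstitut
    }
}
```

With the mapping, text extracted with soft hyphens compares equal to text extracted with ordinary hyphens, whereas dropping the character yields a different token.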