Posted to dev@tika.apache.org by "Todd Dixon (JIRA)" <ji...@apache.org> on 2018/03/02 17:18:00 UTC
[jira] [Created] (TIKA-2597) Attachment Extraction Case Sensitivity
Todd Dixon created TIKA-2597:
--------------------------------
Summary: Attachment Extraction Case Sensitivity
Key: TIKA-2597
URL: https://issues.apache.org/jira/browse/TIKA-2597
Project: Tika
Issue Type: Bug
Components: app
Affects Versions: 1.17
Environment: Windows
Reporter: Todd Dixon
Using the --extract option on a PDF with embedded files, I am seeing that not all of the attachments are extracted. Several of the embedded files have the same name. Names that are exactly the same are handled by adding a suffix of (1) etc. However, when two names differ only in case, the parser does not add the suffix, and so one file overwrites the other on disk. Example:
FW Letter,.msg
FW letter.msg
This results in only one attachment being extracted. Would it be possible to update the filename comparison to account for Windows file systems, which treat those two names as the same?
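To illustrate the kind of comparison being asked for, here is a minimal sketch in Java. It is not Tika's actual code: the class CaseInsensitiveNamer and its methods are hypothetical, the placement of the (1) suffix before the extension is an assumption, and lower-casing with Locale.ROOT only approximates how Windows file systems fold case.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Tika: hands out output file names that
// remain distinct even when the file system ignores case.
final class CaseInsensitiveNamer {
    // Lower-cased forms of every name already handed out.
    private final Set<String> used = new HashSet<>();

    String uniqueName(String name) {
        String candidate = name;
        int n = 0;
        // Set.add returns false when the lower-cased name is already taken,
        // so keep trying (1), (2), ... until one is free.
        while (!used.add(candidate.toLowerCase(Locale.ROOT))) {
            n++;
            candidate = addSuffix(name, n);
        }
        return candidate;
    }

    // Assumed suffix style: insert "(n)" before the extension, if there is one.
    private static String addSuffix(String name, int n) {
        int dot = name.lastIndexOf('.');
        if (dot <= 0) {
            return name + "(" + n + ")";
        }
        return name.substring(0, dot) + "(" + n + ")" + name.substring(dot);
    }
}
```

With this sketch, asking for "Report.msg" and then "report.msg" yields "Report.msg" and "report(1).msg", so the second file no longer overwrites the first on a case-insensitive file system.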
Thanks!
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)