Posted to dev@tika.apache.org by "beamliu (Jira)" <ji...@apache.org> on 2022/04/05 04:29:00 UTC

[jira] [Created] (TIKA-3714) cannot retrieve file correctly which contains non ascii char in path

beamliu created TIKA-3714:
-----------------------------

             Summary: cannot retrieve file correctly which contains non ascii char in path
                 Key: TIKA-3714
                 URL: https://issues.apache.org/jira/browse/TIKA-3714
             Project: Tika
          Issue Type: Bug
          Components: server
    Affects Versions: 2.3.0
            Reporter: beamliu


Steps to reproduce:

Call the REST endpoint to detect the media type of a file that exists in the file system:
{code:java}
curl --verbose -X PUT http://localhost:9998/detect/stream -H "fetcherName: minio-data" -H "fetchKey: 中文.docx"
{code}
The fetchKey header is not processed correctly, however. Its value does not reach the server intact, so the lookup fails with a FileNotFound exception.

According to the HTTP/1.1 RFC, non-US-ASCII characters cannot be sent reliably in HTTP headers, but the current mechanism in tika-pipes (https://cwiki.apache.org/confluence/display/TIKA/tika-pipes#FileSystemEmitter) uses HTTP headers to carry the file path information. File paths very commonly contain non-ASCII characters.
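
To illustrate the problem, here is a minimal Python sketch. It does not reproduce what tika-server does internally. It only shows that the key from the curl call is not US-ASCII, and one plausible failure mode, assumed here for illustration: the client sends UTF-8 bytes and the receiver decodes the header as ISO-8859-1, so the key that arrives differs from the one that was sent. The helper name ascii_safe is hypothetical.

```python
# Sketch only: why a non-ASCII fetchKey is fragile as an HTTP header value.
key = "中文.docx"


def ascii_safe(value: str) -> bool:
    """Return True if the value can be represented in US-ASCII."""
    try:
        value.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


# Assumed failure mode: UTF-8 bytes on the wire, decoded as ISO-8859-1.
garbled = key.encode("utf-8").decode("iso-8859-1")

print(ascii_safe(key))   # False
print(garbled == key)    # False: the receiver would look up a different path
```

If the fetcher then resolves the garbled string against the file system or object store, a missing-file error is the expected outcome.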

 

Suggestion: support HTTP request parameters for fetcherName and fetchKey. HTTP parameters can carry non-ASCII characters correctly.
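
As a rough sketch of why parameters would work, the Python snippet below percent-encodes the two values into a query string and decodes them again. The query parameter names simply mirror the existing header names and are an assumption. The endpoint is the one from the curl call above, and nothing here implies the server already accepts these parameters.

```python
from urllib.parse import parse_qs, urlencode

# Hypothetical query parameters mirroring the current header names.
params = {"fetcherName": "minio-data", "fetchKey": "中文.docx"}

# urlencode percent-encodes the UTF-8 bytes, so the result is pure ASCII.
query = urlencode(params)
url = "http://localhost:9998/detect/stream?" + query

# A receiver that decodes the query string as UTF-8 recovers the original key.
decoded = parse_qs(query)

print(query)
print(decoded["fetchKey"][0] == params["fetchKey"])
```

Because the encoded URL contains only ASCII characters, it does not depend on how intermediaries or the server interpret raw header bytes.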



--
This message was sent by Atlassian Jira
(v8.20.1#820001)