Posted to dev@tika.apache.org by "Tim Allison (Jira)" <ji...@apache.org> on 2022/10/04 22:05:00 UTC

[jira] [Resolved] (TIKA-3864) Non-ascii UTF-8 characters in fetchKey not working with FileSystemFetcher

     [ https://issues.apache.org/jira/browse/TIKA-3864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison resolved TIKA-3864.
-------------------------------
    Fix Version/s: 2.5.1
       Resolution: Fixed

> Non-ascii UTF-8 characters in fetchKey not working with FileSystemFetcher
> -------------------------------------------------------------------------
>
>                 Key: TIKA-3864
>                 URL: https://issues.apache.org/jira/browse/TIKA-3864
>             Project: Tika
>          Issue Type: Bug
>          Components: tika-pipes, tika-server
>    Affects Versions: 2.4.1
>         Environment: debian:bullseye docker container running tika-server-standard-2.4.1jar
>            Reporter: Tong Wang
>            Priority: Major
>             Fix For: 2.5.1
>
>
> When using FileSystemFetcher, if the fetchKey contains non-ASCII characters, Tika Server throws an exception because the file name it resolves is incorrect. Here is an example:
> {code:java}
> curl -v -X PUT http://tika:9998/rmeta/text --header "fetcherName: restricted" --header "fetchKey: 中文.txt" {code}
> I get a java.nio.file.NoSuchFileException:
> {code:java}
> Caused by: java.nio.file.NoSuchFileException: /restricted/ä¸æ–‡.txt
>     at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
>     at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
>     at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116)
>     at java.base/sun.nio.fs.UnixPath.toRealPath(UnixPath.java:860)
>     at org.apache.tika.pipes.fetcher.fs.FileSystemFetcher.fetch(FileSystemFetcher.java:64)
>     at org.apache.tika.server.core.FetcherStreamFactory.getInputStream(FetcherStreamFactory.java:90)
>     at org.apache.tika.server.core.resource.TikaResource.getInputStream(TikaResource.java:159)
> {code}
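The garbled file name in the exception above is consistent with the UTF-8 bytes of the header value being decoded with a single-byte charset. The following is a minimal sketch of that effect, assuming ISO-8859-1 as the wrong charset. The class and method names are hypothetical and are not taken from Tika, and this is not necessarily how the issue was fixed.

```java
import java.nio.charset.StandardCharsets;

class HeaderMojibakeDemo {
    // Simulates a header value whose UTF-8 bytes were decoded as ISO-8859-1
    // (an assumption about what happened on the server side).
    static String misdecode(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Reverses the mis-decoding. This is lossless because ISO-8859-1
    // maps every byte value to exactly one char.
    static String repair(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String key = "\u4e2d\u6587.txt"; // the fetchKey from the report
        String garbled = misdecode(key);
        // Two 3-byte characters become six chars, plus ".txt".
        System.out.println(garbled.length());            // prints 10
        System.out.println(repair(garbled).equals(key)); // prints true
    }
}
```

If this is what happens, the file name that reaches the file system is ten characters long instead of six, so the lookup fails even though the file exists.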
>  
> When I try to percent-encode the characters:
> {code:java}
> curl -v -X PUT http://tika:9998/rmeta/text --header "fetcherName: restricted" --header "fetchKey: %E4%B8%AD%E6%96%87.txt" {code}
> I still get a java.nio.file.NoSuchFileException:
> {code:java}
> Caused by: java.nio.file.NoSuchFileException: /restricted/%E4%B8%AD%E6%96%87.txt
>     at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
>     at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
>     at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116)
>     at java.base/sun.nio.fs.UnixPath.toRealPath(UnixPath.java:860)
>     at org.apache.tika.pipes.fetcher.fs.FileSystemFetcher.fetch(FileSystemFetcher.java:64)
>     at org.apache.tika.server.core.FetcherStreamFactory.getInputStream(FetcherStreamFactory.java:90)
>     at org.apache.tika.server.core.resource.TikaResource.getInputStream(TikaResource.java:159)
> {code}
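The path in this second exception shows the percent-encoded key being used verbatim as a file name. For comparison, here is a minimal sketch of the decoding step a client sending such a key would expect, using the JDK's `java.net.URLDecoder` (the `Charset` overload needs Java 10 or later). The class and method names are hypothetical, and whether Tika decodes the header this way is not stated in the report.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

class FetchKeyDecodeSketch {
    // Decodes a percent-encoded header value as UTF-8.
    // Caveat: URLDecoder follows form-encoding rules, so a literal '+'
    // in the input is turned into a space.
    static String decodeFetchKey(String rawHeaderValue) {
        return URLDecoder.decode(rawHeaderValue, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String decoded = decodeFetchKey("%E4%B8%AD%E6%96%87.txt");
        System.out.println(decoded.equals("\u4e2d\u6587.txt")); // prints true
    }
}
```

Percent-encoding keeps the header value pure ASCII on the wire, which sidesteps any disagreement between client and server about the header charset.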
> BTW, locale is set to C.UTF-8 on Tika Server:
> {code:java}
> # locale
> LANG=C.UTF-8
> LANGUAGE=
> LC_CTYPE="C.UTF-8"
> LC_NUMERIC="C.UTF-8"
> LC_TIME="C.UTF-8"
> LC_COLLATE="C.UTF-8"
> LC_MONETARY="C.UTF-8"
> LC_MESSAGES="C.UTF-8"
> LC_PAPER="C.UTF-8"
> LC_NAME="C.UTF-8"
> LC_ADDRESS="C.UTF-8"
> LC_TELEPHONE="C.UTF-8"
> LC_MEASUREMENT="C.UTF-8"
> LC_IDENTIFICATION="C.UTF-8"
> LC_ALL= {code}
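The OS locale shown above is only one input. What matters to the server is the set of encodings the JVM actually picked up, and the locale would not by itself change how an HTTP layer decodes header bytes. A small diagnostic sketch for checking the JVM side follows. The class name is hypothetical, and `sun.jnu.encoding` is an implementation-specific property of OpenJDK-based JVMs that may be absent elsewhere.

```java
import java.nio.charset.Charset;

class JvmEncodingReport {
    // Collects the charset settings the running JVM resolved at startup.
    static String describe() {
        return "default charset = " + Charset.defaultCharset()
            + ", file.encoding = " + System.getProperty("file.encoding")
            // Implementation-specific; used by OpenJDK for file name encoding.
            + ", sun.jnu.encoding = " + System.getProperty("sun.jnu.encoding");
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running this inside the same container as Tika Server would show whether the JVM agrees with the `locale` output.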
>  


