Posted to dev@tika.apache.org by "Tim Allison (Jira)" <ji...@apache.org> on 2022/10/04 22:05:00 UTC
[jira] [Resolved] (TIKA-3864) Non-ascii UTF-8 characters in fetchKey not working with FileSystemFetcher
[ https://issues.apache.org/jira/browse/TIKA-3864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Allison resolved TIKA-3864.
-------------------------------
Fix Version/s: 2.5.1
Resolution: Fixed
> Non-ascii UTF-8 characters in fetchKey not working with FileSystemFetcher
> -------------------------------------------------------------------------
>
> Key: TIKA-3864
> URL: https://issues.apache.org/jira/browse/TIKA-3864
> Project: Tika
> Issue Type: Bug
> Components: tika-pipes, tika-server
> Affects Versions: 2.4.1
> Environment: debian:bullseye docker container running tika-server-standard-2.4.1.jar
> Reporter: Tong Wang
> Priority: Major
> Fix For: 2.5.1
>
>
> When using FileSystemFetcher, if the fetchKey contains non-ASCII characters, Tika Server throws an exception because the resolved file name is incorrect. Here is an example:
> {code:java}
> curl -v -X PUT http://tika:9998/rmeta/text --header "fetcherName: restricted" --header "fetchKey: 中文.txt" {code}
> I get java.nio.file.NoSuchFileException:
> {code:java}
> Caused by: java.nio.file.NoSuchFileException: /restricted/ä¸æ.txt
>     at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
>     at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
>     at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116)
>     at java.base/sun.nio.fs.UnixPath.toRealPath(UnixPath.java:860)
>     at org.apache.tika.pipes.fetcher.fs.FileSystemFetcher.fetch(FileSystemFetcher.java:64)
>     at org.apache.tika.server.core.FetcherStreamFactory.getInputStream(FetcherStreamFactory.java:90)
>     at org.apache.tika.server.core.resource.TikaResource.getInputStream(TikaResource.java:159) {code}
>
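The garbled name in the trace above is consistent with the UTF-8 bytes of the fetchKey (E4 B8 AD E6 96 87) being decoded as ISO-8859-1 somewhere between the HTTP header and the file system call. In ISO-8859-1, AD, 96 and 87 map to a soft hyphen and two C1 control characters, which would explain why only "ä¸æ" is visible. The following is a minimal sketch of that effect and of how it can be reversed. The class and method names are hypothetical, and this is not taken from the TIKA-3864 patch.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Tika.
class HeaderMojibakeDemo {

    // Simulates a receiver that interprets UTF-8 header bytes as ISO-8859-1.
    static String decodeAsLatin1(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Reverses the mis-decoding. ISO-8859-1 maps each byte to exactly one
    // char, so the original bytes can be recovered without loss.
    static String recoverUtf8(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "\u4e2d\u6587" is the two-character name from the report.
        String garbled = decodeAsLatin1("\u4e2d\u6587.txt");
        System.out.println(garbled.length()); // 6 bytes for the name + ".txt"
        System.out.println(recoverUtf8(garbled).equals("\u4e2d\u6587.txt"));
    }
}
```

The round trip only works because every byte value is a valid ISO-8859-1 character; it would not be reliable if the header had been decoded with a charset that rejects or replaces some bytes.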
> When I try to percent-encode the characters:
> {code:java}
> curl -v -X PUT http://tika:9998/rmeta/text --header "fetcherName: restricted" --header "fetchKey: %E4%B8%AD%E6%96%87.txt" {code}
> I still get a java.nio.file.NoSuchFileException:
> {code:java}
> Caused by: java.nio.file.NoSuchFileException: /restricted/%E4%B8%AD%E6%96%87.txt
>     at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
>     at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
>     at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116)
>     at java.base/sun.nio.fs.UnixPath.toRealPath(UnixPath.java:860)
>     at org.apache.tika.pipes.fetcher.fs.FileSystemFetcher.fetch(FileSystemFetcher.java:64)
>     at org.apache.tika.server.core.FetcherStreamFactory.getInputStream(FetcherStreamFactory.java:90)
>     at org.apache.tika.server.core.resource.TikaResource.getInputStream(TikaResource.java:159) {code}
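The second trace shows the percent-encoded key reaching the file system verbatim, which suggests that nothing on that code path decoded it. For reference, here is a small sketch of the encode/decode round trip that would be needed for this approach to work, using the standard library only. The class and method names are hypothetical, the `Charset` overloads assume Java 10 or later, and this is not a description of how Tika handles the header.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Tika.
class FetchKeyPercentDemo {

    // Client side: turn a non-ASCII key into an ASCII-only header value.
    static String encode(String key) {
        return URLEncoder.encode(key, StandardCharsets.UTF_8);
    }

    // Server side: restore the original key before resolving a path.
    static String decode(String headerValue) {
        return URLDecoder.decode(headerValue, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String encoded = encode("\u4e2d\u6587.txt");
        System.out.println(encoded); // ASCII-only form of the key
        System.out.println(decode(encoded).equals("\u4e2d\u6587.txt"));
    }
}
```

Note that `URLEncoder` and `URLDecoder` implement form encoding, so a space becomes `+` and a literal `+` in a key is decoded as a space; keys containing those characters would need extra care.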
> BTW, locale is set to C.UTF-8 on Tika Server:
> {code:java}
> # locale
> LANG=C.UTF-8
> LANGUAGE=
> LC_CTYPE="C.UTF-8"
> LC_NUMERIC="C.UTF-8"
> LC_TIME="C.UTF-8"
> LC_COLLATE="C.UTF-8"
> LC_MONETARY="C.UTF-8"
> LC_MESSAGES="C.UTF-8"
> LC_PAPER="C.UTF-8"
> LC_NAME="C.UTF-8"
> LC_ADDRESS="C.UTF-8"
> LC_TELEPHONE="C.UTF-8"
> LC_MEASUREMENT="C.UTF-8"
> LC_IDENTIFICATION="C.UTF-8"
> LC_ALL= {code}
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)