Posted to jira@arrow.apache.org by "Alessandro Molina (Jira)" <ji...@apache.org> on 2022/10/24 13:27:00 UTC

[jira] [Commented] (ARROW-18123) [Python] Cannot use multi-byte characters in file names

    [ https://issues.apache.org/jira/browse/ARROW-18123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17623170#comment-17623170 ] 

Alessandro Molina commented on ARROW-18123:
-------------------------------------------

The documentation states
{code:java}
the argument can be a pathlib.Path object, or a string describing an absolute local path. {code}
*absolute local path* is the key here:
{code:python}
>>> f = pyarrow.fs.FileSystem.from_uri("/home/amol/ARROW/arrow/python/例.pippo")
>>> f
(<pyarrow._fs.LocalFileSystem object at 0xffff9e909470>, '/home/amol/ARROW/arrow/python/例.pippo')
>>> f[0].open_input_file(f[1]).read()
b''{code}
 

If you want to use a relative local path, you can rely on {{pathlib.Path}} for that:
{code:python}
>>> f = pyarrow.fs.FileSystem.from_uri(pathlib.Path("例.pippo"))
>>> f
(<pyarrow._fs.LocalFileSystem object at 0xffffb3333270>, '/home/amol/ARROW/arrow/python/例.pippo')
>>> f[0].open_input_file(f[1])
<pyarrow.NativeFile closed=False own_file=False is_seekable=True is_writable=False is_readable=True>
>>> f[0].open_input_file(f[1]).read()
b''
{code}
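For string inputs, one possible workaround is to make the path absolute before handing it to pyarrow. The following is a minimal sketch using only the standard library; the helper name {{to_absolute_local_path}} is hypothetical and not part of pyarrow, and the pyarrow call in the trailing comment is an assumed usage that is not run here.

```python
import os
import pathlib


def to_absolute_local_path(path):
    """Return an absolute path string for a possibly relative local path.

    Hypothetical helper for illustration; not part of pyarrow.
    """
    # os.fspath() accepts both str and pathlib.Path. absolute() anchors a
    # relative path at the current working directory without touching the
    # file system, so the file does not need to exist yet.
    return str(pathlib.Path(os.fspath(path)).absolute())


abs_path = to_absolute_local_path("例.parquet")

# Assumed usage (requires pyarrow, not executed in this sketch):
#   pq.write_table(table, abs_path)
```

{{absolute()}} is used rather than {{resolve()}} so that symlinks are left untouched and only the working directory is prepended.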
 

Trying to use an actual URI (with the {{file://}} scheme) results in an error, by the way, and that should probably be supported too:
{code:python}
>>> f = pyarrow.fs.FileSystem.from_uri("file:///home/amol/ARROW/arrow/python/例.pippo")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/_fs.pyx", line 463, in pyarrow._fs.FileSystem.from_uri
    return FileSystem.wrap(GetResultValue(result)), frombytes(c_path)
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
    return check_status(status)
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
    raise ArrowInvalid(message)
pyarrow.lib.ArrowInvalid: Cannot parse URI: 'file:///home/amol/ARROW/arrow/python/例.pippo'{code}
As URIs are expected to be percent-encoded, I tried percent-encoding the provided URI. The URI then parses as expected, but because the file path is not decoded afterwards, the lookup fails with {{NotFound}} errors.
{code:python}
>>> f = pyarrow.fs.FileSystem.from_uri("file:///home/amol/ARROW/arrow/python/%E4%BE%8B.pippo")
>>> f
(<pyarrow._fs.LocalFileSystem object at 0xffff9ee445f0>, '/home/amol/ARROW/arrow/python/%E4%BE%8B.pippo')
>>> f[0].get_file_info(f[1])
<FileInfo for '/home/amol/ARROW/arrow/python/%E4%BE%8B.pippo': type=FileType.NotFound>
>>> f[0].open_input_file(f[1])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/_fs.pyx", line 763, in pyarrow._fs.FileSystem.open_input_file
    in_handle = GetResultValue(self.fs.OpenInputFile(pathstr))
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
    return check_status(status)
  File "pyarrow/error.pxi", line 113, in pyarrow.lib.check_status
    raise IOError(errno, message)
FileNotFoundError: [Errno 2] Failed to open local file '/home/amol/ARROW/arrow/python/%E4%BE%8B.pippo'. Detail: [errno 2] No such file or directory {code}
This is probably something we want to fix.
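
To illustrate the round trip that seems to be missing, here is a minimal standard-library sketch of percent-encoding a local path into a {{file://}} URI and decoding it back. It assumes POSIX-style paths with no host component; the helper names are hypothetical and not part of pyarrow, and this is not a description of how pyarrow parses URIs internally.

```python
from urllib.parse import quote, unquote, urlsplit


def local_path_to_file_uri(path):
    """Percent-encode an absolute POSIX path into a file:// URI (hypothetical helper)."""
    # quote() encodes the text as UTF-8 and percent-encodes every byte
    # outside the unreserved set; "/" is kept so path separators survive.
    return "file://" + quote(path, safe="/")


def file_uri_to_local_path(uri):
    """Decode a file:// URI back into a local path (hypothetical helper)."""
    parts = urlsplit(uri)
    if parts.scheme != "file":
        raise ValueError(f"not a file URI: {uri!r}")
    # unquote() reverses the percent-encoding, decoding the bytes as UTF-8.
    return unquote(parts.path)


uri = local_path_to_file_uri("/home/amol/ARROW/arrow/python/例.pippo")
path = file_uri_to_local_path(uri)
```

Here {{uri}} is the percent-encoded form used in the session above, and {{path}} is the original file name again. Decoding on the caller side like this, then passing the decoded absolute path, may serve as a stopgap.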

 

> [Python] Cannot use multi-byte characters in file names
> -------------------------------------------------------
>
>                 Key: ARROW-18123
>                 URL: https://issues.apache.org/jira/browse/ARROW-18123
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 9.0.0
>            Reporter: SHIMA Tatsuya
>            Priority: Major
>
> Error when specifying a file path containing multi-byte characters in {{pyarrow.parquet.write_table}}.
> For example, use {{例.parquet}} as the file path.
> {code:python}
> Python 3.10.7 (main, Oct  5 2022, 14:33:54) [GCC 10.2.1 20210110] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import pandas as pd
> >>> import numpy as np
> >>> import pyarrow as pa
> >>> df = pd.DataFrame({'one': [-1, np.nan, 2.5],
> ...                    'two': ['foo', 'bar', 'baz'],
> ...                    'three': [True, False, True]},
> ...                    index=list('abc'))
> >>> table = pa.Table.from_pandas(df)
> >>> import pyarrow.parquet as pq
> >>> pq.write_table(table, '例.parquet')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File
> "/home/vscode/.local/lib/python3.10/site-packages/pyarrow/parquet/__init__.py",
> line 2920, in write_table
>     with ParquetWriter(
>   File
> "/home/vscode/.local/lib/python3.10/site-packages/pyarrow/parquet/__init__.py",
> line 911, in __init__
>     filesystem, path = _resolve_filesystem_and_path(
>   File "/home/vscode/.local/lib/python3.10/site-packages/pyarrow/fs.py", line
> 184, in _resolve_filesystem_and_path
>     filesystem, path = FileSystem.from_uri(path)
>   File "pyarrow/_fs.pyx", line 463, in pyarrow._fs.FileSystem.from_uri
>   File "pyarrow/error.pxi", line 144, in
> pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Cannot parse URI: '例.parquet'
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)