Posted to issues@arrow.apache.org by "Wes McKinney (JIRA)" <ji...@apache.org> on 2017/08/01 19:11:00 UTC

[jira] [Commented] (ARROW-1306) [Python] Encoding? issue with error reporting for parquet.read_table

    [ https://issues.apache.org/jira/browse/ARROW-1306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16109551#comment-16109551 ] 

Wes McKinney commented on ARROW-1306:
-------------------------------------

This is happening because Windows unicode file names have to be encoded to UTF-16LE (see https://github.com/apache/arrow/blob/master/python/pyarrow/compat.py#L133). This definitely should be fixed; marked for 0.6.0. I'm not sure of the right fix without taking a deeper look: since the Arrow file APIs take {{std::string}}, they aren't aware of the encoded file name when they generate the error message. One way to handle it might be to add some kind of auxiliary data structure that has both the platform-encoded path and a UTF-8 path. On Linux/macOS they'll be the same, but we can use the UTF-8 version for making error messages.
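
To make the idea concrete, here is a rough Python sketch, not Arrow code. The first part only reproduces the symptom: reading the ASCII bytes of the file name as UTF-16LE yields CJK-looking text like that in the report, although it makes no claim about the exact code path inside Arrow. The second part sketches the auxiliary structure suggested above. The names {{decode_as_utf16le}}, {{FilePath}}, {{from_str}} and {{open_error}} are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


def decode_as_utf16le(raw: bytes) -> str:
    """Decode bytes as UTF-16LE, dropping a trailing odd byte if present."""
    usable = len(raw) - (len(raw) % 2)
    return raw[:usable].decode("utf-16-le")


@dataclass(frozen=True)
class FilePath:
    """Hypothetical holder for both forms of a path (not an Arrow API)."""

    utf8: bytes    # always UTF-8; used only to build error messages
    native: bytes  # platform encoding; what would be handed to the OS

    @classmethod
    def from_str(cls, path: str, windows: bool) -> "FilePath":
        utf8 = path.encode("utf-8")
        # Assumption: UTF-16LE on Windows, UTF-8 everywhere else.
        native = path.encode("utf-16-le") if windows else utf8
        return cls(utf8=utf8, native=native)

    def open_error(self) -> str:
        # The message is built from the UTF-8 form, never the native one.
        return "Failed to open file: " + self.utf8.decode("utf-8")


name = "non_existent_file.parquet"

# Symptom: each pair of ASCII bytes collapses into one 16-bit code unit.
garbled = decode_as_utf16le(name.encode("utf-8"))
print(garbled)  # compare with the file name in the reported error

# Sketch of the fix: the two forms differ on Windows only.
win = FilePath.from_str(name, windows=True)
posix = FilePath.from_str(name, windows=False)
print(win.open_error())
```

With a structure like this, the Windows and Linux/macOS code paths would differ only in {{native}}, while the error text would be the same on every platform.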

> [Python] Encoding? issue with error reporting for parquet.read_table
> --------------------------------------------------------------------
>
>                 Key: ARROW-1306
>                 URL: https://issues.apache.org/jira/browse/ARROW-1306
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.5.0
>            Reporter: Chris Bartak
>             Fix For: 0.6.0
>
>
> This only affects error reporting: the filename in the exception for a file that is not found is somehow getting garbled. Example below:
> {code}
> import pyarrow.parquet as pq
> pq.read_table('non_existent_file.parquet')
> ---------------------------------------------------------------------------
> ArrowIOError                              Traceback (most recent call last)
> ----> 1 pq.read_table('non_existent_file.parquet')
> ~\AppData\Local\Continuum\Anaconda3\envs\py36\lib\site-packages\pyarrow\parquet.py in read_table(source, columns, nthreads, metadata, use_pandas_metadata)
>     709                                    metadata=metadata)
>     710 
> --> 711     pf = ParquetFile(source, metadata=metadata)
>     712     return pf.read(columns=columns, nthreads=nthreads,
>     713                    use_pandas_metadata=use_pandas_metadata)
> ~\AppData\Local\Continuum\Anaconda3\envs\py36\lib\site-packages\pyarrow\parquet.py in __init__(self, source, metadata, common_metadata)
>      52     def __init__(self, source, metadata=None, common_metadata=None):
>      53         self.reader = ParquetReader()
> ---> 54         self.reader.open(source, metadata=metadata)
>      55         self.common_metadata = common_metadata
>      56 
> _parquet.pyx in pyarrow._parquet.ParquetReader.open()
> io.pxi in pyarrow.lib.get_reader()
> io.pxi in pyarrow.lib.memory_map()
> io.pxi in pyarrow.lib.MemoryMappedFile._open()
> error.pxi in pyarrow.lib.check_status()
> ArrowIOError: IOError: Failed to open file: 潮彮硥獩整瑮晟汩⹥慰煲敵
> {code}
> Versions - Python 3.6, Windows x64
> {code}
>     arrow-cpp:   0.5.0-np112py36_vc14_1 conda-forge [vc14]
>     parquet-cpp: 1.2.0.pre-vc14_3       conda-forge [vc14]
>     pyarrow:     0.5.0-np112py36_vc14_0 conda-forge [vc14]
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)