Posted to issues@arrow.apache.org by "Micah Kornfield (Jira)" <ji...@apache.org> on 2019/09/26 03:03:00 UTC
[jira] [Updated] (ARROW-4883) [Python] read_csv() returns garbage if given file object in text mode

     [ https://issues.apache.org/jira/browse/ARROW-4883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Micah Kornfield updated ARROW-4883:
-----------------------------------
    Fix Version/s: 0.15.0

> [Python] read_csv() returns garbage if given file object in text mode
> ---------------------------------------------------------------------
>
>                 Key: ARROW-4883
>                 URL: https://issues.apache.org/jira/browse/ARROW-4883
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.12.1
>         Environment: Python: 3.7.2, 2.7.15
> PyArrow: 0.12.1
> OS: MacOS 10.13.6 (High Sierra)
>            Reporter: Diego Argueta
>            Priority: Major
>              Labels: csv
>             Fix For: 0.15.0
>
>
> h1. Summary:
> Python 3:
> * {{read_csv}} returns mojibake if given file objects opened in text mode. It behaves as expected in binary mode.
> * Files encoded in anything other than valid UTF-8 will cause a crash.
> Python 2:
> * {{read_csv}} only handles ASCII files. If given a UTF-8 file containing characters above U+007F, it crashes.
> h1. To reproduce:
> 1) Create a CSV file like this:
> {code}
> Header
> 123.45
> {code}
> 2) Then run this code on Python 3:
> {code:python}
> >>> import pyarrow.csv as pa_csv
> >>> pa_csv.read_csv(open('test.csv', 'r'))
> pyarrow.Table
> 䧢: string
> {code}
> Notice the file object is opened in text mode. Changing the encoding doesn't help:
> {code:python}
> >>> pa_csv.read_csv(open('test.csv', 'r', encoding='utf-8'))
> pyarrow.Table
> 䧢: string
> >>> pa_csv.read_csv(open('test.csv', 'r', encoding='ascii'))
> pyarrow.Table
> 䧢: string
> >>> pa_csv.read_csv(open('test.csv', 'r', encoding='iso-8859-1'))
> pyarrow.Table
> 䧢: string
> {code}
> If I open the file in binary mode it works:
> {code:python}
> >>> pa_csv.read_csv(open('test.csv', 'rb'))
> pyarrow.Table
> Header: double
> {code}
> I tried this with a file encoded in UTF-16, and displaying the resulting table raised a {{UnicodeDecodeError}}:
> {code}
> Traceback (most recent call last):
>   File "<redacted>/.pyenv/versions/3.7.2/lib/python3.7/site-packages/ptpython/repl.py", line 84, in _process_text
>     self._execute(line)
>   File "<redacted>/.pyenv/versions/3.7.2/lib/python3.7/site-packages/ptpython/repl.py", line 139, in _execute
>     result_str = '%s\n' % repr(result).decode('utf-8')
>   File "pyarrow/table.pxi", line 960, in pyarrow.lib.Table.__repr__
>   File "pyarrow/types.pxi", line 903, in pyarrow.lib.Schema.__str__
>   File "<redacted>/.pyenv/versions/3.7.2/lib/python3.7/site-packages/pyarrow/compat.py", line 143, in frombytes
>     return o.decode('utf8')
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
> 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
> {code}
> Presumably this is because the code always assumes the file is in UTF-8.
> h2. Python 2 behavior
> Python 2 behaves differently: it uses the ASCII codec by default, so when handed a UTF-8 encoded file, {{read_csv}} returns without an error. Accessing the table then fails:
> {code}
> >>> t = pa_csv.read_csv(open('/Users/diegoargueta/Desktop/test.csv', 'r'))
> >>> list(t)
> Traceback (most recent call last):
>   File "/<redacted>/.pyenv/versions/2.7.15/envs/gds/lib/python2.7/site-packages/ptpython/repl.py", line 84, in _process_text
>     self._execute(line)
>   File "<redacted>/.pyenv/versions/2.7.15/envs/gds/lib/python2.7/site-packages/ptpython/repl.py", line 139, in _execute
>     result_str = '%s\n' % repr(result).decode('utf-8')
>   File "pyarrow/table.pxi", line 387, in pyarrow.lib.Column.__repr__
>     result.write('\n{}'.format(str(self.data)))
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 11: ordinal not in range(128)
> 'ascii' codec can't decode byte 0xe4 in position 11: ordinal not in range(128)
> {code}
> h1. Expectation
> We should be able to hand {{read_csv()}} a file object opened in text mode, so that the CSV file can be in any text encoding.
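
Until text-mode input is supported, one possible workaround is to let Python decode the file and then hand read_csv() a binary buffer of UTF-8 bytes, since the report shows binary-mode input working. The following is only a minimal sketch, assuming Python 3 and a file small enough to read into memory. The helper names as_utf8_binary and read_csv_from_text are hypothetical and not part of PyArrow.

```python
import io


def as_utf8_binary(text_file):
    """Return an in-memory binary stream with the UTF-8 encoding of text_file.

    Hypothetical helper: text_file is any file object opened in text mode,
    so read() returns already-decoded str regardless of the source encoding.
    The whole file is read into memory.
    """
    return io.BytesIO(text_file.read().encode("utf-8"))


def read_csv_from_text(text_file):
    """Hypothetical wrapper: parse a text-mode file object with pyarrow.csv."""
    # Imported lazily so as_utf8_binary stays usable without PyArrow installed.
    import pyarrow.csv as pa_csv

    return pa_csv.read_csv(as_utf8_binary(text_file))
```

With these helpers, a UTF-16 file could be read as read_csv_from_text(open('test.csv', 'r', encoding='utf-16')). The trade-off is an extra in-memory copy of the data, so for large files it may be preferable to transcode in chunks instead.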


