Posted to issues@arrow.apache.org by "Phillip Cloud (JIRA)" <ji...@apache.org> on 2018/01/09 16:45:00 UTC
[jira] [Commented] (ARROW-1976) Handling unicode pandas columns on pq.read_table
[ https://issues.apache.org/jira/browse/ARROW-1976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16318720#comment-16318720 ]
Phillip Cloud commented on ARROW-1976:
--------------------------------------
Note that this is specific to Python 2. You won't run into issues like this on Python 3.
If nothing restricts which version of Python you can use, please use Python 3.
That said, since we have to support Python 2, this is a bug.
How is it possible to read in Unicode from a CSV file without specifying an encoding to {{read_csv}}? Pandas must make an assumption about the encoding or choose a default.
I've also submitted a data issue to data.gov to request that they include the encoding in the metadata.
https://www.data.gov/issue/request-id/635154
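On the encoding question: the u'\ufeff' in the traceback quoted below is a byte order mark (BOM), which suggests the CSV was saved as UTF-8 with a BOM and that the BOM ended up in the first column name. The following is only a minimal sketch of that effect using the standard library; the header bytes are made up for illustration and are not taken from the actual dataset.

```python
import codecs

# Hypothetical first line of a CSV file saved as UTF-8 with a BOM.
raw = codecs.BOM_UTF8 + b"UNITID,OPEID,INSTNM\n"

# Decoding as plain UTF-8 keeps the BOM as the character u'\ufeff'.
plain = raw.decode("utf-8")

# The 'utf-8-sig' codec strips a leading BOM if one is present.
stripped = raw.decode("utf-8-sig")

first_plain = plain.split(u",")[0]
first_stripped = stripped.split(u",")[0]

print(repr(first_plain))
print(repr(first_stripped))
```

If the file does carry a BOM, passing encoding='utf-8-sig' to {{read_csv}} may be a workaround on the user's side, though I have not verified that against this dataset.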
> Handling unicode pandas columns on pq.read_table
> ------------------------------------------------
>
> Key: ARROW-1976
> URL: https://issues.apache.org/jira/browse/ARROW-1976
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.8.0
> Reporter: Simbarashe Nyatsanga
>
> Unicode columns in pandas DataFrames aren't being handled correctly for some datasets when reading a parquet file into a pandas DataFrame, leading to the common Python ASCII encoding error.
>
> The dataset used to get the error is here: https://catalog.data.gov/dataset/college-scorecard
> {code}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> df = pd.read_csv('college_data.csv')
> {code}
> To verify, the DataFrame's column names are indeed unicode:
> {code}
> df.columns
> > Index([u'UNITID', u'OPEID', u'OPEID6', u'INSTNM', u'CITY', u'STABBR',
> u'INSTURL', u'NPCURL', u'HCM2', u'PREDDEG',
> ...
> u'RET_PTL4', u'PCTFLOAN', u'UG25ABV', u'MD_EARN_WNE_P10', u'GT_25K_P6',
> u'GRAD_DEBT_MDN_SUPP', u'GRAD_DEBT_MDN10YR_SUPP', u'RPY_3YR_RT_SUPP',
> u'C150_L4_POOLED_SUPP', u'C150_4_POOLED_SUPP'],
> dtype='object', length=123)
> {code}
> The DataFrame can be saved into a parquet file
> {code}
> arrow_table = pa.Table.from_pandas(df)
> pq.write_table(arrow_table, 'college_data.parquet')
> {code}
> But trying to read the parquet file immediately afterwards results in the following
> {code}
> df = pq.read_table('college_data.parquet').to_pandas()
> > ---------------------------------------------------------------------------
> UnicodeEncodeError Traceback (most recent call last)
> <ipython-input-29-23906ea1efe3> in <module>()
> ----> 2 df = pq.read_table('college_data.parquet').to_pandas()
> /Users/anaconda/envs/env/lib/python2.7/site-packages/pyarrow/table.pxi in pyarrow.lib.Table.to_pandas (/Users/travis/build/BryanCutler/arrow-dist/arrow/python/build/temp.macosx-10.6-intel-2.7/lib.cxx:46331)()
> 1041 if nthreads is None:
> 1042 nthreads = cpu_count()
> -> 1043 mgr = pdcompat.table_to_blockmanager(options, self, memory_pool,
> 1044 nthreads)
> 1045 return pd.DataFrame(mgr)
> /Users/anaconda/envs/env/lib/python2.7/site-packages/pyarrow/pandas_compat.pyc in table_to_blockmanager(options, table, memory_pool, nthreads, categoricals)
> 539 if columns:
> 540 columns_name_dict = {
> --> 541 c.get('field_name', str(c['name'])): c['name'] for c in columns
> 542 }
> 543 columns_values = [
> /Users/anaconda/envs/env/lib/python2.7/site-packages/pyarrow/pandas_compat.pyc in <dictcomp>((c,))
> 539 if columns:
> 540 columns_name_dict = {
> --> 541 c.get('field_name', str(c['name'])): c['name'] for c in columns
> 542 }
> 543 columns_values = [
> UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 0: ordinal not in range(128)
> {code}
> Looking at the stack trace, the culprit appears to be this line, which calls str(); on Python 2 that tries to encode to ASCII by default: https://github.com/apache/arrow/blob/master/python/pyarrow/pandas_compat.py#L541
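To illustrate the failure described in the report: on Python 2, calling str() on a unicode object encodes it with the default codec, which is normally ASCII, so any non-ASCII character such as the BOM raises UnicodeEncodeError. The sketch below reproduces that with an explicit encode so it behaves the same on Python 2 and 3. The to_text helper is hypothetical and only shows one way to avoid the implicit ASCII step; it is not necessarily how pyarrow addresses this.

```python
# Hypothetical column name with a leading BOM, as in the traceback.
name = u"\ufeffUNITID"

# Python 2's str(name) is effectively name.encode('ascii') by default.
try:
    name.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError as exc:
    ascii_ok = False
    print(exc)

# unicode on Python 2, str on Python 3.
text_type = type(u"")


def to_text(value):
    """Hypothetical helper: coerce a value to text without ASCII encoding."""
    if isinstance(value, bytes):
        # Assumes byte strings are UTF-8 encoded.
        return value.decode("utf-8")
    return text_type(value)


print(repr(to_text(name)))
```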
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)