Posted to issues@arrow.apache.org by "Wes McKinney (JIRA)" <ji...@apache.org> on 2018/01/24 04:02:01 UTC
[jira] [Assigned] (ARROW-1976) [Python] Handling unicode pandas columns on parquet.read_table

     [ https://issues.apache.org/jira/browse/ARROW-1976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney reassigned ARROW-1976:
-----------------------------------

    Assignee: Licht Takeuchi

> [Python] Handling unicode pandas columns on parquet.read_table
> --------------------------------------------------------------
>
>                 Key: ARROW-1976
>                 URL: https://issues.apache.org/jira/browse/ARROW-1976
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.8.0
>            Reporter: Simbarashe Nyatsanga
>            Assignee: Licht Takeuchi
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.9.0
>
>
> Unicode column names in pandas DataFrames are not handled correctly for some datasets when a parquet file is read back into a pandas DataFrame, which raises the familiar Python 2 ASCII encoding error (UnicodeEncodeError).
>  
> The dataset used to get the error is here: https://catalog.data.gov/dataset/college-scorecard
> {code}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> df = pd.read_csv('college_data.csv')
> {code}
> For verification, the DataFrame's column names are indeed unicode:
> {code}
> df.columns
> > Index([u'UNITID', u'OPEID', u'OPEID6', u'INSTNM', u'CITY', u'STABBR',
>        u'INSTURL', u'NPCURL', u'HCM2', u'PREDDEG',
>        ...
>        u'RET_PTL4', u'PCTFLOAN', u'UG25ABV', u'MD_EARN_WNE_P10', u'GT_25K_P6',
>        u'GRAD_DEBT_MDN_SUPP', u'GRAD_DEBT_MDN10YR_SUPP', u'RPY_3YR_RT_SUPP',
>        u'C150_L4_POOLED_SUPP', u'C150_4_POOLED_SUPP'],
>       dtype='object', length=123)
> {code}
> The DataFrame can be saved into a parquet file:
> {code}
> arrow_table = pa.Table.from_pandas(df)
> pq.write_table(arrow_table, 'college_data.parquet')
> {code}
> But trying to read the parquet file immediately afterwards results in the following error:
> {code}
> df = pq.read_table('college_data.parquet').to_pandas()
> > ---------------------------------------------------------------------------
> UnicodeEncodeError                        Traceback (most recent call last)
> <ipython-input-29-23906ea1efe3> in <module>()
> ----> 2 df = pq.read_table('college_data.parquet').to_pandas()
> /Users/anaconda/envs/env/lib/python2.7/site-packages/pyarrow/table.pxi in pyarrow.lib.Table.to_pandas (/Users/travis/build/BryanCutler/arrow-dist/arrow/python/build/temp.macosx-10.6-intel-2.7/lib.cxx:46331)()
>    1041         if nthreads is None:
>    1042             nthreads = cpu_count()
> -> 1043         mgr = pdcompat.table_to_blockmanager(options, self, memory_pool,
>    1044                                              nthreads)
>    1045         return pd.DataFrame(mgr)
> /Users/anaconda/envs/env/lib/python2.7/site-packages/pyarrow/pandas_compat.pyc in table_to_blockmanager(options, table, memory_pool, nthreads, categoricals)
>     539     if columns:
>     540         columns_name_dict = {
> --> 541             c.get('field_name', str(c['name'])): c['name'] for c in columns
>     542         }
>     543         columns_values = [
> /Users/anaconda/envs/env/lib/python2.7/site-packages/pyarrow/pandas_compat.pyc in <dictcomp>((c,))
>     539     if columns:
>     540         columns_name_dict = {
> --> 541             c.get('field_name', str(c['name'])): c['name'] for c in columns
>     542         }
>     543         columns_values = [
> UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 0: ordinal not in range(128)
> {code}
> Judging from the stack trace, the problem appears to be this line, which calls str on the column name; under Python 2, str tries to encode unicode as ASCII by default: https://github.com/apache/arrow/blob/master/python/pyarrow/pandas_compat.py#L541
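
The character named in the error, u'\ufeff', is the Unicode byte order mark (BOM), so at least one column name appears to begin with it. One common source is a UTF-8 BOM at the start of a CSV file, though the report does not confirm this for the dataset above. The sketch below is only an illustration under that assumption: the CSV bytes and the helper names first_header and to_ascii_or_none are hypothetical, and the explicit encode("ascii") call stands in for the implicit ASCII encoding that Python 2 performs in str(c['name']).

```python
import codecs
import io

# Hypothetical CSV bytes: a UTF-8 BOM, then a header row and one record.
raw = codecs.BOM_UTF8 + b"UNITID,INSTNM\n1,Example College\n"


def first_header(data, encoding):
    """Decode the bytes and return the first column name of the header row."""
    text = io.StringIO(data.decode(encoding))
    return text.readline().rstrip("\n").split(",")[0]


# Plain "utf-8" keeps the BOM as U+FEFF at the front of the first name.
with_bom = first_header(raw, "utf-8")
# "utf-8-sig" strips a leading BOM while decoding.
without_bom = first_header(raw, "utf-8-sig")


def to_ascii_or_none(name):
    """Mimic Python 2's implicit ASCII encoding of a unicode column name."""
    try:
        return name.encode("ascii")
    except UnicodeEncodeError:
        # Same failure class as in the traceback above.
        return None


print(repr(with_bom), to_ascii_or_none(with_bom))
print(repr(without_bom), to_ascii_or_none(without_bom))
```

If the BOM does come from the CSV, passing encoding='utf-8-sig' to pd.read_csv may serve as a workaround until the str call in pandas_compat.py is fixed; any other non-ASCII column name would still hit the same error.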


