Posted to issues@arrow.apache.org by "Joris Van den Bossche (JIRA)" <ji...@apache.org> on 2019/06/20 14:44:00 UTC

[jira] [Commented] (ARROW-5666) [Python] Underscores in partition (string) values are dropped when reading dataset

    [ https://issues.apache.org/jira/browse/ARROW-5666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16868583#comment-16868583 ] 

Joris Van den Bossche commented on ARROW-5666:
----------------------------------------------

Thanks for the report!

The problem is that we try to convert the keys to integers and, if that fails, just preserve them as strings.
That is done here: https://github.com/apache/arrow/blob/961927af56b83d0dbca91132c3f07aa06d69fc63/python/pyarrow/parquet.py#L659-L663

{code}
        # Only integer and string partition types are supported right now
        try:
            integer_keys = [int(x) for x in self.keys]
            dictionary = lib.array(integer_keys)
        except ValueError:
            dictionary = lib.array(self.keys)
{code}

and apparently, Python will convert a string with an underscore to an integer ...

{code}
In [3]: int("2019_1")
Out[3]: 20191
{code}

This is because since Python 3.6 (PEP 515) underscores are allowed in integer literals (e.g. to separate thousands), and int() accepts them in strings as well.
We could special-case this and first check whether there is an underscore in the string before trying to convert to integers, but that's a bit ugly.
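
To make the special-casing idea concrete, here is a minimal sketch of a stricter check. The helper name `parse_partition_keys` is hypothetical and not part of pyarrow; it only illustrates one possible approach, namely accepting a key as an integer only when it consists of an optional sign followed by ASCII digits, rather than relying on whatever int() happens to accept.

```python
import re

# Hypothetical helper, not part of pyarrow: treat a key as an integer only
# if it is an optional sign followed by ASCII digits. Strings such as
# "2019_1" (which int() would accept) are therefore left as strings.
_PLAIN_INT = re.compile(r"[+-]?[0-9]+")


def parse_partition_keys(keys):
    """Return ints if every key is a plain integer, else the original strings."""
    if all(_PLAIN_INT.fullmatch(k) for k in keys):
        return [int(k) for k in keys]
    return list(keys)


print(parse_partition_keys(["2019_2", "2019_3"]))  # keys stay strings
print(parse_partition_keys(["2019", "2020"]))      # keys become ints
```

A pattern check like this is also stricter than a bare underscore test, since int() tolerates surrounding whitespace too; whether that extra strictness is wanted for partition keys is a separate question.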

> [Python] Underscores in partition (string) values are dropped when reading dataset
> ----------------------------------------------------------------------------------
>
>                 Key: ARROW-5666
>                 URL: https://issues.apache.org/jira/browse/ARROW-5666
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.13.0
>            Reporter: Julian de Ruiter
>            Priority: Major
>
> When reading a partitioned dataset, in which the partition column contains string values with underscores, pyarrow seems to be ignoring the underscores in the resulting values.
> For example if I write and then read a dataset as follows:
> {code:java}
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pandas as pd
> df = pd.DataFrame({
>     "year_week": ["2019_2", "2019_3"],
>     "value": [1, 2]
> })
> table = pa.Table.from_pandas(df.head())
> pq.write_to_dataset(table, 'test', partition_cols=["year_week"])
> table2 = pq.ParquetDataset('test').read()
> {code}
> The resulting 'year_week' column in table2 has lost the underscores:
> {code:java}
> table2[1] # Gives:
> <Column name='year_week' type=DictionaryType(dictionary<values=int64, indices=int32, ordered=0>)>
> [
>   -- dictionary:
>     [
>       20192,
>       20193
>     ]
>   -- indices:
>     [
>       0
>     ],
>   -- dictionary:
>     [
>       20192,
>       20193
>     ]
>   -- indices:
>     [
>       1
>     ]
> ]
> {code}
> Is this intentional behaviour or is this a bug in arrow?


