Posted to issues@arrow.apache.org by "Antoine Pitrou (JIRA)" <ji...@apache.org> on 2019/06/03 14:13:00 UTC

[jira] [Resolved] (ARROW-5430) [Python] Can read but not write parquet partitioned on large ints

     [ https://issues.apache.org/jira/browse/ARROW-5430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoine Pitrou resolved ARROW-5430.
-----------------------------------
       Resolution: Fixed
    Fix Version/s: 0.14.0

Issue resolved by pull request 4440
[https://github.com/apache/arrow/pull/4440]

> [Python] Can read but not write parquet partitioned on large ints
> -----------------------------------------------------------------
>
>                 Key: ARROW-5430
>                 URL: https://issues.apache.org/jira/browse/ARROW-5430
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.13.0
>         Environment: Mac OSX 10.14.4, Python 3.7.1, x86_64.
>            Reporter: Robin Kåveland
>            Priority: Minor
>              Labels: parquet, pull-request-available
>             Fix For: 0.14.0
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> Here's a contrived example that reproduces this issue using pandas:
> {code:python}
> import numpy as np
> import pandas as pd
> real_usernames = np.array(['anonymize', 'me'])
> usernames = pd.util.hash_array(real_usernames)
> login_count = [13, 9]
> df = pd.DataFrame({'user': usernames, 'logins': login_count})
> # Can write
> df.to_parquet('can_write.parq', partition_cols=['user'])
> # But not read
> pd.read_parquet('can_write.parq'){code}
> Expected behaviour:
>  * Either the write fails
>  * Or the read succeeds
> Actual behaviour: The read fails with the following error:
> {code:java}
> Traceback (most recent call last):
>   File "<stdin>", line 2, in <module>
>   File "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pandas/io/parquet.py", line 282, in read_parquet
>     return impl.read(path, columns=columns, **kwargs)
>   File "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pandas/io/parquet.py", line 129, in read
>     **kwargs).to_pandas()
>   File "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/parquet.py", line 1152, in read_table
>     use_pandas_metadata=use_pandas_metadata)
>   File "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/filesystem.py", line 181, in read_parquet
>     use_pandas_metadata=use_pandas_metadata)
>   File "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/parquet.py", line 1014, in read
>     use_pandas_metadata=use_pandas_metadata)
>   File "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/parquet.py", line 587, in read
>     dictionary = partitions.levels[i].dictionary
>   File "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/parquet.py", line 642, in dictionary
>     dictionary = lib.array(integer_keys)
>   File "pyarrow/array.pxi", line 173, in pyarrow.lib.array
>   File "pyarrow/array.pxi", line 36, in pyarrow.lib._sequence_to_array
>   File "pyarrow/error.pxi", line 104, in pyarrow.lib.check_status
> pyarrow.lib.ArrowException: Unknown error: Python int too large to convert to C long{code}
> I set the priority to Minor here because it's easy enough to work around this in user code, unless you really need the 64-bit hash (and you probably shouldn't be partitioning on that anyway).
> I could take a stab at writing a patch for this if there's interest?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)