You are viewing a plain text version of this content. The canonical link for it is here.
Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2021/01/27 14:53:00 UTC
[jira] [Created] (ARROW-11400) [Python] Pickled ParquetFileFragment
has invalid partition_expresion with dictionary type in pyarrow 2.0
Joris Van den Bossche created ARROW-11400:
---------------------------------------------
Summary: [Python] Pickled ParquetFileFragment has invalid partition_expresion with dictionary type in pyarrow 2.0
Key: ARROW-11400
URL: https://issues.apache.org/jira/browse/ARROW-11400
Project: Apache Arrow
Issue Type: Bug
Components: Python
Reporter: Joris Van den Bossche
Assignee: Joris Van den Bossche
From https://github.com/dask/dask/pull/7066#issuecomment-767156623
Simplified reproducer:
{code:python}
import pyarrow.parquet as pq
import pyarrow.dataset as ds
table = pa.table({'part': ['A', 'B']*5, 'col': range(10)})
pq.write_to_dataset(table, "test_partitioned_parquet", partition_cols=["part"])
# with partitioning_kwargs = {} there is no error
partitioning_kwargs = {"max_partition_dictionary_size": -1}
dataset = ds.dataset(
"test_partitioned_parquet/", format="parquet",
partitioning=ds.HivePartitioning.discover( **partitioning_kwargs)
)
frag = list(dataset.get_fragments())[0]
{code}
Querying this fragment works fine, but after serialization/deserialization with pickle, it gives errors (and with the original data example I actually got a segfault as well):
{code}
In [16]: import pickle
In [17]: frag2 = pickle.loads(pickle.dumps(frag))
In [19]: frag2.partition_expression
...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 16: invalid continuation byte
In [20]: frag2.to_table(schema=schema, columns=columns)
Out[20]:
pyarrow.Table
col: int64
part: dictionary<values=string, indices=int32, ordered=0>
In [21]: frag2.to_table(schema=schema, columns=columns).to_pandas()
...
~/miniconda3/envs/arrow-20/lib/python3.8/site-packages/pyarrow/table.pxi in pyarrow.lib.table_to_blocks()
ArrowException: Unknown error: Wrapping ɻ� failed
{code}
It seems the issue was specifically with a partition expression with dictionary type.
Also when using an integer columns as the partition column, you get wrong values (but silently in this case):
{code:python}
In [42]: frag.partition_expression
Out[42]:
<pyarrow.dataset.Expression (part == [
1,
2
][0]:dictionary<values=int32, indices=int32, ordered=0>)>
In [43]: frag2.partition_expression
Out[43]:
<pyarrow.dataset.Expression (part == [
170145232,
32754
][0]:dictionary<values=int32, indices=int32, ordered=0>)>
{code}
Now, it seems this is fixed in master. But since I don't remember it was fixed intentionally ([~bkietz]?), it would be good to add some tests for it.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)