Posted to dev@arrow.apache.org by Al Taylor <al...@googlemail.com.INVALID> on 2020/10/08 15:04:22 UTC
[Python] Dictionary Arrays with duplicate values jumbling on round-trip to parquet
Hi,
I've found the following odd behaviour when round-tripping data via parquet using pyarrow, when the data contains dictionary arrays with duplicate values.
```python
import pyarrow as pa
import pyarrow.parquet as pq

my_table = pa.Table.from_batches(
    [
        pa.RecordBatch.from_arrays(
            [
                pa.array([0, 1, 2, 3, 4]),
                # Dictionary whose values contain a duplicate: 'd' appears twice
                pa.DictionaryArray.from_arrays(
                    pa.array([0, 1, 2, 3, 4]),
                    pa.array(['a', 'd', 'c', 'd', 'e'])
                ),
            ],
            names=['foo', 'bar']
        )
    ]
)
my_table.validate(full=True)

pq.write_table(my_table, "foo.parquet")
read_table = pq.ParquetFile("foo.parquet").read()
read_table.validate(full=True)

print(my_table.column(1).to_pylist())
print(read_table.column(1).to_pylist())
assert my_table.column(1).to_pylist() == read_table.column(1).to_pylist()
```
Both tables pass full validation, yet the last three lines print:
```
['a', 'd', 'c', 'd', 'e']
['a', 'd', 'c', 'e', 'a']
Traceback (most recent call last):
File "/home/ataylor/projects/dsg-python-dtcc-equity-kinetics/dsg/example.py", line 29, in <module>
assert my_table.column(1).to_pylist() == read_table.column(1).to_pylist()
AssertionError
```
That clearly isn't right! My question is whether I'm breaking some fundamental assumption that dictionary values must be unique, or whether there's a bug in the parquet-arrow conversion.
Thanks,
Al