Posted to dev@arrow.apache.org by Al Taylor <al...@googlemail.com.INVALID> on 2020/10/08 15:04:22 UTC

[Python] Dictionary Arrays with duplicate values jumbling on round-trip to parquet

Hi,

I've found the following odd behaviour when round-tripping data through Parquet with pyarrow: it appears when the data contains dictionary arrays with duplicate values.

```python
import pyarrow as pa
import pyarrow.parquet as pq

my_table = pa.Table.from_batches(
    [
        pa.RecordBatch.from_arrays(
            [
                pa.array([0, 1, 2, 3, 4]),
                # Note: the dictionary values contain a duplicate ('d').
                pa.DictionaryArray.from_arrays(
                    pa.array([0, 1, 2, 3, 4]),
                    pa.array(['a', 'd', 'c', 'd', 'e'])
                )
            ],
            names=['foo', 'bar']
        )
    ]
)
my_table.validate(full=True)

pq.write_table(my_table, "foo.parquet")

read_table = pq.ParquetFile("foo.parquet").read()
read_table.validate(full=True)

print(my_table.column(1).to_pylist())
print(read_table.column(1).to_pylist())

assert my_table.column(1).to_pylist() == read_table.column(1).to_pylist()
```

Both tables pass full validation, yet the last three lines print:
```
['a', 'd', 'c', 'd', 'e']
['a', 'd', 'c', 'e', 'a']
Traceback (most recent call last):
  File "/home/ataylor/projects/dsg-python-dtcc-equity-kinetics/dsg/example.py", line 29, in <module>
    assert my_table.column(1).to_pylist() == read_table.column(1).to_pylist()
AssertionError
```

That clearly doesn't look right!

My question is: am I breaking a fundamental assumption that dictionary values must be unique, or is this a bug in the Parquet-Arrow conversion?

Thanks,

Al