Posted to jira@arrow.apache.org by "Matt Jadczak (Jira)" <ji...@apache.org> on 2020/10/09 11:03:00 UTC

[jira] [Updated] (ARROW-10246) [Python] Incorrect conversion of Arrow dictionary to Parquet dictionary when duplicate values are present

     [ https://issues.apache.org/jira/browse/ARROW-10246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matt Jadczak updated ARROW-10246:
---------------------------------
    Component/s: Python
                 C++

> [Python] Incorrect conversion of Arrow dictionary to Parquet dictionary when duplicate values are present
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-10246
>                 URL: https://issues.apache.org/jira/browse/ARROW-10246
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>            Reporter: Matt Jadczak
>            Priority: Major
>
> Copying this from [the mailing list|https://lists.apache.org/thread.html/r8afb5aed3855e35fe03bd3a27f2c7e2177ed2825c5ad5f455b2c9078%40%3Cdev.arrow.apache.org%3E]
> We can observe the following odd behaviour when round-tripping data through Parquet using pyarrow, if the data contains dictionary arrays with duplicate values.
>
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> my_table = pa.Table.from_batches(
>     [
>         pa.RecordBatch.from_arrays(
>             [
>                 pa.array([0, 1, 2, 3, 4]),
>                 pa.DictionaryArray.from_arrays(
>                     pa.array([0, 1, 2, 3, 4]),
>                     pa.array(['a', 'd', 'c', 'd', 'e'])
>                 )
>             ],
>             names=['foo', 'bar']
>         )
>     ]
> )
> my_table.validate(full=True)
>
> pq.write_table(my_table, "foo.parquet")
> read_table = pq.ParquetFile("foo.parquet").read()
> read_table.validate(full=True)
>
> print(my_table.column(1).to_pylist())
> print(read_table.column(1).to_pylist())
>
> assert my_table.column(1).to_pylist() == read_table.column(1).to_pylist()
> {code}
> Both tables pass full validation, yet the last three lines print:
> {code:none}
> ['a', 'd', 'c', 'd', 'e']
> ['a', 'd', 'c', 'e', 'a']
> Traceback (most recent call last):
>   File "/home/ataylor/projects/dsg-python-dtcc-equity-kinetics/dsg/example.py", line 29, in <module>
>     assert my_table.column(1).to_pylist() == read_table.column(1).to_pylist()
> AssertionError
> {code}
> This clearly isn't right: the round-tripped dictionary column no longer matches the original.
>
> It seems that the cause is as follows: when re-encoding an Arrow dictionary as a Parquet one, the function at
> [https://github.com/apache/arrow/blob/4bbb74713c6883e8523eeeb5ac80a1e1f8521674/cpp/src/parquet/encoding.cc#L773]
> is called to create a Parquet DictEncoder from the Arrow dictionary data. Internally this builds a map from value to index by repeatedly calling GetOrInsert on a memo table. When a duplicate value is passed in, as in the example above, no new dictionary index is allocated; the existing one is returned and simply ignored. However, the caller assumes that the resulting Parquet dictionary uses exactly the same indices as the Arrow one, and proceeds to copy the index data over verbatim. In the example above, the deduplicated dictionary has only four entries, so index 3 now points at 'e' rather than 'd', and index 4 is written as an invalid, out-of-range index (that it is somehow wrapped around when reading again, rather than crashing, is potentially a second bug).
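> To illustrate the mechanism, here is a rough Python sketch of the re-indexing described above (a hypothetical model of the C++ logic, not the actual implementation; the variable names are mine):
> {code:python}
> # Model of the dictionary re-encoding described above.
> arrow_dict = ['a', 'd', 'c', 'd', 'e']   # duplicate 'd' at indices 1 and 3
> arrow_indices = [0, 1, 2, 3, 4]
>
> # GetOrInsert-style memo table: a duplicate value returns the existing
> # slot instead of allocating a new one.
> memo = {}
> parquet_dict = []
> for value in arrow_dict:
>     if value not in memo:
>         memo[value] = len(parquet_dict)
>         parquet_dict.append(value)
> # parquet_dict == ['a', 'd', 'c', 'e'] -- only four entries now
>
> # The bug: the Arrow indices are copied verbatim. Index 3 now points at
> # 'e', and index 4 is out of range (wrapping it, as the reader appears
> # to do, yields 'a').
> print([parquet_dict[i % len(parquet_dict)] for i in arrow_indices])
> # -> ['a', 'd', 'c', 'e', 'a'], the corrupt values seen above
>
> # A fix would remap each index through the memo table before writing:
> remapped = [memo[arrow_dict[i]] for i in arrow_indices]
> print([parquet_dict[i] for i in remapped])
> # -> ['a', 'd', 'c', 'd', 'e']
> {code}
> Note that the sketch reproduces the exact corrupt output observed above, which supports this reading of the code.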



--
This message was sent by Atlassian Jira
(v8.3.4#803005)