Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2020/10/22 09:24:00 UTC

[jira] [Closed] (ARROW-9801) DictionaryArray with non-unique values are silently corrupted when written to a Parquet file

     [ https://issues.apache.org/jira/browse/ARROW-9801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joris Van den Bossche closed ARROW-9801.
----------------------------------------
    Resolution: Duplicate

> DictionaryArray with non-unique values are silently corrupted when written to a Parquet file
> --------------------------------------------------------------------------------------------
>
>                 Key: ARROW-9801
>                 URL: https://issues.apache.org/jira/browse/ARROW-9801
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 1.0.0
>         Environment: pyarrow 1.0.0 installed from conda-forge.
>            Reporter: Jim Pivarski
>            Priority: Major
>             Fix For: 2.0.0
>
>
> Suppose that you have a DictionaryArray with repeated values in the dictionary:
> {{>>> import pyarrow as pa}}
> {{>>> pa_array = pa.DictionaryArray.from_arrays(}}
> {{...     pa.array([0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5]),}}
> {{...     pa.array(["one", "two", "three", "one", "two", "three"])}}
> {{... )}}
> {{>>> pa_array}}
> {{<pyarrow.lib.DictionaryArray object at 0x7f271befa4a0>}}
> {{-- dictionary:}}
> {{ [}}
> {{    "one",}}
> {{    "two",}}
> {{    "three",}}
> {{    "one",}}
> {{    "two",}}
> {{    "three"}}
> {{ ]}}
> {{-- indices:}}
> {{ [}}
> {{    0,}}
> {{    1,}}
> {{    2,}}
> {{    3,}}
> {{    4,}}
> {{    5,}}
> {{    0,}}
> {{    1,}}
> {{    2,}}
> {{    3,}}
> {{    4,}}
> {{    5}}
> {{ ]}}
> According to [the documentation|https://arrow.apache.org/docs/format/Columnar.html#dictionary-encoded-layout],
> {quote}Dictionary encoding is a data representation technique to represent values by integers referencing a *dictionary* usually consisting of unique values.
> {quote}
> so a DictionaryArray like the one above is arguably invalid, but if so, then I'd expect some error messages, rather than corrupt data, when I try to write it to a Parquet file.
> {{>>> pa_table = pa.Table.from_batches(}}
> {{...     [pa.RecordBatch.from_arrays([pa_array], ["column"])]}}
> {{... )}}
> {{>>> pa_table}}
> {{pyarrow.Table}}
> {{column: dictionary<values=string, indices=int64, ordered=0>}}
> {{>>> import pyarrow.parquet}}
> {{>>> pyarrow.parquet.write_table(pa_table, "tmp2.parquet")}}
> No errors so far. So we try to read it back and view it:
> {{>>> pa_loaded = pyarrow.parquet.read_table("tmp2.parquet")}}
> {{>>> pa_loaded}}
> {{pyarrow.Table}}
> {{column: dictionary<values=string, indices=int32, ordered=0>}}
> {{>>> pa_loaded.to_pydict()}}
> {{Traceback (most recent call last):}}
> {{ File "<stdin>", line 1, in <module>}}
> {{ File "pyarrow/table.pxi", line 1587, in pyarrow.lib.Table.to_pydict}}
> {{ File "pyarrow/table.pxi", line 405, in pyarrow.lib.ChunkedArray.to_pylist}}
> {{ File "pyarrow/array.pxi", line 1144, in pyarrow.lib.Array.to_pylist}}
> {{ File "pyarrow/scalar.pxi", line 712, in pyarrow.lib.DictionaryScalar.as_py}}
> {{ File "pyarrow/scalar.pxi", line 701, in pyarrow.lib.DictionaryScalar.value.__get__}}
> {{ File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status}}
> {{ File "pyarrow/error.pxi", line 111, in pyarrow.lib.check_status}}
> {{pyarrow.lib.ArrowIndexError: tried to refer to element 3 but array is only 3 long}}
> Looking more closely at this, we see that the dictionary has been minimized to include only unique values, but the indices haven't been correctly translated:
> {{>>> pa_loaded["column"]}}
> {{<pyarrow.lib.ChunkedArray object at 0x7f0a8fb16a90>}}
> {{[}}
> {{    -- dictionary:}}
> {{    [}}
> {{        "one",}}
> {{        "two",}}
> {{        "three"}}
> {{    ]}}
> {{    -- indices:}}
> {{    [}}
> {{        0,}}
> {{        1,}}
> {{        2,}}
> {{        3,}}
> {{        0,}}
> {{        1,}}
> {{        1,}}
> {{        1,}}
> {{        2,}}
> {{        3,}}
> {{        0,}}
> {{        1}}
> {{    ]}}
> {{]}}
> It looks like an attempt was made to minimize it, but the indices ought to be
> [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]
> I don't know what your preferred course of action is—adding an error message or fixing the attempted conversion—but this is wrong. On my side, I'm adding code to prevent the creation of non-unique values in DictionaryArrays.
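
The two courses of action named in the report (raising an error, or translating the indices correctly) can be sketched in plain Python. This is only an illustration under stated assumptions, not pyarrow's or the Parquet writer's actual code: `require_unique` and `dedupe_dictionary` are hypothetical helper names, and ordinary lists stand in for the Arrow index and dictionary arrays.

```python
from typing import Hashable, List, Optional, Sequence, Tuple


def require_unique(dictionary: Sequence[Hashable]) -> None:
    """Hypothetical check: raise if the dictionary holds repeated values."""
    if len(set(dictionary)) != len(dictionary):
        raise ValueError("dictionary contains non-unique values")


def dedupe_dictionary(
    indices: Sequence[Optional[int]],
    dictionary: Sequence[Hashable],
) -> Tuple[List[Optional[int]], List[Hashable]]:
    """Hypothetical helper: drop repeated dictionary values and remap indices.

    Returns (new_indices, unique_values). Null indices (None) stay null.
    """
    position = {}     # value -> its slot in the deduplicated dictionary
    unique = []       # deduplicated dictionary, in first-seen order
    old_to_new = []   # old dictionary slot -> new dictionary slot
    for value in dictionary:
        if value not in position:
            position[value] = len(unique)
            unique.append(value)
        old_to_new.append(position[value])
    new_indices = [None if i is None else old_to_new[i] for i in indices]
    return new_indices, unique


indices = [0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5]
dictionary = ["one", "two", "three", "one", "two", "three"]
new_indices, unique = dedupe_dictionary(indices, dictionary)
```

Applied to the example from the report, this gives the dictionary ["one", "two", "three"] and the indices [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2], which is the result the reporter expected. With pyarrow itself, a round trip through `dictionary_decode()` and `dictionary_encode()` may achieve the same effect, though that is an assumption not tested against 1.0.0.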


