Posted to dev@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2020/03/12 12:16:00 UTC
[jira] [Created] (ARROW-8087) [C++][Dataset] Order of keys with
HivePartitioning is lost in resulting schema
Joris Van den Bossche created ARROW-8087:
--------------------------------------------
Summary: [C++][Dataset] Order of keys with HivePartitioning is lost in resulting schema
Key: ARROW-8087
URL: https://issues.apache.org/jira/browse/ARROW-8087
Project: Apache Arrow
Issue Type: Improvement
Components: C++ - Dataset
Reporter: Joris Van den Bossche
Currently, when reading a partitioned dataset with hive partitioning, the partition columns appear to be sorted alphabetically when they are appended to the schema, whereas the old ParquetDataset implementation keeps them in the order in which they appear in the paths.
For a regular partitioning, the order in the paths is the same for all fragments.
So for example for the typical NYC Taxi data example, with datasets, the schema ends with columns "month, year", while the ParquetDataset appends them as "year, month".
Python example:
{code}
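import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

foo_keys = [0, 1]
bar_keys = ['a', 'b', 'c']
N = 30
# 30 rows: each 'foo' key repeated 15 times, 'bar' keys cycling through a, b, c
df = pd.DataFrame({
    'foo': np.array(foo_keys, dtype='i4').repeat(15),
    'bar': np.tile(np.tile(np.array(bar_keys, dtype=object), 5), 2),
    'values': np.random.randn(N)
})
# Partition on 'foo' first, then 'bar', so the paths look like foo=.../bar=...
pq.write_to_dataset(pa.table(df), "test_order", partition_cols=['foo', 'bar'])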
{code}
{code}
>>> pq.read_table("test_order").schema
values: double
foo: dictionary<values=int64, indices=int32, ordered=0>
bar: dictionary<values=string, indices=int32, ordered=0>
>>> ds.dataset("test_order", format="parquet", partitioning="hive").schema
values: double
bar: string
foo: int32
{code}
So the resulting order is "foo, bar" vs "bar, foo" (the fact that the partition columns are dictionary-typed in the first case is a separate matter).
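
To make the difference concrete without involving Arrow at all, here is a minimal pure-Python sketch of the two orderings. It is not the actual Dataset or ParquetDataset code; the helper name hive_keys_in_path_order and the file name part-0.parquet are made up for illustration. It only shows that reading the keys from the path gives "foo, bar", while sorting them alphabetically gives "bar, foo".

```python
from pathlib import PurePosixPath


def hive_keys_in_path_order(path):
    """Return the hive partition keys in the order they appear in the path.

    Hypothetical helper for illustration only; not part of pyarrow.
    """
    keys = []
    for segment in PurePosixPath(path).parts:
        key, sep, _value = segment.partition("=")
        # Only segments of the form key=value are partition directories.
        if sep and key:
            keys.append(key)
    return keys


# Assumed shape of a file path written with partition_cols=['foo', 'bar'];
# the file name itself is invented.
example_path = "test_order/foo=0/bar=a/part-0.parquet"

path_order = hive_keys_in_path_order(example_path)
alphabetical = sorted(path_order)

print(path_order)    # ['foo', 'bar']
print(alphabetical)  # ['bar', 'foo']
```

The first list matches the column order reported by the old ParquetDataset, the second the order currently seen with the datasets API.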