Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2022/10/25 07:10:00 UTC

[jira] [Updated] (ARROW-10883) [C++][Dataset] Preserve order when writing dataset

     [ https://issues.apache.org/jira/browse/ARROW-10883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joris Van den Bossche updated ARROW-10883:
------------------------------------------
    Fix Version/s: 11.0.0

> [C++][Dataset] Preserve order when writing dataset
> --------------------------------------------------
>
>                 Key: ARROW-10883
>                 URL: https://issues.apache.org/jira/browse/ARROW-10883
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Priority: Major
>             Fix For: 11.0.0
>
>
> Currently, when writing a dataset, e.g. from a table consisting of a set of record batches, there is no guarantee that the row order is preserved when reading the dataset.
> Small code example:
> {code}
> In [1]: import pyarrow as pa, pyarrow.dataset as ds
> In [2]: table = pa.table({"a": range(10)})
> In [3]: table.to_pandas()
> Out[3]: 
>    a
> 0  0
> 1  1
> 2  2
> 3  3
> 4  4
> 5  5
> 6  6
> 7  7
> 8  8
> 9  9
> In [4]: batches = table.to_batches(max_chunksize=2)
> In [5]: ds.write_dataset(batches, "test_dataset_order", format="parquet")
> In [6]: ds.dataset("test_dataset_order").to_table().to_pandas()
> Out[6]: 
>    a
> 0  4
> 1  5
> 2  8
> 3  9
> 4  6
> 5  7
> 6  2
> 7  3
> 8  0
> 9  1
> {code}
> Although this might seem normal in the SQL world, typical dataframe users (R, pandas/dask, etc.) will expect the row order to be preserved.
> Some applications might also rely on this, e.g. with dask you can have a sorted index column ("divisions" between the partitions) that would get lost this way. (Note: the dask parquet writer itself doesn't use {{pyarrow.dataset.write_dataset}}, so it isn't impacted by this.)
> Some discussion about this started in https://github.com/apache/arrow/pull/8305 (ARROW-9782), which changed the dataset writer to write all fragments to a single file instead of one file per fragment.
> I am not fully sure what the best way to solve this is, but IMO at least having the _option_ to preserve the order would be good.
> cc [~bkietz]


