Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2022/04/15 09:27:00 UTC

[jira] [Assigned] (ARROW-16204) [C++][Dataset] Default error existing_data_behaviour for writing dataset ignores a single file

     [ https://issues.apache.org/jira/browse/ARROW-16204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joris Van den Bossche reassigned ARROW-16204:
---------------------------------------------

    Assignee: Joris Van den Bossche

> [C++][Dataset] Default error existing_data_behaviour for writing dataset ignores a single file
> ----------------------------------------------------------------------------------------------
>
>                 Key: ARROW-16204
>                 URL: https://issues.apache.org/jira/browse/ARROW-16204
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Assignee: Joris Van den Bossche
>            Priority: Major
>              Labels: dataset, pull-request-available
>             Fix For: 8.0.0
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> While trying to understand a failing test in https://github.com/apache/arrow/pull/12811#discussion_r851128672, I noticed that the {{write_dataset}} function does not actually always raise an error by default if there is already existing data in the target location.
> The documentation says it will raise "if any data exists in the destination" (which is also what I would expect), but in practice it ignores certain file names:
> {code:python}
> import pyarrow as pa
> import pyarrow.dataset as ds
> table = pa.table({'a': [1, 2, 3]})
> # write a first time to new directory: OK
> >>> ds.write_dataset(table, "test_overwrite", format="parquet")
> >>> !ls test_overwrite
> part-0.parquet
> # write a second time to the same directory: passes, but should raise?
> >>> ds.write_dataset(table, "test_overwrite", format="parquet")
> >>> !ls test_overwrite
> part-0.parquet
> # write another time to the same directory with a different basename template: still passes
> >>> ds.write_dataset(table, "test_overwrite", format="parquet", basename_template="data-{i}.parquet")
> >>> !ls test_overwrite
> data-0.parquet	part-0.parquet
> # now writing again finally raises an error
> >>> ds.write_dataset(table, "test_overwrite", format="parquet")
> ...
> ArrowInvalid: Could not write to test_overwrite as the directory is not empty and existing_data_behavior is to error
> {code}
> So when checking whether existing data is present, it seems to ignore any files that match the basename template pattern.
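> Until the check is fixed, one workaround is a manual guard before calling {{write_dataset}} that enforces the documented "raise if any data exists" behaviour. A minimal stdlib sketch (the {{ensure_empty_target}} helper name is hypothetical, not a pyarrow API):
> {code:python}
> import os
>
> def ensure_empty_target(path):
>     # Raise if the target directory already contains any files,
>     # regardless of whether they match the basename template.
>     if os.path.isdir(path) and os.listdir(path):
>         raise FileExistsError(f"refusing to write: {path} is not empty")
> {code}
> Calling this guard before {{ds.write_dataset(table, path, format="parquet")}} raises on the second write in the example above, instead of silently overwriting {{part-0.parquet}}.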
> cc [~westonpace] do you know if this was intentional? (I would find that a strange corner case, and in any case it is also not documented)



--
This message was sent by Atlassian Jira
(v8.20.1#820001)