Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2022/04/15 09:27:00 UTC
[jira] [Assigned] (ARROW-16204) [C++][Dataset] Default error existing_data_behaviour for writing dataset ignores a single file
[ https://issues.apache.org/jira/browse/ARROW-16204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joris Van den Bossche reassigned ARROW-16204:
---------------------------------------------
Assignee: Joris Van den Bossche
> [C++][Dataset] Default error existing_data_behaviour for writing dataset ignores a single file
> ----------------------------------------------------------------------------------------------
>
> Key: ARROW-16204
> URL: https://issues.apache.org/jira/browse/ARROW-16204
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Joris Van den Bossche
> Assignee: Joris Van den Bossche
> Priority: Major
> Labels: dataset, pull-request-available
> Fix For: 8.0.0
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> While trying to understand a failing test in https://github.com/apache/arrow/pull/12811#discussion_r851128672, I noticed that the {{write_dataset}} function does not actually always raise an error by default if there is already existing data in the target location.
> The documentation says it will raise an error "if any data exists in the destination" (which is also what I would expect), but in practice the check ignores certain file names:
> {code:python}
> import pyarrow as pa
> import pyarrow.dataset as ds
> table = pa.table({'a': [1, 2, 3]})
> # write a first time to new directory: OK
> >>> ds.write_dataset(table, "test_overwrite", format="parquet")
> >>> !ls test_overwrite
> part-0.parquet
> # write a second time to the same directory: passes, but should raise?
> >>> ds.write_dataset(table, "test_overwrite", format="parquet")
> >>> !ls test_overwrite
> part-0.parquet
> # write another time to the same directory with a different name: still passes
> >>> ds.write_dataset(table, "test_overwrite", format="parquet", basename_template="data-{i}.parquet")
> >>> !ls test_overwrite
> data-0.parquet part-0.parquet
> # now writing again finally raises an error
> >>> ds.write_dataset(table, "test_overwrite", format="parquet")
> ...
> ArrowInvalid: Could not write to test_overwrite as the directory is not empty and existing_data_behavior is to error
> {code}
> So when checking for existing data, the writer seems to ignore any files that match the basename template pattern.
> cc [~westonpace] do you know if this was intentional? (I would find that a strange corner case, and in any case it is also not documented)
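For illustration, the reporter's hypothesis can be sketched with a stdlib-only snippet. This is hypothetical logic, not the actual Arrow C++ implementation: an existence check that excludes files matching the basename template would let part-0.parquet slip through.

```python
import re

def counts_as_existing_data(filename, basename_template="part-{i}.parquet"):
    """Hypothetical version of the existence check: files whose names
    match the basename template are NOT treated as existing data."""
    # Turn the template into a regex: the {i} placeholder becomes \d+
    pattern = re.escape(basename_template).replace(r"\{i\}", r"\d+")
    return re.fullmatch(pattern, filename) is None

# A file produced by a previous write with the default template is skipped,
# so the "directory is not empty" error never fires for it.
print(counts_as_existing_data("part-0.parquet"))   # ignored -> False
print(counts_as_existing_data("data-0.parquet"))   # counts as data -> True
```

Under this (assumed) rule, the second write in the reproduction above sees only template-shaped files and concludes the destination holds no pre-existing data.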
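Until this is resolved, a caller-side guard can enforce the documented error-on-existing-data behaviour before calling {{write_dataset}}. This is a sketch, not part of the pyarrow API; `ensure_destination_empty` is a hypothetical helper name.

```python
import os

def ensure_destination_empty(base_dir):
    """Raise if base_dir already contains any entries, mirroring the
    behaviour the write_dataset documentation describes."""
    if os.path.isdir(base_dir) and os.listdir(base_dir):
        raise FileExistsError(
            f"destination {base_dir!r} is not empty; refusing to write"
        )

# Usage (hypothetical): guard the write explicitly
# ensure_destination_empty("test_overwrite")
# ds.write_dataset(table, "test_overwrite", format="parquet")
```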
--
This message was sent by Atlassian Jira
(v8.20.1#820001)