Posted to jira@arrow.apache.org by "David Li (Jira)" <ji...@apache.org> on 2022/01/07 16:18:00 UTC

[jira] [Commented] (ARROW-15265) [C++][Python][Dataset] write_dataset with delete_matching hangs when the number of partitions is too large

    [ https://issues.apache.org/jira/browse/ARROW-15265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17470724#comment-17470724 ] 

David Li commented on ARROW-15265:
----------------------------------

Thanks for the report. I can reproduce this with 6.0.1 and Minio, including that the directories are created but the write itself hangs. It also occurs on the development branch.

I haven't dug fully into this yet, but what appears to happen is that DeleteDir exposes a synchronous API while being implemented asynchronously underneath, blocking on a future. One DeleteDir call is made per partition, and that blocking happens on the IO thread pool. The IO thread pool has 8 threads by default, so with 8 partitions all 8 threads end up occupied just waiting. But DeleteDir does its actual work by spawning ListObjectsV2 requests on that same IO thread pool, so those requests can never be scheduled and no progress can be made.
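
To illustrate the pattern, here is a generic sketch with a plain Python thread pool; the names (delete_dir, list_objects) are stand-ins for the idea, not Arrow's actual internals:

{code:python}
# Generic illustration of the pool-starvation pattern described above,
# using a plain Python thread pool (not Arrow's implementation).
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

POOL_SIZE = 8  # mirrors the default IO thread pool size
pool = ThreadPoolExecutor(max_workers=POOL_SIZE)


def list_objects(partition):
    # Stand-in for the ListObjectsV2 request spawned on the same pool.
    return f"objects under {partition}"


def delete_dir(partition):
    # "Synchronous" call implemented by blocking on work submitted to the
    # very pool the caller is already occupying.
    inner = pool.submit(list_objects, partition)
    try:
        return inner.result(timeout=2)  # would block forever without a timeout
    except FutureTimeout:
        return f"{partition}: stuck, no free worker to run list_objects"


# One delete_dir per partition. With POOL_SIZE partitions every worker is
# blocked inside delete_dir, so the nested list_objects tasks never start.
outer = [pool.submit(delete_dir, f"part-{i}") for i in range(POOL_SIZE)]
for f in outer:
    print(f.result())
pool.shutdown()
{code}

Dropping the loop to 7 partitions leaves one worker free to run the nested tasks, which matches the 7-vs-8 threshold in the report. If that diagnosis holds, raising the IO pool size above the partition count (e.g. via pyarrow.set_io_thread_count) would presumably paper over the hang, but the underlying issue is blocking an IO thread on work scheduled to the same pool.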

> [C++][Python][Dataset] write_dataset with delete_matching hangs when the number of partitions is too large
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-15265
>                 URL: https://issues.apache.org/jira/browse/ARROW-15265
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>    Affects Versions: 6.0.1
>            Reporter: Caleb Overman
>            Priority: Major
>
> I'm attempting to use the {{existing_data_behavior="delete_matching"}} option with {{ds.write_dataset}} to write a Hive-partitioned Parquet dataset to S3. This works fine when the table being written creates 7 or fewer partitions, but as soon as the partition column has an 8th unique value the write hangs completely.
>  
> {code:python}
> import numpy as np
> import pyarrow as pa
> from pyarrow import fs
> import pyarrow.dataset as ds
> bucket = "my-bucket"
> s3 = fs.S3FileSystem()
> cols_7 = ["a", "b", "c", "d", "e", "f", "g"]
> table_7 = pa.table(
>     {"col1": cols_7 * 5, "col2": np.random.randn(len(cols_7) * 5)}
> )
> # succeeds
> ds.write_dataset(
>     data=table_7,
>     base_dir=f"{bucket}/test7.parquet",
>     format="parquet",
>     partitioning=["col1"],
>     partitioning_flavor="hive",
>     filesystem=s3,
>     existing_data_behavior="delete_matching",
> )
> cols_8 = ["a", "b", "c", "d", "e", "f", "g", "h"]
> table_8 = pa.table(
>     {"col1": cols_8 * 5, "col2": np.random.randn(len(cols_8) * 5)}
> )
> # this hangs
> ds.write_dataset(
>     data=table_8,
>     base_dir=f"{bucket}/test8.parquet",
>     format="parquet",
>     partitioning=["col1"],
>     partitioning_flavor="hive",
>     filesystem=s3,
>     existing_data_behavior="delete_matching",
> ) {code}
> For the dataset with 8 partitions, the directory structure is created in S3 but no data files are written before it hangs.
>  


