Posted to jira@arrow.apache.org by "David Li (Jira)" <ji...@apache.org> on 2022/01/12 20:33:00 UTC

[jira] [Resolved] (ARROW-15265) [C++][Python][Dataset] write_dataset with delete_matching hangs when the number of partitions is too large

     [ https://issues.apache.org/jira/browse/ARROW-15265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Li resolved ARROW-15265.
------------------------------
    Fix Version/s: 7.0.0
       Resolution: Fixed

Issue resolved by pull request 12099
[https://github.com/apache/arrow/pull/12099]

> [C++][Python][Dataset] write_dataset with delete_matching hangs when the number of partitions is too large
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-15265
>                 URL: https://issues.apache.org/jira/browse/ARROW-15265
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>    Affects Versions: 6.0.1
>            Reporter: Caleb Overman
>            Assignee: David Li
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 7.0.0
>
>          Time Spent: 4h 40m
>  Remaining Estimate: 0h
>
> I'm attempting to use the {{existing_data_behavior="delete_matching"}} option of {{ds.write_dataset}} to write a Hive-partitioned Parquet dataset to S3. This works fine when the table being written creates 7 or fewer partitions, but as soon as the partition column has an 8th unique value the write hangs completely.
>  
> {code:python}
> import numpy as np
> import pyarrow as pa
> from pyarrow import fs
> import pyarrow.dataset as ds
> bucket = "my-bucket"
> s3 = fs.S3FileSystem()
> cols_7 = ["a", "b", "c", "d", "e", "f", "g"]
> table_7 = pa.table(
>     {"col1": cols_7 * 5, "col2": np.random.randn(len(cols_7) * 5)}
> )
> # succeeds
> ds.write_dataset(
>     data=table_7,
>     base_dir=f"{bucket}/test7.parquet",
>     format="parquet",
>     partitioning=["col1"],
>     partitioning_flavor="hive",
>     filesystem=s3,
>     existing_data_behavior="delete_matching",
> )
> cols_8 = ["a", "b", "c", "d", "e", "f", "g", "h"]
> table_8 = pa.table(
>     {"col1": cols_8 * 5, "col2": np.random.randn(len(cols_8) * 5)}
> )
> # this hangs
> ds.write_dataset(
>     data=table_8,
>     base_dir=f"{bucket}/test8.parquet",
>     format="parquet",
>     partitioning=["col1"],
>     partitioning_flavor="hive",
>     filesystem=s3,
>     existing_data_behavior="delete_matching",
> )
> {code}
> For the table with 8 partitions, the directory structure is created in S3 but no data files are written before the call hangs.
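> A workaround sketch I would try, purely a guess on my part and not confirmed anywhere in this report: since the hang starts at exactly 8 partitions, the default I/O thread pool (8 threads, I believe) might be the bottleneck, so raising it before the write may unblock the call. The thread count of 16 and the bucket/path below are illustrative; {{table_8}} is the table from the snippet above.
> {code:python}
> import pyarrow as pa
> import pyarrow.dataset as ds
> from pyarrow import fs
>
> # Guess: raise the I/O thread pool above the number of partitions being
> # (re)written, in case blocking delete calls exhaust the default pool of 8.
> pa.set_io_thread_count(16)
>
> s3 = fs.S3FileSystem()
>
> # table_8 as defined in the snippet above
> ds.write_dataset(
>     data=table_8,
>     base_dir="my-bucket/test8.parquet",  # illustrative bucket/path
>     format="parquet",
>     partitioning=["col1"],
>     partitioning_flavor="hive",
>     filesystem=s3,
>     existing_data_behavior="delete_matching",
> )
> {code}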
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)