Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2022/09/23 13:56:00 UTC
[jira] [Updated] (ARROW-7385) [Python] ParquetDataset deadlock with different metadata_nthreads values
[ https://issues.apache.org/jira/browse/ARROW-7385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joris Van den Bossche updated ARROW-7385:
-----------------------------------------
Labels: dataset dataset-parquet-legacy dataset-parquet-read parquet (was: dataset dataset-parquet-read parquet)
> [Python] ParquetDataset deadlock with different metadata_nthreads values
> ------------------------------------------------------------------------
>
> Key: ARROW-7385
> URL: https://issues.apache.org/jira/browse/ARROW-7385
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.12.1, 0.14.1, 0.15.1
> Reporter: Chongkai Zhu
> Priority: Major
> Labels: dataset, dataset-parquet-legacy, dataset-parquet-read, parquet
>
> {code}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> output_folder = r"C:\scr\tmp"  # raw string, so "\t" is not interpreted as a tab character
> weather_df = pd.DataFrame({
>     "a": [1, 1, 1, 2, 2, 2, 3, 3, 3],
>     "b": [1, 1, 1, 1, 5, 1, 1, 1, 1],
>     "c": ["c1", "c1", "c1", "c10", "c20", "c30", "c1", "c1", "c1"],
>     "d": [32, 32, 32, 32, 32, 32, 32, 32, 32],
> })
> table = pa.Table.from_pandas(weather_df)
> pq.write_to_dataset(table, root_path=output_folder, partition_cols=["a", "b", "c"])
> {code}
> h1. works for 1 thread
> {code}
> dataset = pq.ParquetDataset(output_folder, metadata_nthreads=1, validate_schema=False)
> {code}
> h1. hangs for 2 to 6 threads (the outcome varies from run to run)
> {code}
> dataset = pq.ParquetDataset(output_folder, metadata_nthreads=2, validate_schema=False)
> dataset = pq.ParquetDataset(output_folder, metadata_nthreads=6, validate_schema=False)
> {code}
> h1. works for 60 threads
> {code}
> dataset = pq.ParquetDataset(output_folder, metadata_nthreads=60, validate_schema=False)
> {code}
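
Because the hang in the report varies from run to run, it may help to try each thread count in a separate process with a time limit, so that a stuck call can be killed rather than blocking the whole script. The following is a minimal sketch of that idea and is not part of the original report: `finishes_within` and `build_snippet` are hypothetical helper names, and it assumes a pyarrow version that still accepts the `metadata_nthreads` and `validate_schema` arguments used above.

```python
import subprocess
import sys


def build_snippet(folder: str, nthreads: int) -> str:
    """Return Python source that opens the dataset with the given thread count."""
    # repr() keeps backslashes in Windows paths intact inside the generated code.
    return (
        "import pyarrow.parquet as pq\n"
        f"pq.ParquetDataset({folder!r}, metadata_nthreads={nthreads}, "
        "validate_schema=False)\n"
    )


def finishes_within(snippet: str, timeout_s: float) -> bool:
    """Run `snippet` in a fresh Python process.

    Returns True if it exits successfully within `timeout_s` seconds and
    False if it is still running at the deadline (the child is then killed).
    A snippet that fails with an error raises CalledProcessError, so a crash
    is not mistaken for a successful run.
    """
    try:
        subprocess.run(
            [sys.executable, "-c", snippet], timeout=timeout_s, check=True
        )
    except subprocess.TimeoutExpired:
        return False
    return True
```

One possible use is `{n: finishes_within(build_snippet(output_folder, n), 30.0) for n in (1, 2, 6, 60)}`, where the 30-second limit is an arbitrary assumption. A `False` entry would suggest that the call hung for that thread count, although a single `True` does not rule out an intermittent hang.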
--
This message was sent by Atlassian Jira
(v8.20.10#820010)