Posted to jira@arrow.apache.org by "David Li (Jira)" <ji...@apache.org> on 2022/03/01 14:13:00 UTC

[jira] [Resolved] (ARROW-15784) [C++][Python] Parallel parquet file reading disabled with single file reads

     [ https://issues.apache.org/jira/browse/ARROW-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Li resolved ARROW-15784.
------------------------------
    Fix Version/s: 8.0.0
       Resolution: Fixed

Issue resolved by pull request 12514
[https://github.com/apache/arrow/pull/12514]

> [C++][Python] Parallel parquet file reading disabled with single file reads
> ---------------------------------------------------------------------------
>
>                 Key: ARROW-15784
>                 URL: https://issues.apache.org/jira/browse/ARROW-15784
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 7.0.0
>            Reporter: Weston Pace
>            Assignee: Weston Pace
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 8.0.0, 7.0.1
>
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> The flag {{enable_parallel_column_conversion}} was passed down from Python to C++ when reading Parquet datasets and controlled whether columns were read in parallel.  Parallel column reads were allowed for single files but not when reading multiple files.  This was an old check to help prevent nested deadlock.
> Nested deadlock is no longer an issue, and the flag became mostly inert once we removed the synchronous scanner.
> Unfortunately, when we removed the synchronous scanner we forgot to remove this flag, and as a result a single-file read ended up with parallelism disabled.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)