You are viewing a plain text version of this content. The canonical link for it is here.

Posted to jira@arrow.apache.org by "Andy Grove (Jira)" <ji...@apache.org> on 2020/12/29 16:31:00 UTC

[jira] [Resolved] (ARROW-10995) [Rust] [DataFusion] Improve parallelism when reading Parquet files

     [ https://issues.apache.org/jira/browse/ARROW-10995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andy Grove resolved ARROW-10995.
--------------------------------
    Resolution: Fixed

Issue resolved by pull request 9029
[https://github.com/apache/arrow/pull/9029]

> [Rust] [DataFusion] Improve parallelism when reading Parquet files
> ------------------------------------------------------------------
>
>                 Key: ARROW-10995
>                 URL: https://issues.apache.org/jira/browse/ARROW-10995
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Rust - DataFusion
>            Reporter: Andy Grove
>            Assignee: Andy Grove
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.0.0
>
>          Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> Currently the unit of parallelism is the number of parquet files being read.
> For example, if we run a query against a Parquet table that consists of 8 partitions then we will attempt to run 8 async tasks in parallel and if there is a single Parquet file then we will only try and run 1 async task so this does not scale well. Also, if there are hundreds or thousands of Parquet files then we will try and process them all concurrently which also doesn't scale well.
> These are the options for improving this situation:
>  
>  # Use Parquet row groups as the unit of partitioning and divide the number of row groups by the desired level of concurrency (defaulting to number of cores)
>  # Keep file as the unit of partitions and add a RepartitionExec into the plan if there are fewer partitions (files) than cores and in the case where there are more files than cores, split the files up into lists so that each partition is a list of files rather than a single file. Each partition task will process one file at a time.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)