You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "alamb (via GitHub)" <gi...@apache.org> on 2023/10/26 12:10:32 UTC

Re: [PR] Move source repartitioning into `ExecutionPlan::repartition` [arrow-datafusion]

alamb commented on code in PR #7936:
URL: https://github.com/apache/arrow-datafusion/pull/7936#discussion_r1373054726


##########
datafusion/physical-plan/src/lib.rs:
##########
@@ -209,15 +210,34 @@ pub trait ExecutionPlan: Debug + DisplayAs + Send + Sync {
         children: Vec<Arc<dyn ExecutionPlan>>,
     ) -> Result<Arc<dyn ExecutionPlan>>;
 
-    /// creates an iterator
+    /// If supported, changes the partitioning of this `ExecutionPlan` to
+    /// produce `target_partitions` partitions.
+    ///
+    /// If the `ExecutionPlan` does not support changing its partitioning,
+    /// returns `Ok(None)` (the default).
+    ///
+    /// The DataFusion optimizer attempts to use as many threads as possible by
+    /// repartitioning its inputs to match the target number of threads
+    /// available (`target_partitions`). Some data sources, such as the built in
+    /// CSV and Parquet readers, are able to read from their input files in
+    /// parallel, regardless of how the source data is split amongst files.
+    fn repartitioned(

Review Comment:
   The PR hoists this common code to the `ExecutionPlan` trait, which is easier to understand as well as allowing user defined sources to be repartitioned, if they support that



##########
datafusion/core/src/datasource/physical_plan/csv.rs:
##########
@@ -117,34 +118,6 @@ impl CsvExec {
     pub fn escape(&self) -> Option<u8> {
         self.escape
     }
-
-    /// Redistribute files across partitions according to their size
-    /// See comments on `repartition_file_groups()` for more detail.
-    ///
-    /// Return `None` if can't get repartitioned(empty/compressed file).
-    pub fn get_repartitioned(

Review Comment:
   This just was moved into the `impl ExecutionPlan` and the signature is changed



##########
datafusion/core/src/physical_optimizer/enforce_distribution.rs:
##########
@@ -26,9 +26,6 @@ use std::fmt::Formatter;
 use std::sync::Arc;
 
 use crate::config::ConfigOptions;
-use crate::datasource::physical_plan::CsvExec;

Review Comment:
   Now enforce distribution is not dependent on the specific operators 🎉 



##########
datafusion/core/src/physical_optimizer/enforce_distribution.rs:
##########
@@ -26,9 +26,6 @@ use std::fmt::Formatter;
 use std::sync::Arc;
 
 use crate::config::ConfigOptions;
-use crate::datasource::physical_plan::CsvExec;

Review Comment:
   Now enforce distribution is not dependent on the specific operators 🎉 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org