You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/10/13 16:03:37 UTC

[GitHub] [arrow-datafusion] alamb opened a new pull request, #3822: Expose parquet reader settings as DataFusion config settings

alamb opened a new pull request, #3822:
URL: https://github.com/apache/arrow-datafusion/pull/3822

   # Which issue does this PR close?
   
   Closes https://github.com/apache/arrow-datafusion/issues/3821
   
    # Rationale for this change
   I want to test out the parquet filter pushdown on real datasets using datafusion-cli so we can enable it by default -- https://github.com/apache/arrow-datafusion/issues/3463
   
   I want to be able to do so via `datafusion-cli` like:
   
   ```shell
   $ target/debug/datafusion-cli
   DataFusion CLI v13.0.0
   ❯ show all;
   +-------------------------------------------------+---------+
   | name                                            | setting |
   +-------------------------------------------------+---------+
   | datafusion.execution.time_zone                  | UTC     |
   | datafusion.execution.parquet.pushdown_filters   | false   | <---- Note the option is now visible here
   | datafusion.explain.physical_plan_only           | false   |
   | datafusion.execution.coalesce_target_batch_size | 4096    |
   | datafusion.execution.batch_size                 | 8192    |
   | datafusion.execution.coalesce_batches           | true    |
   | datafusion.explain.logical_plan_only            | false   |
   | datafusion.optimizer.skip_failed_rules          | true    |
   | datafusion.optimizer.filter_null_join_keys      | false   |
   +-------------------------------------------------+---------+
   ```
   
   And then set them like:
   
   ```shell
   $ DATAFUSION_EXECUTION_PARQUET_PUSHDOWN_FILTERS=true target/debug/datafusion-cli
   DataFusion CLI v13.0.0
   ❯ show all;
   +-------------------------------------------------+---------+
   | name                                            | setting |
   +-------------------------------------------------+---------+
   | datafusion.execution.batch_size                 | 8192    |
   | datafusion.execution.coalesce_batches           | true    |
   | datafusion.explain.logical_plan_only            | false   |
   | datafusion.optimizer.filter_null_join_keys      | false   |
   | datafusion.execution.parquet.enable_page_index  | false   |
   | datafusion.optimizer.skip_failed_rules          | true    |
   | datafusion.explain.physical_plan_only           | false   |
   | datafusion.execution.time_zone                  | UTC     |
   | datafusion.execution.coalesce_target_batch_size | 4096    |
   | datafusion.execution.parquet.pushdown_filters   | true    | <---- Note the option is set to true here!!!!
   | datafusion.execution.parquet.reorder_filters    | false   |
   +-------------------------------------------------+---------+
   ```
   
   
   # What changes are included in this PR?
   1. Add three new config settings to `ConfigOptions`
   3. Thread `ConfigOptions` down into the FileScanConfig
   2. Remove `ParquetScanOptions` in favor of these new configs (will comment on the rationale here)
   
   # Are there any user-facing changes?
   YES: If you used `ParquetScanOptions` (which I know @thinkharderdev  does) the API has changed. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #3822: Expose parquet reader settings using normal DataFusion `ConfigOptions`

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #3822:
URL: https://github.com/apache/arrow-datafusion/pull/3822#discussion_r994848144


##########
benchmarks/src/bin/parquet_filter_pushdown.rs:
##########
@@ -109,6 +111,13 @@ async fn main() -> Result<()> {
     Ok(())
 }
 
+#[derive(Debug, Clone)]
+struct ParquetScanOptions {

Review Comment:
   this was replicated for the benchmark code as I felt such a struct was the easiest to understand for this matrix strategy



##########
datafusion/core/src/physical_plan/file_format/parquet.rs:
##########
@@ -69,43 +72,6 @@ use parquet::file::{
 };
 use parquet::schema::types::ColumnDescriptor;
 
-#[derive(Debug, Clone, Default)]
-/// Specify options for the parquet scan
-pub struct ParquetScanOptions {

Review Comment:
   The key change to this PR is removing this structure and instead reading the values from a `ConfigOptions` that is threaded down.
   
   You can see in this PR there is already a structure for configuring parquet reading (`ParquetReadOptions`) so I actually think this will make the code less confusing to work with going forward. 



##########
datafusion/core/src/physical_plan/file_format/parquet.rs:
##########
@@ -944,17 +968,16 @@ mod tests {
                 projection,
                 limit: None,
                 table_partition_cols: vec![],
+                config_options: ConfigOptions::new().into_shareable(),
             },
             predicate,
             None,
         );
 
         if pushdown_predicate {
-            parquet_exec = parquet_exec.with_scan_options(
-                ParquetScanOptions::default()
-                    .with_pushdown_filters(true)
-                    .with_reorder_predicates(true),
-            );
+            parquet_exec = parquet_exec

Review Comment:
   Here is a good illustration of how the API changes (I think for the better)



##########
datafusion/core/src/execution/options.rs:
##########
@@ -160,10 +170,12 @@ pub struct ParquetReadOptions<'a> {
     pub table_partition_cols: Vec<String>,
     /// Should DataFusion parquet reader use the predicate to prune data,
     /// overridden by value on execution::context::SessionConfig
+    // TODO move this into ConfigOptions

Review Comment:
   I will do this as a follow on PR



##########
datafusion/core/src/physical_plan/file_format/parquet.rs:
##########
@@ -403,9 +426,9 @@ impl FileOpener for ParquetOpener {
         let projection = self.projection.clone();
         let pruning_predicate = self.pruning_predicate.clone();
         let table_schema = self.table_schema.clone();
-        let reorder_predicates = self.scan_options.reorder_predicates;
-        let pushdown_filters = self.scan_options.pushdown_filters;
-        let enable_page_index = self.scan_options.enable_page_index;
+        let reorder_predicates = self.reorder_filters;

Review Comment:
   I also took the opportunity to change to consistently use the word `filters` rather than `filters` and `predciates`



##########
datafusion/core/src/config.rs:
##########
@@ -255,8 +289,16 @@ impl ConfigOptions {
         Self { options }
     }
 
-    /// Create new ConfigOptions struct, taking values from environment variables where possible.
-    /// For example, setting `DATAFUSION_EXECUTION_BATCH_SIZE` to control `datafusion.execution.batch_size`.
+    /// Create a new [`ConfigOptions`] wrapped in an RwLock and Arc
+    pub fn into_shareable(self) -> Arc<RwLock<Self>> {
+        Arc::new(RwLock::new(self))
+    }
+
+    /// Create new ConfigOptions struct, taking values from
+    /// environment variables where possible.
+    ///
+    /// For example, setting `DATAFUSION_EXECUTION_BATCH_SIZE` will

Review Comment:
   I will add some documentation about this to the datafusion-cli docs as I couldn't find it when I was looking



##########
datafusion/core/src/config.rs:
##########
@@ -191,23 +205,43 @@ impl BuiltInConfigs {
              ConfigDefinition::new_u64(
                  OPT_COALESCE_TARGET_BATCH_SIZE,
                  format!("Target batch size when coalescing batches. Uses in conjunction with the \
-            configuration setting '{}'.", OPT_COALESCE_BATCHES),
+                          configuration setting '{}'.", OPT_COALESCE_BATCHES),
                  4096,
             ),
+            ConfigDefinition::new_string(

Review Comment:
   I moved this to be with the other settings



##########
datafusion/core/src/physical_plan/file_format/parquet.rs:
##########
@@ -191,15 +154,71 @@ impl ParquetExec {
         self
     }
 
-    /// Configure `ParquetScanOptions`
-    pub fn with_scan_options(mut self, scan_options: ParquetScanOptions) -> Self {
-        self.scan_options = scan_options;
+    /// If true, any filter [`Expr`]s on the scan will converted to a
+    /// [`RowFilter`](parquet::arrow::arrow_reader::RowFilter) in the
+    /// `ParquetRecordBatchStream`. These filters are applied by the
+    /// parquet decoder to skip unecessairly decoding other columns
+    /// which would not pass the predicate. Defaults to false
+    pub fn with_pushdown_filters(self, pushdown_filters: bool) -> Self {

Review Comment:
   This just uses the slightly messier config API to get/set the settings



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #3822: Expose parquet reader settings using normal DataFusion `ConfigOptions`

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #3822:
URL: https://github.com/apache/arrow-datafusion/pull/3822#discussion_r996292693


##########
datafusion/core/src/physical_plan/file_format/mod.rs:
##########
@@ -698,6 +699,7 @@ mod tests {
             projection,
             statistics,
             table_partition_cols,
+            config_options: ConfigOptions::new().into_shareable(),

Review Comment:
   I agree it is non obvious -- `ConfigOptions` was changed to be shareable  by @waitingkuo  in https://github.com/apache/arrow-datafusion/pull/3455
   
   Basically the usecase there was so that the configuration was owned by `SessionContext` but other parts could read it if necessary -- specifically,  `information_schema.df_settings` / `SHOW ALL` initially. This PR extends the concept so that the settings can be read by the parquet reader
   
   What would you think about moving the mutability handling into `ConfigOption` so this looks like
   
   ```rust
               config_options: ConfigOptions::new(),
   ```
   
   That would hide the details of shared ownership more nicely I think



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb merged pull request #3822: Expose parquet reader settings using normal DataFusion `ConfigOptions`

Posted by GitBox <gi...@apache.org>.
alamb merged PR #3822:
URL: https://github.com/apache/arrow-datafusion/pull/3822


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #3822: Expose parquet reader settings using normal DataFusion `ConfigOptions`

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #3822:
URL: https://github.com/apache/arrow-datafusion/pull/3822#issuecomment-1282456715

   I plan to merge this after it passes CI to keep the process going. I really like the `ConfigOptions` structure


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #3822: Expose parquet reader settings using normal DataFusion `ConfigOptions`

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #3822:
URL: https://github.com/apache/arrow-datafusion/pull/3822#discussion_r1000917099


##########
datafusion/core/src/physical_plan/file_format/mod.rs:
##########
@@ -698,6 +699,7 @@ mod tests {
             projection,
             statistics,
             table_partition_cols,
+            config_options: ConfigOptions::new().into_shareable(),

Review Comment:
   PR to clean up: https://github.com/apache/arrow-datafusion/pull/3909



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] thinkharderdev commented on pull request #3822: Expose parquet reader settings using normal DataFusion `ConfigOptions`

Posted by GitBox <gi...@apache.org>.
thinkharderdev commented on PR #3822:
URL: https://github.com/apache/arrow-datafusion/pull/3822#issuecomment-1278835737

   Nice! I will review this this evening or tomorrow depending on how the day goes


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] thinkharderdev commented on a diff in pull request #3822: Expose parquet reader settings using normal DataFusion `ConfigOptions`

Posted by GitBox <gi...@apache.org>.
thinkharderdev commented on code in PR #3822:
URL: https://github.com/apache/arrow-datafusion/pull/3822#discussion_r996289941


##########
datafusion/core/src/physical_plan/file_format/parquet.rs:
##########
@@ -191,15 +154,71 @@ impl ParquetExec {
         self
     }
 
-    /// Configure `ParquetScanOptions`
-    pub fn with_scan_options(mut self, scan_options: ParquetScanOptions) -> Self {
-        self.scan_options = scan_options;
+    /// If true, any filter [`Expr`]s on the scan will converted to a
+    /// [`RowFilter`](parquet::arrow::arrow_reader::RowFilter) in the
+    /// `ParquetRecordBatchStream`. These filters are applied by the
+    /// parquet decoder to skip unecessairly decoding other columns
+    /// which would not pass the predicate. Defaults to false
+    pub fn with_pushdown_filters(self, pushdown_filters: bool) -> Self {
+        self.base_config
+            .config_options
+            .write()
+            .set_bool(OPT_PARQUET_PUSHDOWN_FILTERS, pushdown_filters);
+        self
+    }
+
+    /// Return the value described in [`Self::with_pushdown_filters`]
+    pub fn pushdown_filters(&self) -> bool {
+        self.base_config
+            .config_options
+            .read()
+            .get_bool(OPT_PARQUET_PUSHDOWN_FILTERS)
+            // default to false
+            .unwrap_or_default()
+    }
+
+    /// If true, the `RowFilter` made by `pushdown_filters` may try to
+    /// minimize the cost of filter evaluation by reordering the
+    /// predicate [`Expr`]s. If false, the predicates are applied in
+    /// the same order as specified in the query. Defaults to false.
+    pub fn with_reorder_filters(self, reorder_filters: bool) -> Self {

Review Comment:
   See above, wrapping the options in a `Arc<RwLock<_>>` seems strange since this is already essentially an owned value. 



##########
datafusion/core/src/physical_plan/file_format/mod.rs:
##########
@@ -698,6 +699,7 @@ mod tests {
             projection,
             statistics,
             table_partition_cols,
+            config_options: ConfigOptions::new().into_shareable(),

Review Comment:
   Not sure I understand why we need the `into_shareable` here. Seems like this should just be an owned `ConfigOptions`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #3822: Expose parquet reader settings using normal DataFusion `ConfigOptions`

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #3822:
URL: https://github.com/apache/arrow-datafusion/pull/3822#discussion_r998288936


##########
datafusion/core/src/config.rs:
##########
@@ -255,8 +289,16 @@ impl ConfigOptions {
         Self { options }
     }
 
-    /// Create new ConfigOptions struct, taking values from environment variables where possible.
-    /// For example, setting `DATAFUSION_EXECUTION_BATCH_SIZE` to control `datafusion.execution.batch_size`.
+    /// Create a new [`ConfigOptions`] wrapped in an RwLock and Arc
+    pub fn into_shareable(self) -> Arc<RwLock<Self>> {
+        Arc::new(RwLock::new(self))
+    }
+
+    /// Create new ConfigOptions struct, taking values from
+    /// environment variables where possible.
+    ///
+    /// For example, setting `DATAFUSION_EXECUTION_BATCH_SIZE` will

Review Comment:
   https://github.com/apache/arrow-datafusion/pull/3825



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] thinkharderdev commented on a diff in pull request #3822: Expose parquet reader settings using normal DataFusion `ConfigOptions`

Posted by GitBox <gi...@apache.org>.
thinkharderdev commented on code in PR #3822:
URL: https://github.com/apache/arrow-datafusion/pull/3822#discussion_r996298397


##########
datafusion/core/src/physical_plan/file_format/mod.rs:
##########
@@ -698,6 +699,7 @@ mod tests {
             projection,
             statistics,
             table_partition_cols,
+            config_options: ConfigOptions::new().into_shareable(),

Review Comment:
   Yeah, that makes sense. If `ConfigOptions` is meant to be shareable then it can just hold a `Arc<RwLock<HashMap<_>>`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #3822: Expose parquet reader settings using normal DataFusion `ConfigOptions`

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #3822:
URL: https://github.com/apache/arrow-datafusion/pull/3822#discussion_r998292558


##########
datafusion/core/src/physical_plan/file_format/mod.rs:
##########
@@ -698,6 +699,7 @@ mod tests {
             projection,
             statistics,
             table_partition_cols,
+            config_options: ConfigOptions::new().into_shareable(),

Review Comment:
   Will do as a follow on PR



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #3822: Expose parquet reader settings using normal DataFusion `ConfigOptions`

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #3822:
URL: https://github.com/apache/arrow-datafusion/pull/3822#discussion_r998521622


##########
datafusion/core/src/physical_plan/file_format/mod.rs:
##########
@@ -698,6 +699,7 @@ mod tests {
             projection,
             statistics,
             table_partition_cols,
+            config_options: ConfigOptions::new().into_shareable(),

Review Comment:
   Filed https://github.com/apache/arrow-datafusion/issues/3886



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org