Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/11/29 22:27:03 UTC

[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #4427: Expose remaining parquet config options into ConfigOptions (try 2)

alamb commented on code in PR #4427:
URL: https://github.com/apache/arrow-datafusion/pull/4427#discussion_r1035340983


##########
benchmarks/src/bin/tpch.rs:
##########
@@ -396,7 +396,8 @@ async fn get_table(
             }
             "parquet" => {
                 let path = format!("{}/{}", path, table);
-                let format = ParquetFormat::default().with_enable_pruning(true);
+                let format = ParquetFormat::new(ctx.config_options())

Review Comment:
   As the parquet format now reads its defaults from `ConfigOptions`, those options need to be passed to the constructor.
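
   A minimal sketch of how the full call site could read after this change; the `.with_enable_pruning(Some(true))` continuation is an assumption based on the `Option` change discussed below, while the constructor call itself comes from the diff:

       // Hypothetical continuation of the diff above: build the format from the
       // session's shared ConfigOptions so it picks up the documented defaults,
       // then apply an explicit per-table override on top.
       let format = ParquetFormat::new(ctx.config_options())
           .with_enable_pruning(Some(true));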



##########
datafusion/core/src/datasource/file_format/parquet.rs:
##########
@@ -52,51 +57,69 @@ pub const DEFAULT_PARQUET_EXTENSION: &str = ".parquet";
 /// The Apache Parquet `FileFormat` implementation
 #[derive(Debug)]
 pub struct ParquetFormat {
-    enable_pruning: bool,
+    // Session level configuration
+    config_options: Arc<RwLock<ConfigOptions>>,
+    // local overrides
+    enable_pruning: Option<bool>,

Review Comment:
   By changing these to `Option`, I think it is now clearer that when they are left unset, the (documented) value from `ConfigOptions` is used.
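
   A self-contained sketch of the pattern with stand-in types (illustrative names, not the PR's exact code): the local `Option` wins when set, otherwise the value falls back to the session-level `ConfigOptions`:

       use std::sync::{Arc, RwLock};

       // Stand-in for DataFusion's ConfigOptions; the field name is illustrative only.
       struct ConfigOptions {
           parquet_enable_pruning: bool,
       }

       struct ParquetFormatSketch {
           // Session level configuration shared with the rest of the context
           config_options: Arc<RwLock<ConfigOptions>>,
           // Local override; None means "use the documented session default"
           enable_pruning: Option<bool>,
       }

       impl ParquetFormatSketch {
           fn enable_pruning(&self) -> bool {
               self.enable_pruning.unwrap_or_else(|| {
                   self.config_options.read().unwrap().parquet_enable_pruning
               })
           }
       }

   Keeping the override as an `Option` means a builder that was never called simply defers to whatever default the session documents.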



##########
datafusion/core/src/execution/context.rs:
##########
@@ -814,13 +818,7 @@ impl SessionContext {
         table_path: &str,
         options: ParquetReadOptions<'_>,
     ) -> Result<()> {
-        let (target_partitions, parquet_pruning) = {
-            let conf = self.copied_config();
-            (conf.target_partitions, conf.parquet_pruning)
-        };
-        let listing_options = options
-            .parquet_pruning(parquet_pruning)
-            .to_listing_options(target_partitions);
+        let listing_options = options.to_listing_options(&self.state.read().config);

Review Comment:
   In some ways I think this is cleaner, as now the options make it all the way to the parquet reader, whereas before there were places like this that copied some (but not all) of the settings around.



##########
datafusion/core/tests/sql/information_schema.rs:
##########
@@ -707,8 +707,11 @@ async fn show_all() {
         "| datafusion.execution.coalesce_batches                     | true    |",
         "| datafusion.execution.coalesce_target_batch_size           | 4096    |",
         "| datafusion.execution.parquet.enable_page_index            | false   |",
+        "| datafusion.execution.parquet.metadata_size_hint           | NULL    |",

Review Comment:
   🎉  one can now see much more explicitly both 1) what the parquet options are and 2) what their default values are
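
   For reference, a small usage sketch (assuming a DataFusion version with these options registered) that prints the same listing from Rust:

       use datafusion::error::Result;
       use datafusion::prelude::*;

       #[tokio::main]
       async fn main() -> Result<()> {
           // SHOW ALL requires the information schema to be enabled on the session
           let config = SessionConfig::new().with_information_schema(true);
           let ctx = SessionContext::with_config(config);

           // Lists every config option and its current value, including the
           // datafusion.execution.parquet.* entries and their documented defaults
           ctx.sql("SHOW ALL").await?.show().await?;
           Ok(())
       }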



##########
datafusion/core/src/execution/context.rs:
##########
@@ -1183,7 +1179,6 @@ impl Default for SessionConfig {
             repartition_joins: true,
             repartition_aggregations: true,
             repartition_windows: true,
-            parquet_pruning: true,

Review Comment:
   yet another copy of this setting!



##########
datafusion/core/src/config.rs:
##########
@@ -255,6 +265,29 @@ impl BuiltInConfigs {
                  to reduce the number of rows decoded.",
                 false,
             ),
+            ConfigDefinition::new_bool(
+                OPT_PARQUET_ENABLE_PRUNING,
+                "If true, the parquet reader attempts to skip entire row groups based \
+                 on the predicate in the query and the metadata (min/max values) stored in \
+                 the parquet file.",
+                true,
+            ),
+            ConfigDefinition::new_bool(
+                OPT_PARQUET_SKIP_METADATA,
+                "If true, the parquet reader skip the optional embedded metadata that may be in \
+                the file Schema. This setting can help avoid schema conflicts when querying \
+                multiple parquet files with schemas containing compatible types but different metadata.",
+                true,
+            ),
+            ConfigDefinition::new(
+                OPT_PARQUET_METADATA_SIZE_HINT,
+                "If specified, the parquet reader will try and fetch the last `size_hint` \
+                 bytes of the parquet file optimistically. If not specified, two read are required: \
+                 One read to fetch the 8-byte parquet footer and  \
+                 another to fetch the metadata length encoded in the footer.",
+                DataType::UInt64,
+                ScalarValue::UInt64(None),

Review Comment:
   As mentioned by @thinkharderdev on https://github.com/apache/arrow-datafusion/pull/3885#discussion_r1032437474, we should probably change this default to something reasonable (like 64K), but I would rather do that in a follow-on PR.
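
   For illustration, a hedged sketch of overriding the hint per session so the reader fetches the tail of the file in a single optimistic read; the `set_u64` builder name is an assumption, while the option key itself comes from this PR:

       use datafusion::prelude::*;

       fn main() {
           // 64 KiB matches the follow-on default suggested above; `set_u64` is an
           // assumed SessionConfig setter for the new config key.
           let config = SessionConfig::new()
               .set_u64("datafusion.execution.parquet.metadata_size_hint", 64 * 1024);
           let _ctx = SessionContext::with_config(config);
       }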



##########
datafusion/core/src/execution/options.rs:
##########
@@ -167,42 +169,40 @@ pub struct ParquetReadOptions<'a> {
     pub file_extension: &'a str,
     /// Partition Columns
     pub table_partition_cols: Vec<(String, DataType)>,
-    /// Should DataFusion parquet reader use the predicate to prune data,
-    /// overridden by value on execution::context::SessionConfig
-    // TODO move this into ConfigOptions
-    pub parquet_pruning: bool,
-    /// Tell the parquet reader to skip any metadata that may be in
-    /// the file Schema. This can help avoid schema conflicts due to
-    /// metadata.  Defaults to true.
-    // TODO move this into ConfigOptions
-    pub skip_metadata: bool,
+    /// Should the parquet reader use the predicate to prune row groups?

Review Comment:
   Again, here I make it clear that these settings are overrides of the defaults from the session configuration.



##########
datafusion/core/src/physical_plan/file_format/parquet.rs:
##########
@@ -72,6 +72,16 @@ use super::get_output_ordering;
 /// Execution plan for scanning one or more Parquet partitions
 #[derive(Debug, Clone)]
 pub struct ParquetExec {
+    /// Override for `Self::with_pushdown_filters`. If None, uses

Review Comment:
   I also added this back in so that the overrides set directly on a `ParquetExec` do not affect the global configuration options



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org