You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/11/29 22:19:39 UTC

[GitHub] [arrow-datafusion] alamb opened a new pull request, #4427: Expose remaining parquet config options into ConfigOptions (try 2)

alamb opened a new pull request, #4427:
URL: https://github.com/apache/arrow-datafusion/pull/4427

this is a reworked version of https://github.com/apache/arrow-datafusion/pull/3885

# Which issue does this PR close?

Closes https://github.com/apache/arrow-datafusion/issues/3821

This also helps towards #3887

# Rationale for this change
1. Make it easier for people to see what parquet config options are available will make it more likely they are used
2. The more mechanisms that configuration is supplied, the more likely it to confuse people

It turns out options for reading parquet files were able to be set (and possibly) overridden by no less than three different structures! This is confusing, to say the least.

# What changes are included in this PR?
1. move metadata_size_hint, enable_pruning, and merge_schema_metadata to new config options
2. Make the precidence of the parquet options passed down to the ParquetExec clear

# Are there any user-facing changes?
The main change is that now all parquet reader settings are visible session wide.

Previously, depending on which of the APIs was used to create / register / run parquet, the settings might change if you change the session config or they might have been a snapshot based on when you registered the reader

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] thinkharderdev commented on a diff in pull request #4427: Expose remaining parquet config options into ConfigOptions (try 2)

Posted by GitBox <gi...@apache.org>.

thinkharderdev commented on code in PR #4427:
URL: https://github.com/apache/arrow-datafusion/pull/4427#discussion_r1036980062


##########
docs/source/user-guide/configs.md:
##########
@@ -43,8 +43,11 @@ Environment variables are read during `SessionConfig` initialisation so they mus
 | datafusion.execution.coalesce_batches                     | Boolean | true    | When set to true, record batches will be examined between each operator and small batches will be coalesced into larger batches. This is helpful when there are highly selective filters or joins that could produce tiny output batches. The target batch size is determined by the configuration setting 'datafusion.execution.coalesce_target_batch_size'. |
 | datafusion.execution.coalesce_target_batch_size           | UInt64  | 4096    | Target batch size when coalescing batches. Uses in conjunction with the configuration setting 'datafusion.execution.coalesce_batches'.                                                                                                                                                                                                                        |
 | datafusion.execution.parquet.enable_page_index            | Boolean | false   | If true, uses parquet data page level metadata (Page Index) statistics to reduce the number of rows decoded.                                                                                                                                                                                                                                                  |
+| datafusion.execution.parquet.metadata_size_hint           | UInt64  | NULL    | If specified, the parquet reader will try and fetch the last `size_hint` bytes of the parquet file optimistically. If not specified, two read are required: One read to fetch the 8-byte parquet footer and another to fetch the metadata length encoded in the footer.                                                                                       |

Review Comment:
   ```suggestion
   | datafusion.execution.parquet.metadata_size_hint           | UInt64  | NULL    | If specified, the parquet reader will try and fetch the last `size_hint` bytes of the parquet file optimistically. If not specified, two reads are required: One read to fetch the 8-byte parquet footer and another to fetch the metadata length encoded in the footer.                                                                                       |
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #4427: Expose remaining parquet config options into ConfigOptions (try 2)

Posted by GitBox <gi...@apache.org>.

alamb commented on code in PR #4427:
URL: https://github.com/apache/arrow-datafusion/pull/4427#discussion_r1037291870


##########
datafusion/core/src/config.rs:
##########
@@ -255,6 +265,29 @@ impl BuiltInConfigs {
                  to reduce the number of rows decoded.",
                 false,
             ),
+            ConfigDefinition::new_bool(
+                OPT_PARQUET_ENABLE_PRUNING,
+                "If true, the parquet reader attempts to skip entire row groups based \
+                 on the predicate in the query and the metadata (min/max values) stored in \
+                 the parquet file.",
+                true,
+            ),
+            ConfigDefinition::new_bool(
+                OPT_PARQUET_SKIP_METADATA,
+                "If true, the parquet reader skip the optional embedded metadata that may be in \
+                the file Schema. This setting can help avoid schema conflicts when querying \
+                multiple parquet files with schemas containing compatible types but different metadata.",
+                true,
+            ),
+            ConfigDefinition::new(
+                OPT_PARQUET_METADATA_SIZE_HINT,
+                "If specified, the parquet reader will try and fetch the last `size_hint` \
+                 bytes of the parquet file optimistically. If not specified, two read are required: \
+                 One read to fetch the 8-byte parquet footer and  \
+                 another to fetch the metadata length encoded in the footer.",
+                DataType::UInt64,
+                ScalarValue::UInt64(None),

Review Comment:
   Filed https://github.com/apache/arrow-datafusion/issues/4459 to track



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb merged pull request #4427: Expose remaining parquet config options into ConfigOptions (try 2)

Posted by GitBox <gi...@apache.org>.

alamb merged PR #4427:
URL: https://github.com/apache/arrow-datafusion/pull/4427


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] ursabot commented on pull request #4427: Expose remaining parquet config options into ConfigOptions (try 2)

Posted by GitBox <gi...@apache.org>.

ursabot commented on PR #4427:
URL: https://github.com/apache/arrow-datafusion/pull/4427#issuecomment-1334048651

   Benchmark runs are scheduled for baseline = 09aea09b6e7bd2950c0f57296ccb1a1995570057 and contender = fb8eeb2ac80adeb872a32fc95cb29dbc17f0de60. fb8eeb2ac80adeb872a32fc95cb29dbc17f0de60 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/d17708e63fdf45978d237068108d36ca...fadcfc10cc07483bb3648b73f6468a85/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] [test-mac-arm](https://conbench.ursa.dev/compare/runs/950bd4ffe06045139e7699445cc36a22...aaaf3a8a7a4648be9004dba2d950ddff/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/9f5d78b8e5ce4e36b95207cb6afbc7e2...bc0bc08b71034cc3a61b65ef4c964541/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/a578ac0ae7124f7faa2034f585049f09...644f98d31068413791b30c0864194a38/)
   Buildkite builds:
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #4427: Expose remaining parquet config options into ConfigOptions (try 2)

Posted by GitBox <gi...@apache.org>.

alamb commented on code in PR #4427:
URL: https://github.com/apache/arrow-datafusion/pull/4427#discussion_r1035340983


##########
benchmarks/src/bin/tpch.rs:
##########
@@ -396,7 +396,8 @@ async fn get_table(
             }
             "parquet" => {
                 let path = format!("{}/{}", path, table);
-                let format = ParquetFormat::default().with_enable_pruning(true);
+                let format = ParquetFormat::new(ctx.config_options())

Review Comment:
   As the parquet format now reads defaults from `ConfigOptions` they need to be passed to the constructor



##########
datafusion/core/src/datasource/file_format/parquet.rs:
##########
@@ -52,51 +57,69 @@ pub const DEFAULT_PARQUET_EXTENSION: &str = ".parquet";
 /// The Apache Parquet `FileFormat` implementation
 #[derive(Debug)]
 pub struct ParquetFormat {
-    enable_pruning: bool,
+    // Session level configuration
+    config_options: Arc<RwLock<ConfigOptions>>,
+    // local overides
+    enable_pruning: Option<bool>,

Review Comment:
   By changing these to `Option` I think it is now clearer that if they are left at default, the (documented) value from ConfigOptions is used



##########
datafusion/core/src/execution/context.rs:
##########
@@ -814,13 +818,7 @@ impl SessionContext {
         table_path: &str,
         options: ParquetReadOptions<'_>,
     ) -> Result<()> {
-        let (target_partitions, parquet_pruning) = {
-            let conf = self.copied_config();
-            (conf.target_partitions, conf.parquet_pruning)
-        };
-        let listing_options = options
-            .parquet_pruning(parquet_pruning)
-            .to_listing_options(target_partitions);
+        let listing_options = options.to_listing_options(&self.state.read().config);

Review Comment:
   In some ways I think this is cleaner as now the options make it to the parquet reader, where as before there are places like this that copy some (but not all) of the settings around



##########
datafusion/core/tests/sql/information_schema.rs:
##########
@@ -707,8 +707,11 @@ async fn show_all() {
         "| datafusion.execution.coalesce_batches                     | true    |",
         "| datafusion.execution.coalesce_target_batch_size           | 4096    |",
         "| datafusion.execution.parquet.enable_page_index            | false   |",
+        "| datafusion.execution.parquet.metadata_size_hint           | NULL    |",

Review Comment:
   🎉  one can now see much more explicitly both 1) what the parquet options are and 2) what their default values are



##########
datafusion/core/src/execution/context.rs:
##########
@@ -1183,7 +1179,6 @@ impl Default for SessionConfig {
             repartition_joins: true,
             repartition_aggregations: true,
             repartition_windows: true,
-            parquet_pruning: true,

Review Comment:
   yet another copy of this setting!



##########
datafusion/core/src/config.rs:
##########
@@ -255,6 +265,29 @@ impl BuiltInConfigs {
                  to reduce the number of rows decoded.",
                 false,
             ),
+            ConfigDefinition::new_bool(
+                OPT_PARQUET_ENABLE_PRUNING,
+                "If true, the parquet reader attempts to skip entire row groups based \
+                 on the predicate in the query and the metadata (min/max values) stored in \
+                 the parquet file.",
+                true,
+            ),
+            ConfigDefinition::new_bool(
+                OPT_PARQUET_SKIP_METADATA,
+                "If true, the parquet reader skip the optional embedded metadata that may be in \
+                the file Schema. This setting can help avoid schema conflicts when querying \
+                multiple parquet files with schemas containing compatible types but different metadata.",
+                true,
+            ),
+            ConfigDefinition::new(
+                OPT_PARQUET_METADATA_SIZE_HINT,
+                "If specified, the parquet reader will try and fetch the last `size_hint` \
+                 bytes of the parquet file optimistically. If not specified, two read are required: \
+                 One read to fetch the 8-byte parquet footer and  \
+                 another to fetch the metadata length encoded in the footer.",
+                DataType::UInt64,
+                ScalarValue::UInt64(None),

Review Comment:
   As mentioned by @thinkharderdev on https://github.com/apache/arrow-datafusion/pull/3885#discussion_r1032437474 we should probably change this default to something reasonable (like 64K) but I would rather do that in a follow on PR



##########
datafusion/core/src/execution/options.rs:
##########
@@ -167,42 +169,40 @@ pub struct ParquetReadOptions<'a> {
     pub file_extension: &'a str,
     /// Partition Columns
     pub table_partition_cols: Vec<(String, DataType)>,
-    /// Should DataFusion parquet reader use the predicate to prune data,
-    /// overridden by value on execution::context::SessionConfig
-    // TODO move this into ConfigOptions
-    pub parquet_pruning: bool,
-    /// Tell the parquet reader to skip any metadata that may be in
-    /// the file Schema. This can help avoid schema conflicts due to
-    /// metadata.  Defaults to true.
-    // TODO move this into ConfigOptions
-    pub skip_metadata: bool,
+    /// Should the parquet reader use the predicate to prune row groups?

Review Comment:
   again, here make it clear that these settings are overrides to the defaults on the session configuration



##########
datafusion/core/src/physical_plan/file_format/parquet.rs:
##########
@@ -72,6 +72,16 @@ use super::get_output_ordering;
 /// Execution plan for scanning one or more Parquet partitions
 #[derive(Debug, Clone)]
 pub struct ParquetExec {
+    /// Override for `Self::with_pushdown_filters`. If None, uses

Review Comment:
   I also added this back in so that the overrides set directly on a `ParquetExec` do not affect the global configuration options



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org