You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/10/18 17:20:21 UTC

[GitHub] [arrow-datafusion] alamb opened a new pull request, #3885: Consolidate remaining parquet config options into ConfigOptions

alamb opened a new pull request, #3885:
URL: https://github.com/apache/arrow-datafusion/pull/3885

   Draft as it builds on https://github.com/apache/arrow-datafusion/pull/3822
   
   # Which issue does this PR close?
   
   Closes https://github.com/apache/arrow-datafusion/issues/3821
   
    # Rationale for this change
   1. making it easier for people to see what config options are available will make it more likely they are used
   2. The more mechanisms that configuration is supplied, the more likely it to confuse people
   
   This is especially important to us as we move to use more and more of the parquet code directly in IOx (e.g. https://github.com/influxdata/influxdb_iox/issues/5887 and https://github.com/influxdata/influxdb_iox/issues/5897)
   
   
   # What changes are included in this PR?
   1. move metadata_size_hint, enable_pruning, and merge_schema_metadata to new config options
   2. Remove redundancy and config overrides
   
   # Are there any user-facing changes?
   Hopefully not
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #3885: Consolidate remaining parquet config options into ConfigOptions

Posted by GitBox <gi...@apache.org>.

alamb commented on code in PR #3885:
URL: https://github.com/apache/arrow-datafusion/pull/3885#discussion_r1035276922


##########
datafusion/core/src/config.rs:
##########
@@ -237,6 +247,29 @@ impl BuiltInConfigs {
                  to reduce the number of rows decoded.",
                 false,
             ),
+            ConfigDefinition::new_bool(
+                OPT_PARQUET_ENABLE_PRUNING,
+                "If true, the parquet reader attempts to skip entire row groups based \
+                 on the predicate in the query and the metadata (min/max values) stored in \
+                 the parquet file.",
+                true,
+            ),
+            ConfigDefinition::new_bool(
+                OPT_PARQUET_SKIP_METADATA,
+                "If true, the parquet reader skip the optional embedded metadata that may be in \
+                 the file Schema. This setting can help avoid schema conflicts when querying \
+                 multiple parquet files with schemas containing compatible types but different metadata.",
+                true,
+            ),
+            ConfigDefinition::new(
+                OPT_PARQUET_METADATA_SIZE_HINT,
+                "If specified, the parquet reader will try and fetch the last `size_hint` \
+                 bytes of the parquet file optimistically. If not specified, two read are required: \
+                 One read to fetch the 8-byte parquet footer and  \
+                 another to fetch the metadata length encoded in the footer.",
+                DataType::UInt64,
+                ScalarValue::UInt64(None),

Review Comment:
   Upon more thought I would like to propose we do the "consolidate config code" and "change default value" in separate PRs (which I think will be easier to see / review once the configuration is consolidated)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on pull request #3885: Consolidate remaining parquet config options into ConfigOptions

Posted by GitBox <gi...@apache.org>.

alamb commented on PR #3885:
URL: https://github.com/apache/arrow-datafusion/pull/3885#issuecomment-1325738963

   Here is a writeup of a way to improve the configuration situation somewhat https://github.com/apache/arrow-datafusion/issues/4349


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on pull request #3885: Consolidate remaining parquet config options into ConfigOptions

Posted by GitBox <gi...@apache.org>.

alamb commented on PR #3885:
URL: https://github.com/apache/arrow-datafusion/pull/3885#issuecomment-1328050035

   FWIW My plan for this PR is:
   1. To leave any the current places to configure parquet settings per ParquetExec as overrrides (`Option<val>`) 
   2. Expose remaining options via `ConfigOptions` (and make them visible, etc)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #3885: Consolidate remaining parquet config options into ConfigOptions

Posted by GitBox <gi...@apache.org>.

alamb commented on code in PR #3885:
URL: https://github.com/apache/arrow-datafusion/pull/3885#discussion_r1000878323


##########
datafusion/core/src/datasource/file_format/parquet.rs:
##########
@@ -52,59 +57,70 @@ pub const DEFAULT_PARQUET_EXTENSION: &str = ".parquet";
 /// The Apache Parquet `FileFormat` implementation
 #[derive(Debug)]
 pub struct ParquetFormat {
-    enable_pruning: bool,

Review Comment:
   Here was one copy of (some) of the settings



##########
datafusion/core/src/execution/context.rs:
##########
@@ -1142,8 +1146,6 @@ pub struct SessionConfig {
     /// Should DataFusion repartition data using the partition keys to execute window functions in
     /// parallel using the provided `target_partitions` level
     pub repartition_windows: bool,
-    /// Should DataFusion parquet reader using the predicate to prune data
-    pub parquet_pruning: bool,

Review Comment:
   here is a second copy of one of the settings



##########
datafusion/core/src/execution/options.rs:
##########
@@ -168,56 +170,31 @@ pub struct ParquetReadOptions<'a> {
     pub file_extension: &'a str,
     /// Partition Columns
     pub table_partition_cols: Vec<String>,
-    /// Should DataFusion parquet reader use the predicate to prune data,

Review Comment:
   here is a third copy, again of some of the settings



##########
datafusion/core/tests/sql/information_schema.rs:
##########
@@ -702,8 +702,11 @@ async fn show_all() {
         "| datafusion.execution.coalesce_batches           | true    |",
         "| datafusion.execution.coalesce_target_batch_size | 4096    |",
         "| datafusion.execution.parquet.enable_page_index  | false   |",
+        "| datafusion.execution.parquet.metadata_size_hint | NULL    |",

Review Comment:
   defaults are now even clearer!



##########
datafusion/core/tests/sql/parquet_schema.rs:
##########
@@ -141,6 +139,13 @@ async fn schema_merge_can_preserve_metadata() {
     let table_path = table_dir.to_str().unwrap().to_string();
 
     let ctx = SessionContext::new();
+
+    // explicitly disable schema clearing
+    ctx.config_options()

Review Comment:
   Ideally we will have a nicer API for this: https://github.com/apache/arrow-datafusion/issues/3908



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] tustvold commented on a diff in pull request #3885: Consolidate remaining parquet config options into ConfigOptions

Posted by GitBox <gi...@apache.org>.

tustvold commented on code in PR #3885:
URL: https://github.com/apache/arrow-datafusion/pull/3885#discussion_r1002820504


##########
datafusion/core/src/config.rs:
##########
@@ -237,6 +247,29 @@ impl BuiltInConfigs {
                  to reduce the number of rows decoded.",
                 false,
             ),
+            ConfigDefinition::new_bool(
+                OPT_PARQUET_ENABLE_PRUNING,
+                "If true, the parquet reader attempts to skip entire row groups based \
+                 on the predicate in the query and the metadata (min/max values) stored in \
+                 the parquet file.",
+                true,
+            ),
+            ConfigDefinition::new_bool(
+                OPT_PARQUET_SKIP_METADATA,
+                "If true, the parquet reader skip the optional embedded metadata that may be in \
+                 the file Schema. This setting can help avoid schema conflicts when querying \
+                 multiple parquet files with schemas containing compatible types but different metadata.",
+                true,
+            ),
+            ConfigDefinition::new(
+                OPT_PARQUET_METADATA_SIZE_HINT,
+                "If specified, the parquet reader will try and fetch the last `size_hint` \
+                 bytes of the parquet file optimistically. If not specified, two read are required: \
+                 One read to fetch the 8-byte parquet footer and  \
+                 another to fetch the metadata length encoded in the footer.",
+                DataType::Boolean,

Review Comment:
   Why is this a boolean?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on pull request #3885: Consolidate remaining parquet config options into ConfigOptions

Posted by GitBox <gi...@apache.org>.

alamb commented on PR #3885:
URL: https://github.com/apache/arrow-datafusion/pull/3885#issuecomment-1289628826

   > I'm really struggling to review this PR as I'm honestly lost trying to understand the design of the config system...
   
   Yes, I agree it is very confusing. 
   
   Would you find it less confusing if the physical planner got a `SessionConfig` rather than `ConfigOptions`?
   
   
   > Plumbing this down into the physical plans therefore feels a bit like it is coupling together parts of the system that shouldn't be coupled together? I would have expected the physical operators to only be exposed to TaskContext, with SessionContext solely used for planning.
   
   I think the `SessionContext` has a `SessionState` which has  a `SessionConfig` which has a `ConfigOptions` 🤯 
   
   TaskContext has `TaskProperties` which is either a `SessionConfig` or some k=v pairs
   
   So in that view of things  your expectation "expected the physical operators to only be exposed to TaskContext, with SessionContext solely used for planning." does hold
   
   In my mind `ConfigOptions` and `SessionConfig` serve the same purpose and eventually they will be unified (using the implementation of `ConfigOptions`. 
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on pull request #3885: Consolidate remaining parquet config options into ConfigOptions

Posted by GitBox <gi...@apache.org>.

alamb commented on PR #3885:
URL: https://github.com/apache/arrow-datafusion/pull/3885#issuecomment-1291076213

   Marking as a draft until I get a writeup and some consensus


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] thinkharderdev commented on a diff in pull request #3885: Consolidate remaining parquet config options into ConfigOptions

Posted by GitBox <gi...@apache.org>.

thinkharderdev commented on code in PR #3885:
URL: https://github.com/apache/arrow-datafusion/pull/3885#discussion_r1032437474


##########
datafusion/core/src/config.rs:
##########
@@ -237,6 +247,29 @@ impl BuiltInConfigs {
                  to reduce the number of rows decoded.",
                 false,
             ),
+            ConfigDefinition::new_bool(
+                OPT_PARQUET_ENABLE_PRUNING,
+                "If true, the parquet reader attempts to skip entire row groups based \
+                 on the predicate in the query and the metadata (min/max values) stored in \
+                 the parquet file.",
+                true,
+            ),
+            ConfigDefinition::new_bool(
+                OPT_PARQUET_SKIP_METADATA,
+                "If true, the parquet reader skip the optional embedded metadata that may be in \
+                 the file Schema. This setting can help avoid schema conflicts when querying \
+                 multiple parquet files with schemas containing compatible types but different metadata.",
+                true,
+            ),
+            ConfigDefinition::new(
+                OPT_PARQUET_METADATA_SIZE_HINT,
+                "If specified, the parquet reader will try and fetch the last `size_hint` \
+                 bytes of the parquet file optimistically. If not specified, two read are required: \
+                 One read to fetch the 8-byte parquet footer and  \
+                 another to fetch the metadata length encoded in the footer.",
+                DataType::UInt64,
+                ScalarValue::UInt64(None),

Review Comment:
   Do we want to default this something? Back in the day we would default to reading the last 64k to try and capture the entire footer in a single read which seems like a sensible default (especially for object storage where the cost of an additional read is very expensive relative to reading a bit more data in the first read)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on pull request #3885: Consolidate remaining parquet config options into ConfigOptions

Posted by GitBox <gi...@apache.org>.

alamb commented on PR #3885:
URL: https://github.com/apache/arrow-datafusion/pull/3885#issuecomment-1331401524

   Updated version in https://github.com/apache/arrow-datafusion/pull/4427


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #3885: Consolidate remaining parquet config options into ConfigOptions

Posted by GitBox <gi...@apache.org>.

alamb commented on code in PR #3885:
URL: https://github.com/apache/arrow-datafusion/pull/3885#discussion_r1032668603


##########
datafusion/core/src/config.rs:
##########
@@ -237,6 +247,29 @@ impl BuiltInConfigs {
                  to reduce the number of rows decoded.",
                 false,
             ),
+            ConfigDefinition::new_bool(
+                OPT_PARQUET_ENABLE_PRUNING,
+                "If true, the parquet reader attempts to skip entire row groups based \
+                 on the predicate in the query and the metadata (min/max values) stored in \
+                 the parquet file.",
+                true,
+            ),
+            ConfigDefinition::new_bool(
+                OPT_PARQUET_SKIP_METADATA,
+                "If true, the parquet reader skip the optional embedded metadata that may be in \
+                 the file Schema. This setting can help avoid schema conflicts when querying \
+                 multiple parquet files with schemas containing compatible types but different metadata.",
+                true,
+            ),
+            ConfigDefinition::new(
+                OPT_PARQUET_METADATA_SIZE_HINT,
+                "If specified, the parquet reader will try and fetch the last `size_hint` \
+                 bytes of the parquet file optimistically. If not specified, two read are required: \
+                 One read to fetch the 8-byte parquet footer and  \
+                 another to fetch the metadata length encoded in the footer.",
+                DataType::UInt64,
+                ScalarValue::UInt64(None),

Review Comment:
   64K seems reasonable to me



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] tustvold commented on a diff in pull request #3885: Consolidate remaining parquet config options into ConfigOptions

Posted by GitBox <gi...@apache.org>.

tustvold commented on code in PR #3885:
URL: https://github.com/apache/arrow-datafusion/pull/3885#discussion_r1002828791


##########
datafusion/core/src/execution/options.rs:
##########
@@ -168,56 +170,31 @@ pub struct ParquetReadOptions<'a> {
     pub file_extension: &'a str,
     /// Partition Columns
     pub table_partition_cols: Vec<String>,
-    /// Should DataFusion parquet reader use the predicate to prune data,
-    /// overridden by value on execution::context::SessionConfig
-    // TODO move this into ConfigOptions
-    pub parquet_pruning: bool,
-    /// Tell the parquet reader to skip any metadata that may be in
-    /// the file Schema. This can help avoid schema conflicts due to
-    /// metadata.  Defaults to true.
-    // TODO move this into ConfigOptions
-    pub skip_metadata: bool,
 }
 
 impl<'a> Default for ParquetReadOptions<'a> {
     fn default() -> Self {
-        let format_default = ParquetFormat::default();
-
         Self {
             file_extension: DEFAULT_PARQUET_EXTENSION,
             table_partition_cols: vec![],
-            parquet_pruning: format_default.enable_pruning(),
-            skip_metadata: format_default.skip_metadata(),
         }
     }
 }
 
 impl<'a> ParquetReadOptions<'a> {
-    /// Specify parquet_pruning
-    pub fn parquet_pruning(mut self, parquet_pruning: bool) -> Self {

Review Comment:
   Doesn't this represent a functionality loss, previously you could configure these options on per-datasource basis. Now they are located on ConfigOptions which is session global?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #3885: Consolidate remaining parquet config options into ConfigOptions

Posted by GitBox <gi...@apache.org>.

alamb commented on code in PR #3885:
URL: https://github.com/apache/arrow-datafusion/pull/3885#discussion_r1004760255


##########
datafusion/core/src/execution/options.rs:
##########
@@ -168,56 +170,31 @@ pub struct ParquetReadOptions<'a> {
     pub file_extension: &'a str,
     /// Partition Columns
     pub table_partition_cols: Vec<String>,
-    /// Should DataFusion parquet reader use the predicate to prune data,
-    /// overridden by value on execution::context::SessionConfig
-    // TODO move this into ConfigOptions
-    pub parquet_pruning: bool,
-    /// Tell the parquet reader to skip any metadata that may be in
-    /// the file Schema. This can help avoid schema conflicts due to
-    /// metadata.  Defaults to true.
-    // TODO move this into ConfigOptions
-    pub skip_metadata: bool,
 }
 
 impl<'a> Default for ParquetReadOptions<'a> {
     fn default() -> Self {
-        let format_default = ParquetFormat::default();
-
         Self {
             file_extension: DEFAULT_PARQUET_EXTENSION,
             table_partition_cols: vec![],
-            parquet_pruning: format_default.enable_pruning(),
-            skip_metadata: format_default.skip_metadata(),
         }
     }
 }
 
 impl<'a> ParquetReadOptions<'a> {
-    /// Specify parquet_pruning
-    pub fn parquet_pruning(mut self, parquet_pruning: bool) -> Self {

Review Comment:
   I would say "kind of" -- previously you could configure the options on a per-datasource basis but depending on exactly what codepath you used and what other options were set your settings might or might not get overridden.
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb closed pull request #3885: Consolidate remaining parquet config options into ConfigOptions

Posted by GitBox <gi...@apache.org>.

alamb closed pull request #3885: Consolidate remaining parquet config options into ConfigOptions
URL: https://github.com/apache/arrow-datafusion/pull/3885


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on pull request #3885: Consolidate remaining parquet config options into ConfigOptions

Posted by GitBox <gi...@apache.org>.

alamb commented on PR #3885:
URL: https://github.com/apache/arrow-datafusion/pull/3885#issuecomment-1282753423

   There are still some tests and code updates required for this PR


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on pull request #3885: Consolidate remaining parquet config options into ConfigOptions

Posted by GitBox <gi...@apache.org>.

alamb commented on PR #3885:
URL: https://github.com/apache/arrow-datafusion/pull/3885#issuecomment-1289630809

   Here is my plan: I will write up a ticket explaining the current state of affairs in terms of config (and I will make a diagram) and we can use that to come to consensus on where we want to head. 
   
   Given the current mess of parquet options (has 3 *other* places to place config values *in addition* to the SessonConfig / ConfigOptions I think this PR is an improvement anyways.
   
   We'll see if @andygrove  has any thoughts


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #3885: Consolidate remaining parquet config options into ConfigOptions

Posted by GitBox <gi...@apache.org>.

alamb commented on code in PR #3885:
URL: https://github.com/apache/arrow-datafusion/pull/3885#discussion_r1003772616


##########
datafusion/core/src/config.rs:
##########
@@ -237,6 +247,29 @@ impl BuiltInConfigs {
                  to reduce the number of rows decoded.",
                 false,
             ),
+            ConfigDefinition::new_bool(
+                OPT_PARQUET_ENABLE_PRUNING,
+                "If true, the parquet reader attempts to skip entire row groups based \
+                 on the predicate in the query and the metadata (min/max values) stored in \
+                 the parquet file.",
+                true,
+            ),
+            ConfigDefinition::new_bool(
+                OPT_PARQUET_SKIP_METADATA,
+                "If true, the parquet reader skip the optional embedded metadata that may be in \
+                 the file Schema. This setting can help avoid schema conflicts when querying \
+                 multiple parquet files with schemas containing compatible types but different metadata.",
+                true,
+            ),
+            ConfigDefinition::new(
+                OPT_PARQUET_METADATA_SIZE_HINT,
+                "If specified, the parquet reader will try and fetch the last `size_hint` \
+                 bytes of the parquet file optimistically. If not specified, two read are required: \
+                 One read to fetch the 8-byte parquet footer and  \
+                 another to fetch the metadata length encoded in the footer.",
+                DataType::Boolean,

Review Comment:
   You are right -- it should be u64 -- fixed in e5e3be076



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org