You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/07/06 21:11:32 UTC

[GitHub] [iceberg] rdblue opened a new pull request, #5215: Core: Update MetricsConfig to use a default for first 32 columns

rdblue opened a new pull request, #5215:
URL: https://github.com/apache/iceberg/pull/5215

   #3959 updated `MetricsConfig` to stop writing metrics by default for tables with more than 32 columns. The intent was to avoid having too much metadata stored as stats in manifest files, but the implementation stopped writing metrics for all columns after a table reached 32 columns, which caused a large change in table performance after a table grows to 32+ columns.
   
   This updates the logic so that the first 32 columns still have metrics, but no new metrics are stored for columns after that point unless the table has a global default set or columns are individually configured.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a diff in pull request #5215: Core: Update MetricsConfig to use a default for first 32 columns

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on code in PR #5215:
URL: https://github.com/apache/iceberg/pull/5215#discussion_r916072054


##########
core/src/main/java/org/apache/iceberg/MetricsConfig.java:
##########
@@ -136,50 +131,58 @@ public static MetricsConfig forPositionDelete(Table table) {
   }
 
   /**
-   * Generate a MetricsConfig for all columns based on overrides, sortOrder, and defaultMode.
+   * Generate a MetricsConfig for all columns based on overrides, schema, and sort order.
+   *
    * @param props will be read for metrics overrides (write.metadata.metrics.column.*) and default
    *              (write.metadata.metrics.default)
+   * @param schema table schema
    * @param order sort order columns, will be promoted to truncate(16)
-   * @param defaultMode default, if not set by user property
    * @return metrics configuration
    */
-  private static MetricsConfig from(Map<String, String> props, SortOrder order, String defaultMode) {
+  private static MetricsConfig from(Map<String, String> props, Schema schema, SortOrder order) {
+    int maxInferredDefaultColumns = PropertyUtil.propertyAsInt(props,

Review Comment:
   Looks like invalid from, to will trigger exception in sublist(from, to) anyway , if from >= to, was thinking a precondition would make the message clearer.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #5215: Core: Update MetricsConfig to use a default for first 32 columns

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on code in PR #5215:
URL: https://github.com/apache/iceberg/pull/5215#discussion_r915474955


##########
core/src/main/java/org/apache/iceberg/MetricsConfig.java:
##########
@@ -136,50 +131,58 @@ public static MetricsConfig forPositionDelete(Table table) {
   }
 
   /**
-   * Generate a MetricsConfig for all columns based on overrides, sortOrder, and defaultMode.
+   * Generate a MetricsConfig for all columns based on overrides, schema, and sort order.
+   *
    * @param props will be read for metrics overrides (write.metadata.metrics.column.*) and default
    *              (write.metadata.metrics.default)
+   * @param schema table schema
    * @param order sort order columns, will be promoted to truncate(16)
-   * @param defaultMode default, if not set by user property
    * @return metrics configuration
    */
-  private static MetricsConfig from(Map<String, String> props, SortOrder order, String defaultMode) {
+  private static MetricsConfig from(Map<String, String> props, Schema schema, SortOrder order) {
+    int maxInferredDefaultColumns = PropertyUtil.propertyAsInt(props,
+        TableProperties.METRICS_MAX_INFERRED_COLUMN_DEFAULTS,
+        TableProperties.METRICS_MAX_INFERRED_COLUMN_DEFAULTS_DEFAULT);
     Map<String, MetricsMode> columnModes = Maps.newHashMap();
 
     // Handle user override of default mode
-    MetricsMode finalDefaultMode;
-    String defaultModeAsString = props.getOrDefault(DEFAULT_WRITE_METRICS_MODE, defaultMode);
-    try {
-      finalDefaultMode = MetricsModes.fromString(defaultModeAsString);
-    } catch (IllegalArgumentException err) {
-      // User override was invalid, log the error and use the default
-      LOG.warn("Ignoring invalid default metrics mode: {}", defaultModeAsString, err);
-      finalDefaultMode = MetricsModes.fromString(defaultMode);
+    MetricsMode defaultMode;
+    String configuredDefault = props.get(DEFAULT_WRITE_METRICS_MODE);
+    if (configuredDefault != null) {
+      // a user-configured default mode is applied for all columns
+      defaultMode = parseMode(configuredDefault, DEFAULT_MODE, "default");
+
+    } else if (schema == null || schema.columns().size() <= maxInferredDefaultColumns) {
+      // there are less than the inferred limit, so the default is used everywhere
+      defaultMode = DEFAULT_MODE;
+
+    } else {
+      // a inferred default mode is applied to the first few columns, up to the limit
+      Schema subSchema = new Schema(schema.columns().subList(0, maxInferredDefaultColumns));

Review Comment:
   Quick warning: `subList` would return a view on top of the original list and any subsequent changes would be reflected in both. This does not seem to cause issues here but I better mention.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #5215: Core: Update MetricsConfig to use a default for first 32 columns

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on code in PR #5215:
URL: https://github.com/apache/iceberg/pull/5215#discussion_r916298739


##########
core/src/main/java/org/apache/iceberg/TableProperties.java:
##########
@@ -267,6 +267,10 @@ private TableProperties() {
   public static final String METADATA_DELETE_AFTER_COMMIT_ENABLED = "write.metadata.delete-after-commit.enabled";
   public static final boolean METADATA_DELETE_AFTER_COMMIT_ENABLED_DEFAULT = false;
 
+  public static final String METRICS_MAX_INFERRED_COLUMN_DEFAULTS =

Review Comment:
   Let's just keep it as-is then. I don't think it is a big deal.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #5215: Core: Update MetricsConfig to use a default for first 32 columns

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on code in PR #5215:
URL: https://github.com/apache/iceberg/pull/5215#discussion_r915475413


##########
core/src/main/java/org/apache/iceberg/MetricsConfig.java:
##########
@@ -136,50 +131,58 @@ public static MetricsConfig forPositionDelete(Table table) {
   }
 
   /**
-   * Generate a MetricsConfig for all columns based on overrides, sortOrder, and defaultMode.
+   * Generate a MetricsConfig for all columns based on overrides, schema, and sort order.
+   *
    * @param props will be read for metrics overrides (write.metadata.metrics.column.*) and default
    *              (write.metadata.metrics.default)
+   * @param schema table schema
    * @param order sort order columns, will be promoted to truncate(16)
-   * @param defaultMode default, if not set by user property
    * @return metrics configuration
    */
-  private static MetricsConfig from(Map<String, String> props, SortOrder order, String defaultMode) {
+  private static MetricsConfig from(Map<String, String> props, Schema schema, SortOrder order) {
+    int maxInferredDefaultColumns = PropertyUtil.propertyAsInt(props,
+        TableProperties.METRICS_MAX_INFERRED_COLUMN_DEFAULTS,
+        TableProperties.METRICS_MAX_INFERRED_COLUMN_DEFAULTS_DEFAULT);
     Map<String, MetricsMode> columnModes = Maps.newHashMap();
 
     // Handle user override of default mode
-    MetricsMode finalDefaultMode;
-    String defaultModeAsString = props.getOrDefault(DEFAULT_WRITE_METRICS_MODE, defaultMode);
-    try {
-      finalDefaultMode = MetricsModes.fromString(defaultModeAsString);
-    } catch (IllegalArgumentException err) {
-      // User override was invalid, log the error and use the default
-      LOG.warn("Ignoring invalid default metrics mode: {}", defaultModeAsString, err);
-      finalDefaultMode = MetricsModes.fromString(defaultMode);
+    MetricsMode defaultMode;
+    String configuredDefault = props.get(DEFAULT_WRITE_METRICS_MODE);
+    if (configuredDefault != null) {
+      // a user-configured default mode is applied for all columns
+      defaultMode = parseMode(configuredDefault, DEFAULT_MODE, "default");
+
+    } else if (schema == null || schema.columns().size() <= maxInferredDefaultColumns) {
+      // there are less than the inferred limit, so the default is used everywhere
+      defaultMode = DEFAULT_MODE;
+
+    } else {
+      // a inferred default mode is applied to the first few columns, up to the limit
+      Schema subSchema = new Schema(schema.columns().subList(0, maxInferredDefaultColumns));

Review Comment:
   What if I have a highly nested schema? The number of stored metrics can be way more than 32 in that case?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue merged pull request #5215: Core: Update MetricsConfig to use a default for first 32 columns

Posted by GitBox <gi...@apache.org>.
rdblue merged PR #5215:
URL: https://github.com/apache/iceberg/pull/5215


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #5215: Core: Update MetricsConfig to use a default for first 32 columns

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on code in PR #5215:
URL: https://github.com/apache/iceberg/pull/5215#discussion_r915483283


##########
core/src/main/java/org/apache/iceberg/TableProperties.java:
##########
@@ -267,6 +267,10 @@ private TableProperties() {
   public static final String METADATA_DELETE_AFTER_COMMIT_ENABLED = "write.metadata.delete-after-commit.enabled";
   public static final boolean METADATA_DELETE_AFTER_COMMIT_ENABLED_DEFAULT = false;
 
+  public static final String METRICS_MAX_INFERRED_COLUMN_DEFAULTS =

Review Comment:
   The naming felt a little bit confusing to me too. After I read the explanation, it started to make more sense. However, I am still not sure `inferred` is the right word. Technically, we infer defaults for all columns (after this limit it just becomes `none`). To me, this is more about limiting the number of columns for which we persist metrics by default. Can the property name revolve around `persist` rather than `infer`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #5215: Core: Update MetricsConfig to use a default for first 32 columns

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #5215:
URL: https://github.com/apache/iceberg/pull/5215#discussion_r916012419


##########
core/src/main/java/org/apache/iceberg/MetricsConfig.java:
##########
@@ -195,6 +198,16 @@ private static MetricsMode sortedColumnDefaultMode(MetricsMode defaultMode) {
     }
   }
 
+  private static MetricsMode parseMode(String modeString, MetricsMode fallback, String context) {
+    try {
+      return MetricsModes.fromString(modeString);
+    } catch (IllegalArgumentException err) {
+      // User override was invalid, log the error and use the default
+      LOG.warn("Ignoring invalid metrics mode ({}): {}", context, modeString, err);

Review Comment:
   If the last argument is a Throwable, the logger does the right thing.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #5215: Core: Update MetricsConfig to use a default for first 32 columns

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #5215:
URL: https://github.com/apache/iceberg/pull/5215#discussion_r916029680


##########
core/src/main/java/org/apache/iceberg/TableProperties.java:
##########
@@ -267,6 +267,10 @@ private TableProperties() {
   public static final String METADATA_DELETE_AFTER_COMMIT_ENABLED = "write.metadata.delete-after-commit.enabled";
   public static final boolean METADATA_DELETE_AFTER_COMMIT_ENABLED_DEFAULT = false;
 
+  public static final String METRICS_MAX_INFERRED_COLUMN_DEFAULTS =

Review Comment:
   The problem with "persist" and similar is that it misses the distinction between an explicit default (when `write.metadata.metrics.default` is set) and an implicit default that comes from Iceberg. I think the right behavior is to preserve what we currently do, which is to use the explicit default for all columns. But that means that this property should obviously not apply to the explicit default. That's why I used "inferred default".
   
   What about changing this to include `unconfigured` or `missing`? Something like `missing-mode-limit`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #5215: Core: Update MetricsConfig to use a default for first 32 columns

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on code in PR #5215:
URL: https://github.com/apache/iceberg/pull/5215#discussion_r915473427


##########
core/src/main/java/org/apache/iceberg/MetricsConfig.java:
##########
@@ -136,50 +131,58 @@ public static MetricsConfig forPositionDelete(Table table) {
   }
 
   /**
-   * Generate a MetricsConfig for all columns based on overrides, sortOrder, and defaultMode.
+   * Generate a MetricsConfig for all columns based on overrides, schema, and sort order.
+   *
    * @param props will be read for metrics overrides (write.metadata.metrics.column.*) and default
    *              (write.metadata.metrics.default)
+   * @param schema table schema
    * @param order sort order columns, will be promoted to truncate(16)
-   * @param defaultMode default, if not set by user property
    * @return metrics configuration
    */
-  private static MetricsConfig from(Map<String, String> props, SortOrder order, String defaultMode) {
+  private static MetricsConfig from(Map<String, String> props, Schema schema, SortOrder order) {
+    int maxInferredDefaultColumns = PropertyUtil.propertyAsInt(props,
+        TableProperties.METRICS_MAX_INFERRED_COLUMN_DEFAULTS,
+        TableProperties.METRICS_MAX_INFERRED_COLUMN_DEFAULTS_DEFAULT);
     Map<String, MetricsMode> columnModes = Maps.newHashMap();
 
     // Handle user override of default mode
-    MetricsMode finalDefaultMode;
-    String defaultModeAsString = props.getOrDefault(DEFAULT_WRITE_METRICS_MODE, defaultMode);
-    try {
-      finalDefaultMode = MetricsModes.fromString(defaultModeAsString);
-    } catch (IllegalArgumentException err) {
-      // User override was invalid, log the error and use the default
-      LOG.warn("Ignoring invalid default metrics mode: {}", defaultModeAsString, err);
-      finalDefaultMode = MetricsModes.fromString(defaultMode);
+    MetricsMode defaultMode;
+    String configuredDefault = props.get(DEFAULT_WRITE_METRICS_MODE);
+    if (configuredDefault != null) {
+      // a user-configured default mode is applied for all columns
+      defaultMode = parseMode(configuredDefault, DEFAULT_MODE, "default");
+
+    } else if (schema == null || schema.columns().size() <= maxInferredDefaultColumns) {
+      // there are less than the inferred limit, so the default is used everywhere
+      defaultMode = DEFAULT_MODE;
+
+    } else {
+      // a inferred default mode is applied to the first few columns, up to the limit

Review Comment:
   nit: `a` -> `an`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #5215: Core: Update MetricsConfig to use a default for first 32 columns

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #5215:
URL: https://github.com/apache/iceberg/pull/5215#discussion_r916006986


##########
core/src/main/java/org/apache/iceberg/MetricsConfig.java:
##########
@@ -136,50 +131,58 @@ public static MetricsConfig forPositionDelete(Table table) {
   }
 
   /**
-   * Generate a MetricsConfig for all columns based on overrides, sortOrder, and defaultMode.
+   * Generate a MetricsConfig for all columns based on overrides, schema, and sort order.
+   *
    * @param props will be read for metrics overrides (write.metadata.metrics.column.*) and default
    *              (write.metadata.metrics.default)
+   * @param schema table schema
    * @param order sort order columns, will be promoted to truncate(16)
-   * @param defaultMode default, if not set by user property
    * @return metrics configuration
    */
-  private static MetricsConfig from(Map<String, String> props, SortOrder order, String defaultMode) {
+  private static MetricsConfig from(Map<String, String> props, Schema schema, SortOrder order) {
+    int maxInferredDefaultColumns = PropertyUtil.propertyAsInt(props,
+        TableProperties.METRICS_MAX_INFERRED_COLUMN_DEFAULTS,
+        TableProperties.METRICS_MAX_INFERRED_COLUMN_DEFAULTS_DEFAULT);
     Map<String, MetricsMode> columnModes = Maps.newHashMap();
 
     // Handle user override of default mode
-    MetricsMode finalDefaultMode;
-    String defaultModeAsString = props.getOrDefault(DEFAULT_WRITE_METRICS_MODE, defaultMode);
-    try {
-      finalDefaultMode = MetricsModes.fromString(defaultModeAsString);
-    } catch (IllegalArgumentException err) {
-      // User override was invalid, log the error and use the default
-      LOG.warn("Ignoring invalid default metrics mode: {}", defaultModeAsString, err);
-      finalDefaultMode = MetricsModes.fromString(defaultMode);
+    MetricsMode defaultMode;
+    String configuredDefault = props.get(DEFAULT_WRITE_METRICS_MODE);
+    if (configuredDefault != null) {
+      // a user-configured default mode is applied for all columns
+      defaultMode = parseMode(configuredDefault, DEFAULT_MODE, "default");
+
+    } else if (schema == null || schema.columns().size() <= maxInferredDefaultColumns) {
+      // there are less than the inferred limit, so the default is used everywhere
+      defaultMode = DEFAULT_MODE;
+
+    } else {
+      // a inferred default mode is applied to the first few columns, up to the limit
+      Schema subSchema = new Schema(schema.columns().subList(0, maxInferredDefaultColumns));

Review Comment:
   Yes, but this is the current behavior. We use the top-level columns for the current check.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #5215: Core: Update MetricsConfig to use a default for first 32 columns

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on code in PR #5215:
URL: https://github.com/apache/iceberg/pull/5215#discussion_r915472280


##########
core/src/main/java/org/apache/iceberg/MetricsConfig.java:
##########
@@ -136,50 +131,58 @@ public static MetricsConfig forPositionDelete(Table table) {
   }
 
   /**
-   * Generate a MetricsConfig for all columns based on overrides, sortOrder, and defaultMode.
+   * Generate a MetricsConfig for all columns based on overrides, schema, and sort order.
+   *
    * @param props will be read for metrics overrides (write.metadata.metrics.column.*) and default
    *              (write.metadata.metrics.default)
+   * @param schema table schema
    * @param order sort order columns, will be promoted to truncate(16)
-   * @param defaultMode default, if not set by user property
    * @return metrics configuration
    */
-  private static MetricsConfig from(Map<String, String> props, SortOrder order, String defaultMode) {
+  private static MetricsConfig from(Map<String, String> props, Schema schema, SortOrder order) {
+    int maxInferredDefaultColumns = PropertyUtil.propertyAsInt(props,
+        TableProperties.METRICS_MAX_INFERRED_COLUMN_DEFAULTS,

Review Comment:
   nit: I think we have static imports for all other table properties in this class.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a diff in pull request #5215: Core: Update MetricsConfig to use a default for first 32 columns

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on code in PR #5215:
URL: https://github.com/apache/iceberg/pull/5215#discussion_r916072054


##########
core/src/main/java/org/apache/iceberg/MetricsConfig.java:
##########
@@ -136,50 +131,58 @@ public static MetricsConfig forPositionDelete(Table table) {
   }
 
   /**
-   * Generate a MetricsConfig for all columns based on overrides, sortOrder, and defaultMode.
+   * Generate a MetricsConfig for all columns based on overrides, schema, and sort order.
+   *
    * @param props will be read for metrics overrides (write.metadata.metrics.column.*) and default
    *              (write.metadata.metrics.default)
+   * @param schema table schema
    * @param order sort order columns, will be promoted to truncate(16)
-   * @param defaultMode default, if not set by user property
    * @return metrics configuration
    */
-  private static MetricsConfig from(Map<String, String> props, SortOrder order, String defaultMode) {
+  private static MetricsConfig from(Map<String, String> props, Schema schema, SortOrder order) {
+    int maxInferredDefaultColumns = PropertyUtil.propertyAsInt(props,

Review Comment:
   Looks like invalid from, to will trigger exception in sublist(from, to) anyway, was thinking a precondition would make the message clearer.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] danielcweeks commented on a diff in pull request #5215: Core: Update MetricsConfig to use a default for first 32 columns

Posted by GitBox <gi...@apache.org>.
danielcweeks commented on code in PR #5215:
URL: https://github.com/apache/iceberg/pull/5215#discussion_r915265884


##########
core/src/main/java/org/apache/iceberg/TableProperties.java:
##########
@@ -267,6 +267,10 @@ private TableProperties() {
   public static final String METADATA_DELETE_AFTER_COMMIT_ENABLED = "write.metadata.delete-after-commit.enabled";
   public static final boolean METADATA_DELETE_AFTER_COMMIT_ENABLED_DEFAULT = false;
 
+  public static final String METRICS_MAX_INFERRED_COLUMN_DEFAULTS =

Review Comment:
   nit: `INFERRED` doesn't really make sense to me in this context.  We're not inferring, we're actually explicit here. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #5215: Core: Update MetricsConfig to use a default for first 32 columns

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #5215:
URL: https://github.com/apache/iceberg/pull/5215#discussion_r916019164


##########
core/src/main/java/org/apache/iceberg/MetricsConfig.java:
##########
@@ -136,50 +131,58 @@ public static MetricsConfig forPositionDelete(Table table) {
   }
 
   /**
-   * Generate a MetricsConfig for all columns based on overrides, sortOrder, and defaultMode.
+   * Generate a MetricsConfig for all columns based on overrides, schema, and sort order.
+   *
    * @param props will be read for metrics overrides (write.metadata.metrics.column.*) and default
    *              (write.metadata.metrics.default)
+   * @param schema table schema
    * @param order sort order columns, will be promoted to truncate(16)
-   * @param defaultMode default, if not set by user property
    * @return metrics configuration
    */
-  private static MetricsConfig from(Map<String, String> props, SortOrder order, String defaultMode) {
+  private static MetricsConfig from(Map<String, String> props, Schema schema, SortOrder order) {
+    int maxInferredDefaultColumns = PropertyUtil.propertyAsInt(props,
+        TableProperties.METRICS_MAX_INFERRED_COLUMN_DEFAULTS,

Review Comment:
   Fixed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #5215: Core: Update MetricsConfig to use a default for first 32 columns

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #5215:
URL: https://github.com/apache/iceberg/pull/5215#discussion_r915271111


##########
core/src/main/java/org/apache/iceberg/TableProperties.java:
##########
@@ -267,6 +267,10 @@ private TableProperties() {
   public static final String METADATA_DELETE_AFTER_COMMIT_ENABLED = "write.metadata.delete-after-commit.enabled";
   public static final boolean METADATA_DELETE_AFTER_COMMIT_ENABLED_DEFAULT = false;
 
+  public static final String METRICS_MAX_INFERRED_COLUMN_DEFAULTS =

Review Comment:
   I debated what to call this. The problem is that we are defaulting the default. You can explicitly set a default using `write.metadata.metrics.default`, and that will apply to all columns. If you don't _set_ a default we _infer_ one for you, but only for the first 32 columns (at least, after this PR). That's why I used "inferred" -- I thought it was better than `max-defaulted-default-columns`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a diff in pull request #5215: Core: Update MetricsConfig to use a default for first 32 columns

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on code in PR #5215:
URL: https://github.com/apache/iceberg/pull/5215#discussion_r915289730


##########
core/src/main/java/org/apache/iceberg/MetricsConfig.java:
##########
@@ -195,6 +198,16 @@ private static MetricsMode sortedColumnDefaultMode(MetricsMode defaultMode) {
     }
   }
 
+  private static MetricsMode parseMode(String modeString, MetricsMode fallback, String context) {
+    try {
+      return MetricsModes.fromString(modeString);
+    } catch (IllegalArgumentException err) {
+      // User override was invalid, log the error and use the default
+      LOG.warn("Ignoring invalid metrics mode ({}): {}", context, modeString, err);

Review Comment:
   This may not log err?  (It looks like the parameterized version of log.warn)



##########
core/src/main/java/org/apache/iceberg/TableProperties.java:
##########
@@ -267,6 +267,10 @@ private TableProperties() {
   public static final String METADATA_DELETE_AFTER_COMMIT_ENABLED = "write.metadata.delete-after-commit.enabled";
   public static final boolean METADATA_DELETE_AFTER_COMMIT_ENABLED_DEFAULT = false;
 
+  public static final String METRICS_MAX_INFERRED_COLUMN_DEFAULTS =

Review Comment:
   Agree with this table property, initially I had made one but it was taken out during the discussions.  Indeed it's a bit of a confusing config, but I dont see any other great option.



##########
core/src/main/java/org/apache/iceberg/MetricsConfig.java:
##########
@@ -136,50 +131,58 @@ public static MetricsConfig forPositionDelete(Table table) {
   }
 
   /**
-   * Generate a MetricsConfig for all columns based on overrides, sortOrder, and defaultMode.
+   * Generate a MetricsConfig for all columns based on overrides, schema, and sort order.
+   *
    * @param props will be read for metrics overrides (write.metadata.metrics.column.*) and default
    *              (write.metadata.metrics.default)
+   * @param schema table schema
    * @param order sort order columns, will be promoted to truncate(16)
-   * @param defaultMode default, if not set by user property
    * @return metrics configuration
    */
-  private static MetricsConfig from(Map<String, String> props, SortOrder order, String defaultMode) {
+  private static MetricsConfig from(Map<String, String> props, Schema schema, SortOrder order) {
+    int maxInferredDefaultColumns = PropertyUtil.propertyAsInt(props,

Review Comment:
   Add precondition that its >= 0?



##########
core/src/main/java/org/apache/iceberg/MetricsConfig.java:
##########
@@ -136,50 +131,58 @@ public static MetricsConfig forPositionDelete(Table table) {
   }
 
   /**
-   * Generate a MetricsConfig for all columns based on overrides, sortOrder, and defaultMode.
+   * Generate a MetricsConfig for all columns based on overrides, schema, and sort order.
+   *
    * @param props will be read for metrics overrides (write.metadata.metrics.column.*) and default
    *              (write.metadata.metrics.default)
+   * @param schema table schema
    * @param order sort order columns, will be promoted to truncate(16)
-   * @param defaultMode default, if not set by user property
    * @return metrics configuration
    */
-  private static MetricsConfig from(Map<String, String> props, SortOrder order, String defaultMode) {
+  private static MetricsConfig from(Map<String, String> props, Schema schema, SortOrder order) {
+    int maxInferredDefaultColumns = PropertyUtil.propertyAsInt(props,
+        TableProperties.METRICS_MAX_INFERRED_COLUMN_DEFAULTS,
+        TableProperties.METRICS_MAX_INFERRED_COLUMN_DEFAULTS_DEFAULT);
     Map<String, MetricsMode> columnModes = Maps.newHashMap();
 
     // Handle user override of default mode
-    MetricsMode finalDefaultMode;
-    String defaultModeAsString = props.getOrDefault(DEFAULT_WRITE_METRICS_MODE, defaultMode);
-    try {
-      finalDefaultMode = MetricsModes.fromString(defaultModeAsString);
-    } catch (IllegalArgumentException err) {
-      // User override was invalid, log the error and use the default
-      LOG.warn("Ignoring invalid default metrics mode: {}", defaultModeAsString, err);
-      finalDefaultMode = MetricsModes.fromString(defaultMode);
+    MetricsMode defaultMode;
+    String configuredDefault = props.get(DEFAULT_WRITE_METRICS_MODE);
+    if (configuredDefault != null) {
+      // a user-configured default mode is applied for all columns
+      defaultMode = parseMode(configuredDefault, DEFAULT_MODE, "default");
+
+    } else if (schema == null || schema.columns().size() <= maxInferredDefaultColumns) {
+      // there are less than the inferred limit, so the default is used everywhere
+      defaultMode = DEFAULT_MODE;
+
+    } else {
+      // a inferred default mode is applied to the first few columns, up to the limit
+      Schema subSchema = new Schema(schema.columns().subList(0, maxInferredDefaultColumns));

Review Comment:
   OK so we may have still more than 32 type metrics (if chosen 32 columns are nested for example).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #5215: Core: Update MetricsConfig to use a default for first 32 columns

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #5215:
URL: https://github.com/apache/iceberg/pull/5215#discussion_r915277699


##########
core/src/main/java/org/apache/iceberg/MetricsConfig.java:
##########
@@ -136,50 +131,58 @@ public static MetricsConfig forPositionDelete(Table table) {
   }
 
   /**
-   * Generate a MetricsConfig for all columns based on overrides, sortOrder, and defaultMode.
+   * Generate a MetricsConfig for all columns based on overrides, schema, and sort order.
+   *
    * @param props will be read for metrics overrides (write.metadata.metrics.column.*) and default
    *              (write.metadata.metrics.default)
+   * @param schema table schema
    * @param order sort order columns, will be promoted to truncate(16)
-   * @param defaultMode default, if not set by user property
    * @return metrics configuration
    */
-  private static MetricsConfig from(Map<String, String> props, SortOrder order, String defaultMode) {
+  private static MetricsConfig from(Map<String, String> props, Schema schema, SortOrder order) {
+    int maxInferredDefaultColumns = PropertyUtil.propertyAsInt(props,
+        TableProperties.METRICS_MAX_INFERRED_COLUMN_DEFAULTS,
+        TableProperties.METRICS_MAX_INFERRED_COLUMN_DEFAULTS_DEFAULT);
     Map<String, MetricsMode> columnModes = Maps.newHashMap();
 
     // Handle user override of default mode
-    MetricsMode finalDefaultMode;
-    String defaultModeAsString = props.getOrDefault(DEFAULT_WRITE_METRICS_MODE, defaultMode);
-    try {
-      finalDefaultMode = MetricsModes.fromString(defaultModeAsString);
-    } catch (IllegalArgumentException err) {
-      // User override was invalid, log the error and use the default
-      LOG.warn("Ignoring invalid default metrics mode: {}", defaultModeAsString, err);
-      finalDefaultMode = MetricsModes.fromString(defaultMode);
+    MetricsMode defaultMode;
+    String configuredDefault = props.get(DEFAULT_WRITE_METRICS_MODE);
+    if (configuredDefault != null) {
+      // a user-configured default mode is applied for all columns
+      defaultMode = parseMode(configuredDefault, DEFAULT_MODE, "default");
+
+    } else if (schema == null || schema.columns().size() <= maxInferredDefaultColumns) {
+      // there are less than the inferred limit, so the default is used everywhere
+      defaultMode = DEFAULT_MODE;
+
+    } else {
+      // a inferred default mode is applied to the first few columns, up to the limit
+      Schema subSchema = new Schema(schema.columns().subList(0, maxInferredDefaultColumns));

Review Comment:
   No, we keep the first 32 columns, but we add a metrics mode for each sub-type. That's why the loop below iterates over `TypeUtil.getProjectedIds(subSchema)`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #5215: Core: Update MetricsConfig to use a default for first 32 columns

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on code in PR #5215:
URL: https://github.com/apache/iceberg/pull/5215#discussion_r915475413


##########
core/src/main/java/org/apache/iceberg/MetricsConfig.java:
##########
@@ -136,50 +131,58 @@ public static MetricsConfig forPositionDelete(Table table) {
   }
 
   /**
-   * Generate a MetricsConfig for all columns based on overrides, sortOrder, and defaultMode.
+   * Generate a MetricsConfig for all columns based on overrides, schema, and sort order.
+   *
    * @param props will be read for metrics overrides (write.metadata.metrics.column.*) and default
    *              (write.metadata.metrics.default)
+   * @param schema table schema
    * @param order sort order columns, will be promoted to truncate(16)
-   * @param defaultMode default, if not set by user property
    * @return metrics configuration
    */
-  private static MetricsConfig from(Map<String, String> props, SortOrder order, String defaultMode) {
+  private static MetricsConfig from(Map<String, String> props, Schema schema, SortOrder order) {
+    int maxInferredDefaultColumns = PropertyUtil.propertyAsInt(props,
+        TableProperties.METRICS_MAX_INFERRED_COLUMN_DEFAULTS,
+        TableProperties.METRICS_MAX_INFERRED_COLUMN_DEFAULTS_DEFAULT);
     Map<String, MetricsMode> columnModes = Maps.newHashMap();
 
     // Handle user override of default mode
-    MetricsMode finalDefaultMode;
-    String defaultModeAsString = props.getOrDefault(DEFAULT_WRITE_METRICS_MODE, defaultMode);
-    try {
-      finalDefaultMode = MetricsModes.fromString(defaultModeAsString);
-    } catch (IllegalArgumentException err) {
-      // User override was invalid, log the error and use the default
-      LOG.warn("Ignoring invalid default metrics mode: {}", defaultModeAsString, err);
-      finalDefaultMode = MetricsModes.fromString(defaultMode);
+    MetricsMode defaultMode;
+    String configuredDefault = props.get(DEFAULT_WRITE_METRICS_MODE);
+    if (configuredDefault != null) {
+      // a user-configured default mode is applied for all columns
+      defaultMode = parseMode(configuredDefault, DEFAULT_MODE, "default");
+
+    } else if (schema == null || schema.columns().size() <= maxInferredDefaultColumns) {
+      // there are less than the inferred limit, so the default is used everywhere
+      defaultMode = DEFAULT_MODE;
+
+    } else {
+      // a inferred default mode is applied to the first few columns, up to the limit
+      Schema subSchema = new Schema(schema.columns().subList(0, maxInferredDefaultColumns));

Review Comment:
   What if I have highly nested schema? The number of stored metrics can be way more than 32 in that case?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #5215: Core: Update MetricsConfig to use a default for first 32 columns

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #5215:
URL: https://github.com/apache/iceberg/pull/5215#discussion_r916019375


##########
core/src/main/java/org/apache/iceberg/MetricsConfig.java:
##########
@@ -136,50 +131,58 @@ public static MetricsConfig forPositionDelete(Table table) {
   }
 
   /**
-   * Generate a MetricsConfig for all columns based on overrides, sortOrder, and defaultMode.
+   * Generate a MetricsConfig for all columns based on overrides, schema, and sort order.
+   *
    * @param props will be read for metrics overrides (write.metadata.metrics.column.*) and default
    *              (write.metadata.metrics.default)
+   * @param schema table schema
    * @param order sort order columns, will be promoted to truncate(16)
-   * @param defaultMode default, if not set by user property
    * @return metrics configuration
    */
-  private static MetricsConfig from(Map<String, String> props, SortOrder order, String defaultMode) {
+  private static MetricsConfig from(Map<String, String> props, Schema schema, SortOrder order) {
+    int maxInferredDefaultColumns = PropertyUtil.propertyAsInt(props,
+        TableProperties.METRICS_MAX_INFERRED_COLUMN_DEFAULTS,
+        TableProperties.METRICS_MAX_INFERRED_COLUMN_DEFAULTS_DEFAULT);
     Map<String, MetricsMode> columnModes = Maps.newHashMap();
 
     // Handle user override of default mode
-    MetricsMode finalDefaultMode;
-    String defaultModeAsString = props.getOrDefault(DEFAULT_WRITE_METRICS_MODE, defaultMode);
-    try {
-      finalDefaultMode = MetricsModes.fromString(defaultModeAsString);
-    } catch (IllegalArgumentException err) {
-      // User override was invalid, log the error and use the default
-      LOG.warn("Ignoring invalid default metrics mode: {}", defaultModeAsString, err);
-      finalDefaultMode = MetricsModes.fromString(defaultMode);
+    MetricsMode defaultMode;
+    String configuredDefault = props.get(DEFAULT_WRITE_METRICS_MODE);
+    if (configuredDefault != null) {
+      // a user-configured default mode is applied for all columns
+      defaultMode = parseMode(configuredDefault, DEFAULT_MODE, "default");
+
+    } else if (schema == null || schema.columns().size() <= maxInferredDefaultColumns) {
+      // there are less than the inferred limit, so the default is used everywhere
+      defaultMode = DEFAULT_MODE;
+
+    } else {
+      // a inferred default mode is applied to the first few columns, up to the limit

Review Comment:
   Fixed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #5215: Core: Update MetricsConfig to use a default for first 32 columns

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on code in PR #5215:
URL: https://github.com/apache/iceberg/pull/5215#discussion_r915476686


##########
core/src/main/java/org/apache/iceberg/MetricsConfig.java:
##########
@@ -136,50 +131,58 @@ public static MetricsConfig forPositionDelete(Table table) {
   }
 
   /**
-   * Generate a MetricsConfig for all columns based on overrides, sortOrder, and defaultMode.
+   * Generate a MetricsConfig for all columns based on overrides, schema, and sort order.
+   *
    * @param props will be read for metrics overrides (write.metadata.metrics.column.*) and default
    *              (write.metadata.metrics.default)
+   * @param schema table schema
    * @param order sort order columns, will be promoted to truncate(16)
-   * @param defaultMode default, if not set by user property
    * @return metrics configuration
    */
-  private static MetricsConfig from(Map<String, String> props, SortOrder order, String defaultMode) {
+  private static MetricsConfig from(Map<String, String> props, Schema schema, SortOrder order) {
+    int maxInferredDefaultColumns = PropertyUtil.propertyAsInt(props,
+        TableProperties.METRICS_MAX_INFERRED_COLUMN_DEFAULTS,
+        TableProperties.METRICS_MAX_INFERRED_COLUMN_DEFAULTS_DEFAULT);
     Map<String, MetricsMode> columnModes = Maps.newHashMap();
 
     // Handle user override of default mode
-    MetricsMode finalDefaultMode;
-    String defaultModeAsString = props.getOrDefault(DEFAULT_WRITE_METRICS_MODE, defaultMode);
-    try {
-      finalDefaultMode = MetricsModes.fromString(defaultModeAsString);
-    } catch (IllegalArgumentException err) {
-      // User override was invalid, log the error and use the default
-      LOG.warn("Ignoring invalid default metrics mode: {}", defaultModeAsString, err);
-      finalDefaultMode = MetricsModes.fromString(defaultMode);
+    MetricsMode defaultMode;
+    String configuredDefault = props.get(DEFAULT_WRITE_METRICS_MODE);
+    if (configuredDefault != null) {
+      // a user-configured default mode is applied for all columns
+      defaultMode = parseMode(configuredDefault, DEFAULT_MODE, "default");
+
+    } else if (schema == null || schema.columns().size() <= maxInferredDefaultColumns) {
+      // there are less than the inferred limit, so the default is used everywhere
+      defaultMode = DEFAULT_MODE;
+
+    } else {
+      // a inferred default mode is applied to the first few columns, up to the limit
+      Schema subSchema = new Schema(schema.columns().subList(0, maxInferredDefaultColumns));
+      for (Integer id : TypeUtil.getProjectedIds(subSchema)) {
+        columnModes.put(subSchema.findColumnName(id), DEFAULT_MODE);
+      }
+
+      // all other columns don't use metrics
+      defaultMode = MetricsModes.None.get();
     }
 
     // First set sorted column with sorted column default (can be overridden by user)
-    MetricsMode sortedColDefaultMode = sortedColumnDefaultMode(finalDefaultMode);
+    MetricsMode sortedColDefaultMode = sortedColumnDefaultMode(defaultMode);
     Set<String> sortedCols = SortOrderUtil.orderPreservingSortedColumns(order);
     sortedCols.forEach(sc -> columnModes.put(sc, sortedColDefaultMode));
 
     // Handle user overrides of defaults
-    MetricsMode finalDefaultModeVal = finalDefaultMode;
-    props.keySet().stream()
-        .filter(key -> key.startsWith(METRICS_MODE_COLUMN_CONF_PREFIX))
-        .forEach(key -> {
-          String columnAlias = key.replaceFirst(METRICS_MODE_COLUMN_CONF_PREFIX, "");
-          MetricsMode mode;
-          try {
-            mode = MetricsModes.fromString(props.get(key));
-          } catch (IllegalArgumentException err) {
-            // Mode was invalid, log the error and use the default
-            LOG.warn("Ignoring invalid metrics mode for column {}: {}", columnAlias, props.get(key), err);
-            mode = finalDefaultModeVal;
-          }
-          columnModes.put(columnAlias, mode);
-        });
+    MetricsMode finalDefaultModeVal = defaultMode;

Review Comment:
   Do we still need `finalDefaultModeVal`? Can we use `defaultMode` instead now?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #5215: Core: Update MetricsConfig to use a default for first 32 columns

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #5215:
URL: https://github.com/apache/iceberg/pull/5215#discussion_r916019713


##########
core/src/main/java/org/apache/iceberg/MetricsConfig.java:
##########
@@ -136,50 +131,58 @@ public static MetricsConfig forPositionDelete(Table table) {
   }
 
   /**
-   * Generate a MetricsConfig for all columns based on overrides, sortOrder, and defaultMode.
+   * Generate a MetricsConfig for all columns based on overrides, schema, and sort order.
+   *
    * @param props will be read for metrics overrides (write.metadata.metrics.column.*) and default
    *              (write.metadata.metrics.default)
+   * @param schema table schema
    * @param order sort order columns, will be promoted to truncate(16)
-   * @param defaultMode default, if not set by user property
    * @return metrics configuration
    */
-  private static MetricsConfig from(Map<String, String> props, SortOrder order, String defaultMode) {
+  private static MetricsConfig from(Map<String, String> props, Schema schema, SortOrder order) {
+    int maxInferredDefaultColumns = PropertyUtil.propertyAsInt(props,
+        TableProperties.METRICS_MAX_INFERRED_COLUMN_DEFAULTS,
+        TableProperties.METRICS_MAX_INFERRED_COLUMN_DEFAULTS_DEFAULT);
     Map<String, MetricsMode> columnModes = Maps.newHashMap();
 
     // Handle user override of default mode
-    MetricsMode finalDefaultMode;
-    String defaultModeAsString = props.getOrDefault(DEFAULT_WRITE_METRICS_MODE, defaultMode);
-    try {
-      finalDefaultMode = MetricsModes.fromString(defaultModeAsString);
-    } catch (IllegalArgumentException err) {
-      // User override was invalid, log the error and use the default
-      LOG.warn("Ignoring invalid default metrics mode: {}", defaultModeAsString, err);
-      finalDefaultMode = MetricsModes.fromString(defaultMode);
+    MetricsMode defaultMode;
+    String configuredDefault = props.get(DEFAULT_WRITE_METRICS_MODE);
+    if (configuredDefault != null) {
+      // a user-configured default mode is applied for all columns
+      defaultMode = parseMode(configuredDefault, DEFAULT_MODE, "default");
+
+    } else if (schema == null || schema.columns().size() <= maxInferredDefaultColumns) {
+      // there are less than the inferred limit, so the default is used everywhere
+      defaultMode = DEFAULT_MODE;
+
+    } else {
+      // a inferred default mode is applied to the first few columns, up to the limit
+      Schema subSchema = new Schema(schema.columns().subList(0, maxInferredDefaultColumns));

Review Comment:
   This is a temporary schema so it should be fine.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #5215: Core: Update MetricsConfig to use a default for first 32 columns

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #5215:
URL: https://github.com/apache/iceberg/pull/5215#discussion_r916018763


##########
core/src/main/java/org/apache/iceberg/MetricsConfig.java:
##########
@@ -136,50 +131,58 @@ public static MetricsConfig forPositionDelete(Table table) {
   }
 
   /**
-   * Generate a MetricsConfig for all columns based on overrides, sortOrder, and defaultMode.
+   * Generate a MetricsConfig for all columns based on overrides, schema, and sort order.
+   *
    * @param props will be read for metrics overrides (write.metadata.metrics.column.*) and default
    *              (write.metadata.metrics.default)
+   * @param schema table schema
    * @param order sort order columns, will be promoted to truncate(16)
-   * @param defaultMode default, if not set by user property
    * @return metrics configuration
    */
-  private static MetricsConfig from(Map<String, String> props, SortOrder order, String defaultMode) {
+  private static MetricsConfig from(Map<String, String> props, Schema schema, SortOrder order) {
+    int maxInferredDefaultColumns = PropertyUtil.propertyAsInt(props,

Review Comment:
   I'm adding a warning, but I don't think that we should fail because of an invalid value here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on pull request #5215: Core: Update MetricsConfig to use a default for first 32 columns

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on PR #5215:
URL: https://github.com/apache/iceberg/pull/5215#issuecomment-1180914739

   I think that's an interesting enhancement idea, if I understand, to make write.metadata.metrics.max-inferred-column-defaults to also accept first or last.  An organization might be able to set last(32) as default on the catalog level (ie, https://github.com/apache/iceberg/pull/4011).  On the other hand, sort and partition columns are already auto-promoted to have stats today, curious do you see a good impact of having stats on other columns?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on pull request #5215: Core: Update MetricsConfig to use a default for first 32 columns

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on PR #5215:
URL: https://github.com/apache/iceberg/pull/5215#issuecomment-1182228206

   Thinking about it, I guess there are always correlation in the data, maybe some column values are still able to be used for filtering if you sort by/partition by correlated columns.  
   
   So, it makes sense to me to add more support like first/last, given we already are advertising this option.  Will leave to everyone's opinion if the default needs to be changed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #5215: Core: Update MetricsConfig to use a default for first 32 columns

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #5215:
URL: https://github.com/apache/iceberg/pull/5215#discussion_r916019988


##########
core/src/main/java/org/apache/iceberg/MetricsConfig.java:
##########
@@ -136,50 +131,58 @@ public static MetricsConfig forPositionDelete(Table table) {
   }
 
   /**
-   * Generate a MetricsConfig for all columns based on overrides, sortOrder, and defaultMode.
+   * Generate a MetricsConfig for all columns based on overrides, schema, and sort order.
+   *
    * @param props will be read for metrics overrides (write.metadata.metrics.column.*) and default
    *              (write.metadata.metrics.default)
+   * @param schema table schema
    * @param order sort order columns, will be promoted to truncate(16)
-   * @param defaultMode default, if not set by user property
    * @return metrics configuration
    */
-  private static MetricsConfig from(Map<String, String> props, SortOrder order, String defaultMode) {
+  private static MetricsConfig from(Map<String, String> props, Schema schema, SortOrder order) {
+    int maxInferredDefaultColumns = PropertyUtil.propertyAsInt(props,
+        TableProperties.METRICS_MAX_INFERRED_COLUMN_DEFAULTS,
+        TableProperties.METRICS_MAX_INFERRED_COLUMN_DEFAULTS_DEFAULT);
     Map<String, MetricsMode> columnModes = Maps.newHashMap();
 
     // Handle user override of default mode
-    MetricsMode finalDefaultMode;
-    String defaultModeAsString = props.getOrDefault(DEFAULT_WRITE_METRICS_MODE, defaultMode);
-    try {
-      finalDefaultMode = MetricsModes.fromString(defaultModeAsString);
-    } catch (IllegalArgumentException err) {
-      // User override was invalid, log the error and use the default
-      LOG.warn("Ignoring invalid default metrics mode: {}", defaultModeAsString, err);
-      finalDefaultMode = MetricsModes.fromString(defaultMode);
+    MetricsMode defaultMode;
+    String configuredDefault = props.get(DEFAULT_WRITE_METRICS_MODE);
+    if (configuredDefault != null) {
+      // a user-configured default mode is applied for all columns
+      defaultMode = parseMode(configuredDefault, DEFAULT_MODE, "default");
+
+    } else if (schema == null || schema.columns().size() <= maxInferredDefaultColumns) {
+      // there are less than the inferred limit, so the default is used everywhere
+      defaultMode = DEFAULT_MODE;
+
+    } else {
+      // a inferred default mode is applied to the first few columns, up to the limit
+      Schema subSchema = new Schema(schema.columns().subList(0, maxInferredDefaultColumns));
+      for (Integer id : TypeUtil.getProjectedIds(subSchema)) {
+        columnModes.put(subSchema.findColumnName(id), DEFAULT_MODE);
+      }
+
+      // all other columns don't use metrics
+      defaultMode = MetricsModes.None.get();
     }
 
     // First set sorted column with sorted column default (can be overridden by user)
-    MetricsMode sortedColDefaultMode = sortedColumnDefaultMode(finalDefaultMode);
+    MetricsMode sortedColDefaultMode = sortedColumnDefaultMode(defaultMode);
     Set<String> sortedCols = SortOrderUtil.orderPreservingSortedColumns(order);
     sortedCols.forEach(sc -> columnModes.put(sc, sortedColDefaultMode));
 
     // Handle user overrides of defaults
-    MetricsMode finalDefaultModeVal = finalDefaultMode;
-    props.keySet().stream()
-        .filter(key -> key.startsWith(METRICS_MODE_COLUMN_CONF_PREFIX))
-        .forEach(key -> {
-          String columnAlias = key.replaceFirst(METRICS_MODE_COLUMN_CONF_PREFIX, "");
-          MetricsMode mode;
-          try {
-            mode = MetricsModes.fromString(props.get(key));
-          } catch (IllegalArgumentException err) {
-            // Mode was invalid, log the error and use the default
-            LOG.warn("Ignoring invalid metrics mode for column {}: {}", columnAlias, props.get(key), err);
-            mode = finalDefaultModeVal;
-          }
-          columnModes.put(columnAlias, mode);
-        });
+    MetricsMode finalDefaultModeVal = defaultMode;

Review Comment:
   Fixed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #5215: Core: Update MetricsConfig to use a default for first 32 columns

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #5215:
URL: https://github.com/apache/iceberg/pull/5215#discussion_r916156848


##########
core/src/main/java/org/apache/iceberg/MetricsConfig.java:
##########
@@ -136,50 +131,58 @@ public static MetricsConfig forPositionDelete(Table table) {
   }
 
   /**
-   * Generate a MetricsConfig for all columns based on overrides, sortOrder, and defaultMode.
+   * Generate a MetricsConfig for all columns based on overrides, schema, and sort order.
+   *
    * @param props will be read for metrics overrides (write.metadata.metrics.column.*) and default
    *              (write.metadata.metrics.default)
+   * @param schema table schema
    * @param order sort order columns, will be promoted to truncate(16)
-   * @param defaultMode default, if not set by user property
    * @return metrics configuration
    */
-  private static MetricsConfig from(Map<String, String> props, SortOrder order, String defaultMode) {
+  private static MetricsConfig from(Map<String, String> props, Schema schema, SortOrder order) {
+    int maxInferredDefaultColumns = PropertyUtil.propertyAsInt(props,

Review Comment:
   I added a warning and set it to the default if it is invalid.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] bryanck commented on pull request #5215: Core: Update MetricsConfig to use a default for first 32 columns

Posted by GitBox <gi...@apache.org>.
bryanck commented on PR #5215:
URL: https://github.com/apache/iceberg/pull/5215#issuecomment-1179546876

   This is already merged, but I thought I'd leave feedback anyway, in case it is useful.
   
   As a data engineer, many tables I have maintained have more than 32 top-level columns. Often columns used for partitioning, sorting, auditing, and so forth are put at the end of a table schema, but these are some of the most frequently used in filtering. Also, additional columns are generally added at the end of the schema. The assumption that the first columns in a table schema are the most important to have stats on is not always accurate. 
   
   In testing 0.14, I ran into missing stats on tables, which was confusing and difficult to debug. I image those new to Iceberg and who are most likely to leave settings at the default, it would be even more confusing.
   
   I feel a more sensible default is to leave it the same as previous Iceberg versions (i.e. no column limit). Then an option could be introduced to limit the number of columns so those that prefer can set it on their tables, e.g. "first(32)". I feel it is better to err on the side of too many stats and dial that back as needed.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] danielcweeks commented on a diff in pull request #5215: Core: Update MetricsConfig to use a default for first 32 columns

Posted by GitBox <gi...@apache.org>.
danielcweeks commented on code in PR #5215:
URL: https://github.com/apache/iceberg/pull/5215#discussion_r915276052


##########
core/src/main/java/org/apache/iceberg/MetricsConfig.java:
##########
@@ -136,50 +131,58 @@ public static MetricsConfig forPositionDelete(Table table) {
   }
 
   /**
-   * Generate a MetricsConfig for all columns based on overrides, sortOrder, and defaultMode.
+   * Generate a MetricsConfig for all columns based on overrides, schema, and sort order.
+   *
    * @param props will be read for metrics overrides (write.metadata.metrics.column.*) and default
    *              (write.metadata.metrics.default)
+   * @param schema table schema
    * @param order sort order columns, will be promoted to truncate(16)
-   * @param defaultMode default, if not set by user property
    * @return metrics configuration
    */
-  private static MetricsConfig from(Map<String, String> props, SortOrder order, String defaultMode) {
+  private static MetricsConfig from(Map<String, String> props, Schema schema, SortOrder order) {
+    int maxInferredDefaultColumns = PropertyUtil.propertyAsInt(props,
+        TableProperties.METRICS_MAX_INFERRED_COLUMN_DEFAULTS,
+        TableProperties.METRICS_MAX_INFERRED_COLUMN_DEFAULTS_DEFAULT);
     Map<String, MetricsMode> columnModes = Maps.newHashMap();
 
     // Handle user override of default mode
-    MetricsMode finalDefaultMode;
-    String defaultModeAsString = props.getOrDefault(DEFAULT_WRITE_METRICS_MODE, defaultMode);
-    try {
-      finalDefaultMode = MetricsModes.fromString(defaultModeAsString);
-    } catch (IllegalArgumentException err) {
-      // User override was invalid, log the error and use the default
-      LOG.warn("Ignoring invalid default metrics mode: {}", defaultModeAsString, err);
-      finalDefaultMode = MetricsModes.fromString(defaultMode);
+    MetricsMode defaultMode;
+    String configuredDefault = props.get(DEFAULT_WRITE_METRICS_MODE);
+    if (configuredDefault != null) {
+      // a user-configured default mode is applied for all columns
+      defaultMode = parseMode(configuredDefault, DEFAULT_MODE, "default");
+
+    } else if (schema == null || schema.columns().size() <= maxInferredDefaultColumns) {
+      // there are less than the inferred limit, so the default is used everywhere
+      defaultMode = DEFAULT_MODE;
+
+    } else {
+      // a inferred default mode is applied to the first few columns, up to the limit
+      Schema subSchema = new Schema(schema.columns().subList(0, maxInferredDefaultColumns));

Review Comment:
   Wouldn't this still cause an issue with metrics for nested types?  Once we go cross the column threshold we will actually drop nested stats and only include top-level column stats.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] findinpath commented on a diff in pull request #5215: Core: Update MetricsConfig to use a default for first 32 columns

Posted by GitBox <gi...@apache.org>.
findinpath commented on code in PR #5215:
URL: https://github.com/apache/iceberg/pull/5215#discussion_r966644642


##########
core/src/main/java/org/apache/iceberg/MetricsConfig.java:
##########
@@ -45,9 +49,8 @@ public final class MetricsConfig implements Serializable {
   private static final Joiner DOT = Joiner.on('.');
 
   // Disable metrics by default for wide tables to prevent excessive metadata

Review Comment:
   @danielcweeks  Is this a left-over comment?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org