Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/01/09 10:28:33 UTC

[GitHub] [iceberg] izchen opened a new pull request #3863: Spark 3.0: Add Spark UI metrics for merge into DynamicFileFilterExec

izchen opened a new pull request #3863:
URL: https://github.com/apache/iceberg/pull/3863


   **Add Spark UI metrics for merge into DynamicFileFilterExec:**
   
   1. total number of files
   2. number of files hit
   3. file hit rate %
   
   **Spark UI:**
   
   ![FileFilterMetric](https://user-images.githubusercontent.com/67534059/148678316-159bf323-3c9c-42b6-9feb-7cbb4d44c788.jpg)
   
   
   
   
   



[GitHub] [iceberg] rdblue commented on a change in pull request #3863: Spark 3.0: Add Spark UI metrics for merge into DynamicFileFilterExec

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3863:
URL: https://github.com/apache/iceberg/pull/3863#discussion_r780839944



##########
File path: spark/v3.0/spark-extensions/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DynamicFileFilterExec.scala
##########
@@ -41,6 +43,11 @@ abstract class DynamicFileFilterExecBase(
   @transient
   override lazy val references: AttributeSet = AttributeSet(fileFilterExec.output)
 
+  override lazy val metrics = Map(
+    "totalFiles" -> SQLMetrics.createMetric(sparkContext, "total number of files"),
+    "hitFiles" -> SQLMetrics.createMetric(sparkContext, "number of files hit"),
+    "fileHitRate" -> SQLMetrics.createMetric(sparkContext, "file hit rate %"))

Review comment:
       I'm not sure that a derived metric is very valuable. I'd probably remove this and just keep the candidate and matching data files.





[GitHub] [iceberg] izchen commented on a change in pull request #3863: Spark 3.0: Add Spark UI metrics for merge into DynamicFileFilterExec

Posted by GitBox <gi...@apache.org>.
izchen commented on a change in pull request #3863:
URL: https://github.com/apache/iceberg/pull/3863#discussion_r781809578



##########
File path: spark/v3.0/spark-extensions/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DynamicFileFilterExec.scala
##########
@@ -41,6 +43,11 @@ abstract class DynamicFileFilterExecBase(
   @transient
   override lazy val references: AttributeSet = AttributeSet(fileFilterExec.output)
 
+  override lazy val metrics = Map(
+    "totalFiles" -> SQLMetrics.createMetric(sparkContext, "total number of files"),
+    "hitFiles" -> SQLMetrics.createMetric(sparkContext, "number of files hit"),
+    "fileHitRate" -> SQLMetrics.createMetric(sparkContext, "file hit rate %"))

Review comment:
       done

##########
File path: spark/v3.0/spark/src/main/java/org/apache/spark/sql/connector/iceberg/read/SupportsFileFilter.java
##########
@@ -32,5 +32,22 @@
    *
    * @param locations file locations
    */
-  void filterFiles(Set<String> locations);
+  SupportsFileFilter.FileFilterMetric filterFiles(Set<String> locations);
+
+  class FileFilterMetric {
+    private int totalFilesNumber;
+    private int hitFilesNumber;
+
+    public FileFilterMetric(int totalFilesNumber, int hitFilesNumber) {
+      this.totalFilesNumber = totalFilesNumber;
+      this.hitFilesNumber = hitFilesNumber;
+    }
+
+    public int getTotalFilesNumber() {
+      return totalFilesNumber;
+    }
+    public int getHitFilesNumber() {

Review comment:
       done

##########
File path: spark/v3.0/spark/src/main/java/org/apache/spark/sql/connector/iceberg/read/SupportsFileFilter.java
##########
@@ -32,5 +32,22 @@
    *
    * @param locations file locations
    */
-  void filterFiles(Set<String> locations);
+  SupportsFileFilter.FileFilterMetric filterFiles(Set<String> locations);
+
+  class FileFilterMetric {
+    private int totalFilesNumber;
+    private int hitFilesNumber;
+
+    public FileFilterMetric(int totalFilesNumber, int hitFilesNumber) {
+      this.totalFilesNumber = totalFilesNumber;
+      this.hitFilesNumber = hitFilesNumber;
+    }
+
+    public int getTotalFilesNumber() {

Review comment:
       done





[GitHub] [iceberg] rdblue commented on pull request #3863: Spark 3.0: Add Spark UI metrics for merge into DynamicFileFilterExec

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #3863:
URL: https://github.com/apache/iceberg/pull/3863#issuecomment-1010212711


   Do you also want to open a PR to port this to Spark 3.1?



[GitHub] [iceberg] rdblue commented on a change in pull request #3863: Spark 3.0: Add Spark UI metrics for merge into DynamicFileFilterExec

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3863:
URL: https://github.com/apache/iceberg/pull/3863#discussion_r780840504



##########
File path: spark/v3.0/spark/src/main/java/org/apache/spark/sql/connector/iceberg/read/SupportsFileFilter.java
##########
@@ -32,5 +32,22 @@
    *
    * @param locations file locations
    */
-  void filterFiles(Set<String> locations);
+  SupportsFileFilter.FileFilterMetric filterFiles(Set<String> locations);

Review comment:
       What about using a `Pair<Integer, Integer>` instead of defining a new class?





[GitHub] [iceberg] rdblue commented on a change in pull request #3863: Spark 3.0: Add Spark UI metrics for merge into DynamicFileFilterExec

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3863:
URL: https://github.com/apache/iceberg/pull/3863#discussion_r780840017



##########
File path: spark/v3.0/spark-extensions/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DynamicFileFilterExec.scala
##########
@@ -78,6 +85,16 @@ abstract class DynamicFileFilterExecBase(
   override def simpleString(maxFields: Int): String = {
     s"DynamicFileFilterExec${truncatedString(output, "[", ", ", "]", maxFields)}"
   }
+
+  def postFileFilterMetric(totalFilesNumber: Int, hitFilesNumber: Int): Unit = {
+    longMetric("totalFiles").set(totalFilesNumber)
+    longMetric("hitFiles").set(hitFilesNumber)
+    if (totalFilesNumber > 0) {
+      longMetric("fileHitRate").set(hitFilesNumber * 100 / totalFilesNumber)
+    }
+    val executionId = sparkContext.getLocalProperty(SQLExecution.EXECUTION_ID_KEY)
+    SQLMetrics.postDriverMetricUpdates(sparkContext, executionId, metrics.values.toSeq)

Review comment:
       Can you explain this a bit more? What's going on with this line, sending updates to the driver?





[GitHub] [iceberg] rdblue merged pull request #3863: Spark 3.0: Add Spark UI metrics for merge into DynamicFileFilterExec

Posted by GitBox <gi...@apache.org>.
rdblue merged pull request #3863:
URL: https://github.com/apache/iceberg/pull/3863


   



[GitHub] [iceberg] rdblue commented on a change in pull request #3863: Spark 3.0: Add Spark UI metrics for merge into DynamicFileFilterExec

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3863:
URL: https://github.com/apache/iceberg/pull/3863#discussion_r780840180



##########
File path: spark/v3.0/spark/src/main/java/org/apache/spark/sql/connector/iceberg/read/SupportsFileFilter.java
##########
@@ -32,5 +32,22 @@
    *
    * @param locations file locations
    */
-  void filterFiles(Set<String> locations);
+  SupportsFileFilter.FileFilterMetric filterFiles(Set<String> locations);
+
+  class FileFilterMetric {
+    private int totalFilesNumber;
+    private int hitFilesNumber;
+
+    public FileFilterMetric(int totalFilesNumber, int hitFilesNumber) {
+      this.totalFilesNumber = totalFilesNumber;
+      this.hitFilesNumber = hitFilesNumber;
+    }
+
+    public int getTotalFilesNumber() {
+      return totalFilesNumber;
+    }
+    public int getHitFilesNumber() {

Review comment:
       Style: there should be a newline separating methods.





[GitHub] [iceberg] izchen commented on a change in pull request #3863: Spark 3.0: Add Spark UI metrics for merge into DynamicFileFilterExec

Posted by GitBox <gi...@apache.org>.
izchen commented on a change in pull request #3863:
URL: https://github.com/apache/iceberg/pull/3863#discussion_r781813118



##########
File path: spark/v3.0/spark-extensions/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DynamicFileFilterExec.scala
##########
@@ -78,6 +85,16 @@ abstract class DynamicFileFilterExecBase(
   override def simpleString(maxFields: Int): String = {
     s"DynamicFileFilterExec${truncatedString(output, "[", ", ", "]", maxFields)}"
   }
+
+  def postFileFilterMetric(totalFilesNumber: Int, hitFilesNumber: Int): Unit = {
+    longMetric("totalFiles").set(totalFilesNumber)
+    longMetric("hitFiles").set(hitFilesNumber)
+    if (totalFilesNumber > 0) {
+      longMetric("fileHitRate").set(hitFilesNumber * 100 / totalFilesNumber)
+    }
+    val executionId = sparkContext.getLocalProperty(SQLExecution.EXECUTION_ID_KEY)
+    SQLMetrics.postDriverMetricUpdates(sparkContext, executionId, metrics.values.toSeq)

Review comment:
       > Updates metrics based on the driver side value. This is useful for certain metrics that are only updated on the driver, e.g. subquery execution time, or number of files.
   
   That quote is from the `SQLMetrics.postDriverMetricUpdates` doc. These counts are only known on the driver (the file filter runs there), so they have to be posted explicitly for the UI to pick them up. The metrics of BroadcastExchangeExec are updated in the same way:
   
   https://github.com/apache/spark/blob/3d2fde5242c8989688c176b8ed5eb0bff5e1f17f/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala#L169-L174
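   
   For illustration, here is a minimal, self-contained sketch of that pattern (the object, method, and parameter names are placeholders, not the PR's code, which does this inside `postFileFilterMetric` above):
   
```scala
package org.apache.spark.sql.execution.datasources.v2

import org.apache.spark.SparkContext
import org.apache.spark.sql.execution.SQLExecution
import org.apache.spark.sql.execution.metric.{SQLMetric, SQLMetrics}

// Placeholder helper showing the driver-side metric update pattern.
object DriverMetricUpdateSketch {
  def post(sc: SparkContext, metrics: Map[String, SQLMetric], totalFiles: Int, hitFiles: Int): Unit = {
    // The counts are computed on the driver, so set the metric values directly.
    metrics("totalFiles").set(totalFiles)
    metrics("hitFiles").set(hitFiles)
    // Driver-only metrics never flow back through task accumulators, so they must be
    // posted explicitly against the current SQL execution id before the UI shows them.
    val executionId = sc.getLocalProperty(SQLExecution.EXECUTION_ID_KEY)
    SQLMetrics.postDriverMetricUpdates(sc, executionId, metrics.values.toSeq)
  }
}
```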





[GitHub] [iceberg] izchen commented on a change in pull request #3863: Spark 3.0: Add Spark UI metrics for merge into DynamicFileFilterExec

Posted by GitBox <gi...@apache.org>.
izchen commented on a change in pull request #3863:
URL: https://github.com/apache/iceberg/pull/3863#discussion_r781809317



##########
File path: spark/v3.0/spark-extensions/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DynamicFileFilterExec.scala
##########
@@ -41,6 +43,11 @@ abstract class DynamicFileFilterExecBase(
   @transient
   override lazy val references: AttributeSet = AttributeSet(fileFilterExec.output)
 
+  override lazy val metrics = Map(
+    "totalFiles" -> SQLMetrics.createMetric(sparkContext, "total number of files"),

Review comment:
       I changed the strings to "candidate files" and "matching files".
   
   I'd rather not capitalize them, though, so they stay consistent with the format of Spark's other metrics:
   
   ![metric_formats](https://user-images.githubusercontent.com/67534059/148897414-22cc321a-fcd9-4662-8f1a-683ed590e693.jpg)
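   
   For illustration, a minimal sketch of the renamed declarations (the `candidateFiles`/`matchingFiles` map keys and the wrapper object are placeholders; only the lower-case display strings come from this change, which match Spark's own labels such as "number of output rows"):
   
```scala
package org.apache.spark.sql.execution.datasources.v2

import org.apache.spark.SparkContext
import org.apache.spark.sql.execution.metric.{SQLMetric, SQLMetrics}

// Placeholder wrapper; in the PR these entries live in the exec node's `metrics` map.
object FileFilterMetricsSketch {
  def create(sparkContext: SparkContext): Map[String, SQLMetric] = Map(
    "candidateFiles" -> SQLMetrics.createMetric(sparkContext, "candidate files"),
    "matchingFiles" -> SQLMetrics.createMetric(sparkContext, "matching files"))
}
```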
   





[GitHub] [iceberg] rdblue commented on a change in pull request #3863: Spark 3.0: Add Spark UI metrics for merge into DynamicFileFilterExec

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3863:
URL: https://github.com/apache/iceberg/pull/3863#discussion_r780840210



##########
File path: spark/v3.0/spark/src/main/java/org/apache/spark/sql/connector/iceberg/read/SupportsFileFilter.java
##########
@@ -32,5 +32,22 @@
    *
    * @param locations file locations
    */
-  void filterFiles(Set<String> locations);
+  SupportsFileFilter.FileFilterMetric filterFiles(Set<String> locations);
+
+  class FileFilterMetric {
+    private int totalFilesNumber;
+    private int hitFilesNumber;
+
+    public FileFilterMetric(int totalFilesNumber, int hitFilesNumber) {
+      this.totalFilesNumber = totalFilesNumber;
+      this.hitFilesNumber = hitFilesNumber;
+    }
+
+    public int getTotalFilesNumber() {

Review comment:
       Style: Iceberg does not use `get` in method names.





[GitHub] [iceberg] rdblue commented on pull request #3863: Spark 3.0: Add Spark UI metrics for merge into DynamicFileFilterExec

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #3863:
URL: https://github.com/apache/iceberg/pull/3863#issuecomment-1008437317


   Thanks, @izchen! This looks really useful.



[GitHub] [iceberg] rdblue commented on a change in pull request #3863: Spark 3.0: Add Spark UI metrics for merge into DynamicFileFilterExec

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3863:
URL: https://github.com/apache/iceberg/pull/3863#discussion_r780840320



##########
File path: spark/v3.0/spark/src/test/java/org/apache/iceberg/spark/SparkTestBase.java
##########
@@ -199,6 +205,36 @@ protected void withSQLConf(Map<String, String> conf, Action action) {
     }
   }
 
+  private Set<SQLExecutionUIData> currentExecutionUIDataSet() throws TimeoutException {
+    spark.sparkContext().listenerBus().waitUntilEmpty(10000);
+    return Sets.newHashSet(JavaConverters.seqAsJavaList(spark.sharedState().statusStore().executionsList()));
+  }
+
+  protected void checkMetrics(Callable sparkCallable, Map<String, String> expectedMetrics) throws Exception {
+    Set<SQLExecutionUIData> originalExecutions = currentExecutionUIDataSet();
+    sparkCallable.call();
+    Set<SQLExecutionUIData> currentExecutions = currentExecutionUIDataSet();
+    currentExecutions.removeAll(originalExecutions);
+    Assert.assertEquals(currentExecutions.size(), 1);
+    SQLExecutionUIData currentExecution = currentExecutions.stream().findFirst().get();
+
+    Map<Long, String> metricsIds = Maps.newHashMap();
+    JavaConverters.seqAsJavaList(currentExecution.metrics()).stream().forEach(metricsDeclaration -> {
+      if (expectedMetrics.containsKey(metricsDeclaration.name())) {
+        metricsIds.put(metricsDeclaration.accumulatorId(), metricsDeclaration.name());
+      }
+    });
+    Assert.assertEquals("Expected metric name not match",
+        expectedMetrics.keySet(), Sets.newHashSet(metricsIds.values()));
+
+    Map<Object, String> currentMetrics =
+        JavaConverters.mapAsJavaMap(spark.sharedState().statusStore().executionMetrics(currentExecution.executionId()))
+            .entrySet().stream()
+            .filter(x -> metricsIds.containsKey(x.getKey()))
+            .collect(Collectors.toMap(x -> metricsIds.get(x.getKey()), x -> x.getValue()));

Review comment:
       Style: continuation lines should be indented 4 spaces (2 indents).





[GitHub] [iceberg] rdblue commented on a change in pull request #3863: Spark 3.0: Add Spark UI metrics for merge into DynamicFileFilterExec

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3863:
URL: https://github.com/apache/iceberg/pull/3863#discussion_r780839796



##########
File path: spark/v3.0/spark-extensions/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DynamicFileFilterExec.scala
##########
@@ -41,6 +43,11 @@ abstract class DynamicFileFilterExecBase(
   @transient
   override lazy val references: AttributeSet = AttributeSet(fileFilterExec.output)
 
+  override lazy val metrics = Map(
+    "totalFiles" -> SQLMetrics.createMetric(sparkContext, "total number of files"),

Review comment:
       Can you make these display names a bit shorter and capitalize them? I think they should be "Candidate files" and "Matching files". Those are descriptive, but short.





[GitHub] [iceberg] izchen commented on a change in pull request #3863: Spark 3.0: Add Spark UI metrics for merge into DynamicFileFilterExec

Posted by GitBox <gi...@apache.org>.
izchen commented on a change in pull request #3863:
URL: https://github.com/apache/iceberg/pull/3863#discussion_r781813874



##########
File path: spark/v3.0/spark/src/main/java/org/apache/spark/sql/connector/iceberg/read/SupportsFileFilter.java
##########
@@ -32,5 +32,22 @@
    *
    * @param locations file locations
    */
-  void filterFiles(Set<String> locations);
+  SupportsFileFilter.FileFilterMetric filterFiles(Set<String> locations);

Review comment:
       I think the current approach may be more readable.





[GitHub] [iceberg] rdblue commented on pull request #3863: Spark 3.0: Add Spark UI metrics for merge into DynamicFileFilterExec

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #3863:
URL: https://github.com/apache/iceberg/pull/3863#issuecomment-1010212324


   Thanks, @izchen! This is a really helpful addition.

