You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@iceberg.apache.org by "aokolnychyi (via GitHub)" <gi...@apache.org> on 2023/02/08 21:41:31 UTC

[GitHub] [iceberg] aokolnychyi opened a new pull request, #6776: Spark 3.3: Improve log messages in scans

aokolnychyi opened a new pull request, #6776:
URL: https://github.com/apache/iceberg/pull/6776

   While debugging storage-partitions joins, I realized our log messages in scans don't include table names. It makes hard to match log messages with tables if queries touch multiple tables. This PR changes slightly improves log messages in scans.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] aokolnychyi commented on pull request #6776: Spark 3.3: Improve log messages in scans

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.

aokolnychyi commented on PR #6776:
URL: https://github.com/apache/iceberg/pull/6776#issuecomment-1423279323

   cc @RussellSpitzer @rdblue @szehon-ho @flyrain @karuppayya @amogh-jahagirdar @singhpk234


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] amogh-jahagirdar commented on a diff in pull request #6776: Spark 3.3: Improve log messages in scans

Posted by "amogh-jahagirdar (via GitHub)" <gi...@apache.org>.

amogh-jahagirdar commented on code in PR #6776:
URL: https://github.com/apache/iceberg/pull/6776#discussion_r1100787732


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java:
##########
@@ -140,7 +140,7 @@ protected Statistics estimateStatistics(Snapshot snapshot) {
     // estimate stats using snapshot summary only for partitioned tables
     // (metadata tables are unpartitioned)
     if (!table.spec().isUnpartitioned() && filterExpressions.isEmpty()) {
-      LOG.debug("Using table metadata to estimate table statistics");
+      LOG.debug("Using snapshot metadata to estimate statistics for table {}", table.name());

Review Comment:
   Agreed, it'll help to log the snapshot here. With all the new time travel options being added (like branching and tagging), in case of any issues it'll be useful to be able to debug just by looking at the logs.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] szehon-ho commented on a diff in pull request #6776: Spark 3.3: Improve log messages in scans

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.

szehon-ho commented on code in PR #6776:
URL: https://github.com/apache/iceberg/pull/6776#discussion_r1101840634


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkCopyOnWriteScan.java:
##########
@@ -123,8 +127,17 @@ public void filter(Filter[] filters) {
               tasks().stream()
                   .filter(file -> fileLocations.contains(file.file().path().toString()))
                   .collect(Collectors.toList());
+
+          LOG.info(

Review Comment:
   Oh I see, because most files have date/timestamp in name, and those get sanitized.  Makes sense to me.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6776: Spark 3.3: Improve log messages in scans

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.

aokolnychyi commented on code in PR #6776:
URL: https://github.com/apache/iceberg/pull/6776#discussion_r1100712967


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkBatchQueryScan.java:
##########
@@ -132,11 +132,6 @@ public void filter(Filter[] filters) {
         evaluatorsBySpecId.put(spec.specId(), inclusive);
       }
 
-      LOG.info(

Review Comment:
   I don't know why I added it in the first place. Seems fairly redundant.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6776: Spark 3.3: Improve log messages in scans

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.

aokolnychyi commented on code in PR #6776:
URL: https://github.com/apache/iceberg/pull/6776#discussion_r1100981238


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkCopyOnWriteScan.java:
##########
@@ -123,8 +127,17 @@ public void filter(Filter[] filters) {
               tasks().stream()
                   .filter(file -> fileLocations.contains(file.file().path().toString()))
                   .collect(Collectors.toList());
+
+          LOG.info(

Review Comment:
   Unlike `SparkBatchQueryScan`, this is a special type of filtering where we know it is an IN filter with file locations. If we sanitize the expression, logging it will be useless. I added the number of files to the log, though.
   
   Let me know what you think.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] amogh-jahagirdar commented on a diff in pull request #6776: Spark 3.3: Improve log messages in scans

Posted by "amogh-jahagirdar (via GitHub)" <gi...@apache.org>.

amogh-jahagirdar commented on code in PR #6776:
URL: https://github.com/apache/iceberg/pull/6776#discussion_r1100788235


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkCopyOnWriteScan.java:
##########
@@ -123,8 +127,17 @@ public void filter(Filter[] filters) {
               tasks().stream()
                   .filter(file -> fileLocations.contains(file.file().path().toString()))
                   .collect(Collectors.toList());
+
+          LOG.info(
+              "{}/{} tasks for table {} matched runtime file filter",
+              filteredTasks.size(),
+              tasks().size(),
+              table().name());
+
           resetTasks(filteredTasks);
         }
+      } else {
+        LOG.warn("Unsupported runtime filter {}", filter);
       }

Review Comment:
   Just wondering should this really be a warn? Previously this seemed like an acceptable case , just wondering if it'll confuse users who look at the logs.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6776: Spark 3.3: Improve log messages in scans

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.

aokolnychyi commented on code in PR #6776:
URL: https://github.com/apache/iceberg/pull/6776#discussion_r1100985955


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java:
##########
@@ -140,7 +140,7 @@ protected Statistics estimateStatistics(Snapshot snapshot) {
     // estimate stats using snapshot summary only for partitioned tables
     // (metadata tables are unpartitioned)
     if (!table.spec().isUnpartitioned() && filterExpressions.isEmpty()) {
-      LOG.debug("Using table metadata to estimate table statistics");
+      LOG.debug("Using snapshot metadata to estimate statistics for table {}", table.name());

Review Comment:
   Makes sense, added.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] szehon-ho commented on a diff in pull request #6776: Spark 3.3: Improve log messages in scans

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.

szehon-ho commented on code in PR #6776:
URL: https://github.com/apache/iceberg/pull/6776#discussion_r1100776061


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java:
##########
@@ -140,7 +140,7 @@ protected Statistics estimateStatistics(Snapshot snapshot) {
     // estimate stats using snapshot summary only for partitioned tables
     // (metadata tables are unpartitioned)
     if (!table.spec().isUnpartitioned() && filterExpressions.isEmpty()) {
-      LOG.debug("Using table metadata to estimate table statistics");
+      LOG.debug("Using snapshot metadata to estimate statistics for table {}", table.name());

Review Comment:
   Should we also log snapshot id?



##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkCopyOnWriteScan.java:
##########
@@ -123,8 +127,17 @@ public void filter(Filter[] filters) {
               tasks().stream()
                   .filter(file -> fileLocations.contains(file.file().path().toString()))
                   .collect(Collectors.toList());
+
+          LOG.info(

Review Comment:
   What do you think put filter in this log as well?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] amogh-jahagirdar commented on a diff in pull request #6776: Spark 3.3: Improve log messages in scans

Posted by "amogh-jahagirdar (via GitHub)" <gi...@apache.org>.

amogh-jahagirdar commented on code in PR #6776:
URL: https://github.com/apache/iceberg/pull/6776#discussion_r1100993287


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkCopyOnWriteScan.java:
##########
@@ -123,8 +127,17 @@ public void filter(Filter[] filters) {
               tasks().stream()
                   .filter(file -> fileLocations.contains(file.file().path().toString()))
                   .collect(Collectors.toList());
+
+          LOG.info(
+              "{}/{} tasks for table {} matched runtime file filter",
+              filteredTasks.size(),
+              tasks().size(),
+              table().name());
+
           resetTasks(filteredTasks);
         }
+      } else {
+        LOG.warn("Unsupported runtime filter {}", filter);
       }

Review Comment:
   Got it thanks for the explanation



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] amogh-jahagirdar commented on a diff in pull request #6776: Spark 3.3: Improve log messages in scans

Posted by "amogh-jahagirdar (via GitHub)" <gi...@apache.org>.

amogh-jahagirdar commented on code in PR #6776:
URL: https://github.com/apache/iceberg/pull/6776#discussion_r1100788235


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkCopyOnWriteScan.java:
##########
@@ -123,8 +127,17 @@ public void filter(Filter[] filters) {
               tasks().stream()
                   .filter(file -> fileLocations.contains(file.file().path().toString()))
                   .collect(Collectors.toList());
+
+          LOG.info(
+              "{}/{} tasks for table {} matched runtime file filter",
+              filteredTasks.size(),
+              tasks().size(),
+              table().name());
+
           resetTasks(filteredTasks);
         }
+      } else {
+        LOG.warn("Unsupported runtime filter {}", filter);
       }

Review Comment:
   Should this be a warn? Previously this seemed like an acceptable case , just wondering if it'll confuse users who look at the logs.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6776: Spark 3.3: Improve log messages in scans

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.

aokolnychyi commented on code in PR #6776:
URL: https://github.com/apache/iceberg/pull/6776#discussion_r1100985599


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkCopyOnWriteScan.java:
##########
@@ -123,8 +127,17 @@ public void filter(Filter[] filters) {
               tasks().stream()
                   .filter(file -> fileLocations.contains(file.file().path().toString()))
                   .collect(Collectors.toList());
+
+          LOG.info(
+              "{}/{} tasks for table {} matched runtime file filter",
+              filteredTasks.size(),
+              tasks().size(),
+              table().name());
+
           resetTasks(filteredTasks);
         }
+      } else {
+        LOG.warn("Unsupported runtime filter {}", filter);

Review Comment:
   Well, I wasn't sure how useful it will be in this context. The table name does not impact whether a filter can be converted, so I matched what we have in other places that convert filters.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6776: Spark 3.3: Improve log messages in scans

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.

aokolnychyi commented on code in PR #6776:
URL: https://github.com/apache/iceberg/pull/6776#discussion_r1100982048


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkCopyOnWriteScan.java:
##########
@@ -123,8 +127,17 @@ public void filter(Filter[] filters) {
               tasks().stream()
                   .filter(file -> fileLocations.contains(file.file().path().toString()))
                   .collect(Collectors.toList());
+
+          LOG.info(
+              "{}/{} tasks for table {} matched runtime file filter",
+              filteredTasks.size(),
+              tasks().size(),
+              table().name());
+
           resetTasks(filteredTasks);
         }
+      } else {
+        LOG.warn("Unsupported runtime filter {}", filter);
       }

Review Comment:
   I overlooked this in the initial implementation. There should be no use case where we can't convert a runtime filter for the copy-on-write scan. If it happens, we should warn the user.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] aokolnychyi merged pull request #6776: Spark 3.3: Improve log messages in scans

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.

aokolnychyi merged PR #6776:
URL: https://github.com/apache/iceberg/pull/6776


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] singhpk234 commented on a diff in pull request #6776: Spark 3.3: Improve log messages in scans

Posted by "singhpk234 (via GitHub)" <gi...@apache.org>.

singhpk234 commented on code in PR #6776:
URL: https://github.com/apache/iceberg/pull/6776#discussion_r1100972549


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkCopyOnWriteScan.java:
##########
@@ -123,8 +127,17 @@ public void filter(Filter[] filters) {
               tasks().stream()
                   .filter(file -> fileLocations.contains(file.file().path().toString()))
                   .collect(Collectors.toList());
+
+          LOG.info(
+              "{}/{} tasks for table {} matched runtime file filter",
+              filteredTasks.size(),
+              tasks().size(),
+              table().name());
+
           resetTasks(filteredTasks);
         }
+      } else {
+        LOG.warn("Unsupported runtime filter {}", filter);

Review Comment:
   should we include concerned table name as well in log



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] aokolnychyi commented on pull request #6776: Spark 3.3: Improve log messages in scans

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.

aokolnychyi commented on PR #6776:
URL: https://github.com/apache/iceberg/pull/6776#issuecomment-1424623094

   Thanks for reviewing, @szehon-ho @amogh-jahagirdar @singhpk234!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org