You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@paimon.apache.org by "shidayang (via GitHub)" <gi...@apache.org> on 2023/11/23 09:28:30 UTC

[PR] [Core]: Support value filter push down in max file level [incubator-paimon]

shidayang opened a new pull request, #2381:
URL: https://github.com/apache/incubator-paimon/pull/2381

   ### Purpose
   Linked issue: close #1731 
   
   ### Tests
   org.apache.paimon.operation.KeyValueFileStoreScanTest#testWithValueFilterForMaxLevel
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@paimon.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] [Core]: Support value filter push down in max file level for keyed table [incubator-paimon]

Posted by "JingsongLi (via GitHub)" <gi...@apache.org>.
JingsongLi commented on code in PR #2381:
URL: https://github.com/apache/incubator-paimon/pull/2381#discussion_r1403923699


##########
paimon-core/src/main/java/org/apache/paimon/operation/KeyValueFileStoreScan.java:
##########
@@ -97,23 +103,50 @@ protected boolean filterByStats(ManifestEntry entry) {
 
     /** Note: Keep this thread-safe. */
     @Override
-    protected boolean filterWholeBucketByStats(List<ManifestEntry> entries) {
+    protected List<ManifestEntry> filterWholeBucketByStats(List<ManifestEntry> entries) {
+        if (valueFilter == null) {
+            return entries;
+        }
+
         // entries come from the same bucket, if any of it doesn't meet the request, we could filter
         // the bucket.
-        if (valueFilter != null) {
-            for (ManifestEntry entry : entries) {
-                if (valueFilter.test(
-                        entry.file().rowCount(),
-                        entry.file()
-                                .valueStats()
-                                .fields(
-                                        fieldValueStatsConverters.getOrCreate(
-                                                entry.file().schemaId()),
-                                        entry.file().rowCount()))) {
-                    return true;
-                }
+        if (canSkipWholeBucket(entries)) {
+            return Collections.emptyList();
+        }
+
+        // If there is no custom sequence field, we can filter the whole bucket by the max level.
+        if (!hasCustomSequenceField) {

Review Comment:
   You can add some ITCase to cover this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@paimon.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] [Core]: Support value filter push down in max file level for keyed table [incubator-paimon]

Posted by "JingsongLi (via GitHub)" <gi...@apache.org>.
JingsongLi closed pull request #2381: [Core]: Support value filter push down in max file level for keyed table
URL: https://github.com/apache/incubator-paimon/pull/2381


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@paimon.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] [Core]: Support value filter push down in max file level for keyed table [incubator-paimon]

Posted by "JingsongLi (via GitHub)" <gi...@apache.org>.
JingsongLi commented on code in PR #2381:
URL: https://github.com/apache/incubator-paimon/pull/2381#discussion_r1403923255


##########
paimon-core/src/main/java/org/apache/paimon/operation/KeyValueFileStoreScan.java:
##########
@@ -97,23 +103,50 @@ protected boolean filterByStats(ManifestEntry entry) {
 
     /** Note: Keep this thread-safe. */
     @Override
-    protected boolean filterWholeBucketByStats(List<ManifestEntry> entries) {
+    protected List<ManifestEntry> filterWholeBucketByStats(List<ManifestEntry> entries) {
+        if (valueFilter == null) {
+            return entries;
+        }
+
         // entries come from the same bucket, if any of it doesn't meet the request, we could filter
         // the bucket.
-        if (valueFilter != null) {
-            for (ManifestEntry entry : entries) {
-                if (valueFilter.test(
-                        entry.file().rowCount(),
-                        entry.file()
-                                .valueStats()
-                                .fields(
-                                        fieldValueStatsConverters.getOrCreate(
-                                                entry.file().schemaId()),
-                                        entry.file().rowCount()))) {
-                    return true;
-                }
+        if (canSkipWholeBucket(entries)) {
+            return Collections.emptyList();
+        }
+
+        // If there is no custom sequence field, we can filter the whole bucket by the max level.
+        if (!hasCustomSequenceField) {

Review Comment:
   I think the bool should be more complicated.
   1. minSequenceNumber in non-max levels should larger than maxSequenceNumber in max levels.
   2. Merge Engine is deduplicate.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@paimon.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] [Core]: Support value filter push down in max file level for keyed table [incubator-paimon]

Posted by "JingsongLi (via GitHub)" <gi...@apache.org>.
JingsongLi commented on code in PR #2381:
URL: https://github.com/apache/incubator-paimon/pull/2381#discussion_r1403923410


##########
paimon-core/src/main/java/org/apache/paimon/operation/KeyValueFileStoreScan.java:
##########
@@ -97,23 +103,50 @@ protected boolean filterByStats(ManifestEntry entry) {
 
     /** Note: Keep this thread-safe. */
     @Override
-    protected boolean filterWholeBucketByStats(List<ManifestEntry> entries) {
+    protected List<ManifestEntry> filterWholeBucketByStats(List<ManifestEntry> entries) {
+        if (valueFilter == null) {
+            return entries;
+        }
+
         // entries come from the same bucket, if any of it doesn't meet the request, we could filter
         // the bucket.
-        if (valueFilter != null) {
-            for (ManifestEntry entry : entries) {
-                if (valueFilter.test(
-                        entry.file().rowCount(),
-                        entry.file()
-                                .valueStats()
-                                .fields(
-                                        fieldValueStatsConverters.getOrCreate(
-                                                entry.file().schemaId()),
-                                        entry.file().rowCount()))) {
-                    return true;
-                }
+        if (canSkipWholeBucket(entries)) {
+            return Collections.emptyList();
+        }
+
+        // If there is no custom sequence field, we can filter the whole bucket by the max level.
+        if (!hasCustomSequenceField) {
+            @SuppressWarnings("OptionalGetWithoutIsPresent")
+            int maxLevel =
+                    entries.stream().mapToInt(entry -> entry.file().level()).max().getAsInt();
+
+            // Level 0 can't be filtered.
+            if (maxLevel > 0) {
+                return entries.stream()
+                        .filter(entry -> entry.file().level() < maxLevel || applyValueFilter(entry))
+                        .collect(Collectors.toList());
+            }
+        }
+
+        return entries;
+    }
+
+    private boolean canSkipWholeBucket(List<ManifestEntry> entries) {
+        for (ManifestEntry entry : entries) {
+            if (applyValueFilter(entry)) {

Review Comment:
   Can we cache this value, and reuse this value for max level filter?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@paimon.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] [Core]: Support value filter push down in max file level for keyed table [incubator-paimon]

Posted by "JingsongLi (via GitHub)" <gi...@apache.org>.
JingsongLi commented on PR #2381:
URL: https://github.com/apache/incubator-paimon/pull/2381#issuecomment-1975143432

   Hi @shidayang .
   I still  think it is very dangarous to do filter on max level. Alternatively, we can determine whether the current bucket does not need to be merged in order to perform a filter pushdown. See #2922
   
   You can re-open this if you have more questions.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@paimon.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org