You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/05/27 05:41:56 UTC

[GitHub] [spark] huaxingao opened a new pull request, #36696: [SPARK-39312][SQL] Use parquet native In predicate for in filter push down

huaxingao opened a new pull request, #36696:
URL: https://github.com/apache/spark/pull/36696

   
   
   ### What changes were proposed in this pull request?
   Since now Parquet supports its native in predicate, we want to simplify the current In filter pushdown using Parquet's native in predicate.
   
   
   ### Why are the changes needed?
   code enhancement
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   modify the existing tests
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] huaxingao commented on a diff in pull request #36696: [SPARK-39312][SQL] Use parquet native In predicate for in filter push down

Posted by GitBox <gi...@apache.org>.
huaxingao commented on code in PR #36696:
URL: https://github.com/apache/spark/pull/36696#discussion_r884339449


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala:
##########
@@ -725,26 +757,8 @@ class ParquetFilters(
 
       case sources.In(name, values) if pushDownInFilterThreshold > 0 && values.nonEmpty &&
           canMakeFilterOn(name, values.head) =>
-        val fieldType = nameToParquetField(name).fieldType
-        val fieldNames = nameToParquetField(name).fieldNames
-        if (values.length <= pushDownInFilterThreshold) {
-          values.distinct.flatMap { v =>
-            makeEq.lift(fieldType).map(_(fieldNames, v))
-          }.reduceLeftOption(FilterApi.or)

Review Comment:
   I will change back to use equal Or equal if the number of values in the `In` set <=  `pushDownInFilterThreshold`. Only use parquet `In` predicate if the number of values in the `In` set >  `pushDownInFilterThreshold`. The existing tests coverage should be good, just need to change the expected type back to equal Or equal when the number of values in the `In` set <=  `pushDownInFilterThreshold`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on a diff in pull request #36696: [SPARK-39312][SQL] Use parquet native In predicate for in filter push down

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on code in PR #36696:
URL: https://github.com/apache/spark/pull/36696#discussion_r884550646


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala:
##########
@@ -440,94 +440,126 @@ class ParquetFilters(
   }
 
   private val makeInPredicate:
-    PartialFunction[ParquetSchemaType,
-      (Array[String], Array[Any], ParquetStatistics[_]) => FilterPredicate] = {
+    PartialFunction[ParquetSchemaType, (Array[String], Any) => FilterPredicate] = {
+    case ParquetBooleanType =>
+      (n: Array[String], v: Any) =>
+        val values = Option(v).map(_.asInstanceOf[Array[Object]]).orNull

Review Comment:
   If `orNull` is hit, `for (value <- values) { ... }`  will throw NPE
   
   
   
   ornull 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] huaxingao commented on pull request #36696: [SPARK-39312][SQL] Use parquet native In predicate for in filter push down

Posted by GitBox <gi...@apache.org>.
huaxingao commented on PR #36696:
URL: https://github.com/apache/spark/pull/36696#issuecomment-1146911640

   @wangyum is helping me testing this because he has lots of `in/notIn` test cases. I will change this PR to draft for now. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wangyum commented on a diff in pull request #36696: [SPARK-39312][SQL] Use parquet native In predicate for in filter push down

Posted by GitBox <gi...@apache.org>.
wangyum commented on code in PR #36696:
URL: https://github.com/apache/spark/pull/36696#discussion_r883404190


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala:
##########
@@ -725,26 +757,8 @@ class ParquetFilters(
 
       case sources.In(name, values) if pushDownInFilterThreshold > 0 && values.nonEmpty &&
           canMakeFilterOn(name, values.head) =>
-        val fieldType = nameToParquetField(name).fieldType
-        val fieldNames = nameToParquetField(name).fieldNames
-        if (values.length <= pushDownInFilterThreshold) {
-          values.distinct.flatMap { v =>
-            makeEq.lift(fieldType).map(_(fieldNames, v))
-          }.reduceLeftOption(FilterApi.or)

Review Comment:
   It seems a regression, For example:
   ```
   In(id, 1, 3, 5, 100000)
   ```
   
   Before this PR, the pushded predicate is: id = 1 or id = 3 or id = 5 or id = 100000,
   the current pushded predicate is: id >= 1 and id <= 100000



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #36696: [SPARK-39312][SQL] Use parquet native In predicate for in filter push down

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on PR #36696:
URL: https://github.com/apache/spark/pull/36696#issuecomment-1163943147

   What is the next step for us, @huaxingao ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] huaxingao commented on pull request #36696: [SPARK-39312][SQL] Use parquet native In predicate for in filter push down

Posted by GitBox <gi...@apache.org>.
huaxingao commented on PR #36696:
URL: https://github.com/apache/spark/pull/36696#issuecomment-1163962503

   @dongjoon-hyun Thanks for the ping. Give me a couple of more days. I want to check one more time before mark this ready for review.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] huaxingao commented on a diff in pull request #36696: [SPARK-39312][SQL] Use parquet native In predicate for in filter push down

Posted by GitBox <gi...@apache.org>.
huaxingao commented on code in PR #36696:
URL: https://github.com/apache/spark/pull/36696#discussion_r886336186


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala:
##########
@@ -440,94 +440,126 @@ class ParquetFilters(
   }
 
   private val makeInPredicate:
-    PartialFunction[ParquetSchemaType,
-      (Array[String], Array[Any], ParquetStatistics[_]) => FilterPredicate] = {
+    PartialFunction[ParquetSchemaType, (Array[String], Any) => FilterPredicate] = {
+    case ParquetBooleanType =>
+      (n: Array[String], v: Any) =>
+        val values = Option(v).map(_.asInstanceOf[Array[Object]]).orNull

Review Comment:
   Thanks for taking a look. I don't think `v` could be null, but this line is actually not needed, so I deleted.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] huaxingao commented on pull request #36696: [SPARK-39312][SQL] Use parquet native In predicate for in filter push down

Posted by GitBox <gi...@apache.org>.
huaxingao commented on PR #36696:
URL: https://github.com/apache/spark/pull/36696#issuecomment-1157221620

   @wangyum Thank you very much for helping me test this!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #36696: [SPARK-39312][SQL] Use parquet native In predicate for in filter push down

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on PR #36696:
URL: https://github.com/apache/spark/pull/36696#issuecomment-1163974482

   No problem~ Thank you for informing that, @huaxingao . Take your time.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] huaxingao commented on pull request #36696: [SPARK-39312][SQL] Use parquet native In predicate for in filter push down

Posted by GitBox <gi...@apache.org>.
huaxingao commented on PR #36696:
URL: https://github.com/apache/spark/pull/36696#issuecomment-1186233855

   Thanks you all very much!  


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wangyum commented on pull request #36696: [SPARK-39312][SQL] Use parquet native In predicate for in filter push down

Posted by GitBox <gi...@apache.org>.
wangyum commented on PR #36696:
URL: https://github.com/apache/spark/pull/36696#issuecomment-1186100608

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] huaxingao commented on pull request #36696: [SPARK-39312][SQL] Use parquet native In predicate for in filter push down

Posted by GitBox <gi...@apache.org>.
huaxingao commented on PR #36696:
URL: https://github.com/apache/spark/pull/36696#issuecomment-1139309073

   cc @wangyum 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #36696: [SPARK-39312][SQL] Use parquet native In predicate for in filter push down

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on PR #36696:
URL: https://github.com/apache/spark/pull/36696#issuecomment-1179338350

   Gentle ping, @huaxingao .


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #36696: [SPARK-39312][SQL] Use parquet native In predicate for in filter push down

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on PR #36696:
URL: https://github.com/apache/spark/pull/36696#issuecomment-1186419071

   Thank you, @huaxingao and all!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #36696: [SPARK-39312][SQL] Use parquet native In predicate for in filter push down

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on PR #36696:
URL: https://github.com/apache/spark/pull/36696#issuecomment-1182027384

   Could you re-trigger once more, @huaxingao ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wangyum closed pull request #36696: [SPARK-39312][SQL] Use parquet native In predicate for in filter push down

Posted by GitBox <gi...@apache.org>.
wangyum closed pull request #36696: [SPARK-39312][SQL] Use parquet native In predicate for in filter push down
URL: https://github.com/apache/spark/pull/36696


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wangyum commented on pull request #36696: [SPARK-39312][SQL] Use parquet native In predicate for in filter push down

Posted by GitBox <gi...@apache.org>.
wangyum commented on PR #36696:
URL: https://github.com/apache/spark/pull/36696#issuecomment-1146989466

   Please wait me several days.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wangyum commented on pull request #36696: [SPARK-39312][SQL] Use parquet native In predicate for in filter push down

Posted by GitBox <gi...@apache.org>.
wangyum commented on PR #36696:
URL: https://github.com/apache/spark/pull/36696#issuecomment-1175662846

   Could you update the `FilterPushdownBenchmark` results?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wangyum commented on pull request #36696: [SPARK-39312][SQL] Use parquet native In predicate for in filter push down

Posted by GitBox <gi...@apache.org>.
wangyum commented on PR #36696:
URL: https://github.com/apache/spark/pull/36696#issuecomment-1157212653

   I have tested it more than 2 weeks and no data issue.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #36696: [SPARK-39312][SQL] Use parquet native In predicate for in filter push down

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on code in PR #36696:
URL: https://github.com/apache/spark/pull/36696#discussion_r884327960


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala:
##########
@@ -725,26 +757,8 @@ class ParquetFilters(
 
       case sources.In(name, values) if pushDownInFilterThreshold > 0 && values.nonEmpty &&
           canMakeFilterOn(name, values.head) =>
-        val fieldType = nameToParquetField(name).fieldType
-        val fieldNames = nameToParquetField(name).fieldNames
-        if (values.length <= pushDownInFilterThreshold) {
-          values.distinct.flatMap { v =>
-            makeEq.lift(fieldType).map(_(fieldNames, v))
-          }.reduceLeftOption(FilterApi.or)

Review Comment:
   Could you add a test coverage for @wangyum 's case, please, @huaxingao ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #36696: [SPARK-39312][SQL] Use parquet native In predicate for in filter push down

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on PR #36696:
URL: https://github.com/apache/spark/pull/36696#issuecomment-1146909697

   cc @sunchao 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org