Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/08/31 19:12:02 UTC

[GitHub] [spark] revans2 opened a new pull request, #37747: [SPARK-40280][SQL] Add support for parquet push down for annotated int and long

revans2 opened a new pull request, #37747:
URL: https://github.com/apache/spark/pull/37747

   ### What changes were proposed in this pull request?
   This fixes SPARK-40280 by normalizing a Parquet int/long column that carries the optional signed-integer annotation so that it looks like the expected form without the extra metadata.
   
   ### Why are the changes needed?
   This allows Parquet predicate pushdown to work when reading files that are compliant with the Parquet specification but different from what Spark writes.
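   
   For illustration (not part of this PR), here is a minimal sketch of the two schema shapes involved, built with the parquet-mr `Types` builder; the column name `_c0` is only an example:
   ```scala
   import org.apache.parquet.schema.{LogicalTypeAnnotation, Types}
   import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.INT32

   // Spark's own writer produces a plain INT32 with no logical type annotation ...
   val plain = Types.optional(INT32).named("_c0")

   // ... while another spec-compliant writer may attach the equivalent
   // signed 32-bit annotation to the same column.
   val annotated = Types.optional(INT32)
     .as(LogicalTypeAnnotation.intType(32, true))
     .named("_c0")

   // Both describe the same data; before this change the annotated form did not
   // match the lookup used to build pushdown predicates.
   ```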
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   I added unit tests that cover this use case. I also manually tested some queries to verify that less data is actually read after this change.
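   
   As a rough illustration of that manual check (not from the PR; the file path and column name below are hypothetical), one can confirm the predicate reaches the Parquet scan by looking for it under `PushedFilters` in the physical plan:
   ```scala
   // Read a file whose int column carries the signed 32-bit annotation and
   // verify the filter shows up as a pushed-down Parquet predicate.
   val df = spark.read.parquet("/tmp/annotated-int.parquet")
   df.filter(df("_c0") > 3).explain()
   // Expected, roughly: ... FileScan parquet ... PushedFilters: [IsNotNull(_c0), GreaterThan(_c0,3)] ...
   ```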




[GitHub] [spark] HyukjinKwon commented on pull request #37747: [SPARK-40280][SQL] Add support for parquet push down for annotated int and long

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on PR #37747:
URL: https://github.com/apache/spark/pull/37747#issuecomment-1236530424

   Looks like it makes sense to me too




[GitHub] [spark] HyukjinKwon commented on pull request #37747: [SPARK-40280][SQL] Add support for parquet push down for annotated int and long

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on PR #37747:
URL: https://github.com/apache/spark/pull/37747#issuecomment-1236528066

   cc @wangyum FYI




[GitHub] [spark] sadikovi commented on a diff in pull request #37747: [SPARK-40280][SQL] Add support for parquet push down for annotated int and long

Posted by GitBox <gi...@apache.org>.
sadikovi commented on code in PR #37747:
URL: https://github.com/apache/spark/pull/37747#discussion_r962472499


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala:
##########
@@ -59,6 +59,21 @@ class ParquetFilters(
   // nested columns. If any part of the names contains `dots`, it is quoted to avoid confusion.
   // See `org.apache.spark.sql.connector.catalog.quote` for implementation details.
   private val nameToParquetField : Map[String, ParquetPrimitiveField] = {
+    def getNormalizedLogicalType(p: PrimitiveType): LogicalTypeAnnotation = {
+      // signed 64 bits on an INT64 and signed 32 bits on an INT32 are optional, but the rest of
+      // the code here assumes they are not set, so normalize them to not being set. SPARK-40280
+      p.getLogicalTypeAnnotation match {
+        case la : IntLogicalTypeAnnotation if la.isSigned &&
+            la.getBitWidth == 32 && p.getPrimitiveTypeName == PrimitiveTypeName.INT32 =>
+          null
+        case la : IntLogicalTypeAnnotation if la.isSigned &&
+            la.getBitWidth == 64 && p.getPrimitiveTypeName == PrimitiveTypeName.INT64 =>

Review Comment:
   I would suggest rewriting the match statement like this, IMHO cleaner: 
   ```scala
   (p.getPrimitiveTypeName, p.getLogicalTypeAnnotation) match {
     case (INT32, intType: IntLogicalTypeAnnotation)
       if intType.getBitWidth() == 32 && intType.isSigned() => null
     case (INT64, intType: IntLogicalTypeAnnotation)
       if intType.getBitWidth() == 64 && intType.isSigned() => null
     case (_, otherType) => otherType
   }
   ```





[GitHub] [spark] asfgit closed pull request #37747: [SPARK-40280][SQL] Add support for parquet push down for annotated int and long

Posted by GitBox <gi...@apache.org>.
asfgit closed pull request #37747: [SPARK-40280][SQL] Add support for parquet push down for annotated int and long
URL: https://github.com/apache/spark/pull/37747




[GitHub] [spark] huaxingao commented on a diff in pull request #37747: [SPARK-40280][SQL] Add support for parquet push down for annotated int and long

Posted by GitBox <gi...@apache.org>.
huaxingao commented on code in PR #37747:
URL: https://github.com/apache/spark/pull/37747#discussion_r962039648


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala:
##########
@@ -76,7 +76,6 @@ import org.apache.spark.util.{AccumulatorContext, AccumulatorV2, Utils}
  * within the test.
  */
 abstract class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSparkSession {
-

Review Comment:
   nit: unnecessary change?



##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala:
##########
@@ -370,6 +369,38 @@ abstract class ParquetFilterSuite extends QueryTest with ParquetTest with Shared
     }
   }
 
+  test("filter pushdown - int SPARK-40280") {

Review Comment:
   nit: maybe `test("SPARK-40280: filter pushdown - int with annotation")`?





[GitHub] [spark] sadikovi commented on a diff in pull request #37747: [SPARK-40280][SQL] Add support for parquet push down for annotated int and long

Posted by GitBox <gi...@apache.org>.
sadikovi commented on code in PR #37747:
URL: https://github.com/apache/spark/pull/37747#discussion_r962471116


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala:
##########
@@ -59,6 +59,21 @@ class ParquetFilters(
   // nested columns. If any part of the names contains `dots`, it is quoted to avoid confusion.
   // See `org.apache.spark.sql.connector.catalog.quote` for implementation details.
   private val nameToParquetField : Map[String, ParquetPrimitiveField] = {
+    def getNormalizedLogicalType(p: PrimitiveType): LogicalTypeAnnotation = {
+      // signed 64 bits on an INT64 and signed 32 bits on an INT32 are optional, but the rest of
+      // the code here assumes they are not set, so normalize them to not being set. SPARK-40280
+      p.getLogicalTypeAnnotation match {
+        case la : IntLogicalTypeAnnotation if la.isSigned &&

Review Comment:
   nit: `case la: IntLogicalTypeAnnotation`



##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala:
##########
@@ -370,6 +369,38 @@ abstract class ParquetFilterSuite extends QueryTest with ParquetTest with Shared
     }
   }
 
+  test("filter pushdown - int SPARK-40280") {

Review Comment:
   +1



##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala:
##########
@@ -59,6 +59,21 @@ class ParquetFilters(
   // nested columns. If any part of the names contains `dots`, it is quoted to avoid confusion.
   // See `org.apache.spark.sql.connector.catalog.quote` for implementation details.
   private val nameToParquetField : Map[String, ParquetPrimitiveField] = {
+    def getNormalizedLogicalType(p: PrimitiveType): LogicalTypeAnnotation = {
+      // signed 64 bits on an INT64 and signed 32 bits on an INT32 are optional, but the rest of
+      // the code here assumes they are not set, so normalize them to not being set. SPARK-40280
+      p.getLogicalTypeAnnotation match {
+        case la : IntLogicalTypeAnnotation if la.isSigned &&
+            la.getBitWidth == 32 && p.getPrimitiveTypeName == PrimitiveTypeName.INT32 =>
+          null
+        case la : IntLogicalTypeAnnotation if la.isSigned &&

Review Comment:
   nit: `case la: IntLogicalTypeAnnotation`



##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala:
##########
@@ -59,6 +59,21 @@ class ParquetFilters(
   // nested columns. If any part of the names contains `dots`, it is quoted to avoid confusion.
   // See `org.apache.spark.sql.connector.catalog.quote` for implementation details.
   private val nameToParquetField : Map[String, ParquetPrimitiveField] = {
+    def getNormalizedLogicalType(p: PrimitiveType): LogicalTypeAnnotation = {
+      // signed 64 bits on an INT64 and signed 32 bits on an INT32 are optional, but the rest of
+      // the code here assumes they are not set, so normalize them to not being set. SPARK-40280
+      p.getLogicalTypeAnnotation match {
+        case la : IntLogicalTypeAnnotation if la.isSigned &&

Review Comment:
   Also, what does "la" stand for?



##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala:
##########
@@ -59,6 +59,21 @@ class ParquetFilters(
   // nested columns. If any part of the names contains `dots`, it is quoted to avoid confusion.
   // See `org.apache.spark.sql.connector.catalog.quote` for implementation details.
   private val nameToParquetField : Map[String, ParquetPrimitiveField] = {
+    def getNormalizedLogicalType(p: PrimitiveType): LogicalTypeAnnotation = {
+      // signed 64 bits on an INT64 and signed 32 bits on an INT32 are optional, but the rest of

Review Comment:
   nit: `SPARK-40280: Signed 64 ...`.



##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala:
##########
@@ -59,6 +59,21 @@ class ParquetFilters(
   // nested columns. If any part of the names contains `dots`, it is quoted to avoid confusion.
   // See `org.apache.spark.sql.connector.catalog.quote` for implementation details.
   private val nameToParquetField : Map[String, ParquetPrimitiveField] = {
+    def getNormalizedLogicalType(p: PrimitiveType): LogicalTypeAnnotation = {
+      // signed 64 bits on an INT64 and signed 32 bits on an INT32 are optional, but the rest of
+      // the code here assumes they are not set, so normalize them to not being set. SPARK-40280

Review Comment:
   nit: Remove `SPARK-40280` at the end.



##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala:
##########
@@ -59,6 +59,21 @@ class ParquetFilters(
   // nested columns. If any part of the names contains `dots`, it is quoted to avoid confusion.
   // See `org.apache.spark.sql.connector.catalog.quote` for implementation details.
   private val nameToParquetField : Map[String, ParquetPrimitiveField] = {
+    def getNormalizedLogicalType(p: PrimitiveType): LogicalTypeAnnotation = {
+      // signed 64 bits on an INT64 and signed 32 bits on an INT32 are optional, but the rest of
+      // the code here assumes they are not set, so normalize them to not being set. SPARK-40280
+      p.getLogicalTypeAnnotation match {
+        case la : IntLogicalTypeAnnotation if la.isSigned &&
+            la.getBitWidth == 32 && p.getPrimitiveTypeName == PrimitiveTypeName.INT32 =>
+          null
+        case la : IntLogicalTypeAnnotation if la.isSigned &&
+            la.getBitWidth == 64 && p.getPrimitiveTypeName == PrimitiveTypeName.INT64 =>

Review Comment:
   I would suggest rewriting the match statement like this, IMHO cleaner: 
   ```scala
   (p.getPrimitiveTypeName, p.getLogicalTypeAnnotation) match {
     case (INT32, intType: IntLogicalTypeAnnotation)
       if intType.getBitWidth() == 32 && intType.isSigned() => null
     case (INT64, intType: IntLogicalTypeAnnotation)
       if intType.getBitWidth() == 64 && intType.isSigned() => null
     case (_, otherType) => otherType
   }
   ```





[GitHub] [spark] huaxingao commented on pull request #37747: [SPARK-40280][SQL] Add support for parquet push down for annotated int and long

Posted by GitBox <gi...@apache.org>.
huaxingao commented on PR #37747:
URL: https://github.com/apache/spark/pull/37747#issuecomment-1238585228

   LGTM. Thanks a lot for working on this! @revans2 




[GitHub] [spark] tgravescs commented on pull request #37747: [SPARK-40280][SQL] Add support for parquet push down for annotated int and long

Posted by GitBox <gi...@apache.org>.
tgravescs commented on PR #37747:
URL: https://github.com/apache/spark/pull/37747#issuecomment-1240759396

   merged to master, branch-3.3 and branch-3.2.  Thanks @revans2 




[GitHub] [spark] huaxingao commented on a diff in pull request #37747: [SPARK-40280][SQL] Add support for parquet push down for annotated int and long

Posted by GitBox <gi...@apache.org>.
huaxingao commented on code in PR #37747:
URL: https://github.com/apache/spark/pull/37747#discussion_r962039704


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala:
##########
@@ -370,6 +369,38 @@ abstract class ParquetFilterSuite extends QueryTest with ParquetTest with Shared
     }
   }
 
+  test("filter pushdown - int SPARK-40280") {
+    implicit val df = readResourceParquetFile("test-data/tagged_int.parquet")
+
+    val intAttr = df("_c0").expr
+    assert(intAttr.dataType === IntegerType)
+
+    checkFilterPredicate(intAttr.isNull, classOf[Eq[_]], Seq.empty[Row])
+    checkFilterPredicate(intAttr.isNotNull, classOf[NotEq[_]],
+      (1 to 4).map(i => Row.apply(i)))
+
+    checkFilterPredicate(intAttr === 1, classOf[Eq[_]], 1)
+    checkFilterPredicate(intAttr <=> 1, classOf[Eq[_]], 1)
+    checkFilterPredicate(intAttr =!= 1, classOf[NotEq[_]],
+      (2 to 4).map(i => Row.apply(i)))
+
+    checkFilterPredicate(intAttr < 2, classOf[Lt[_]], 1)
+    checkFilterPredicate(intAttr > 3, classOf[Gt[_]], 4)
+    checkFilterPredicate(intAttr <= 1, classOf[LtEq[_]], 1)
+    checkFilterPredicate(intAttr >= 4, classOf[GtEq[_]], 4)
+
+    checkFilterPredicate(Literal(1) === intAttr, classOf[Eq[_]], 1)
+    checkFilterPredicate(Literal(1) <=> intAttr, classOf[Eq[_]], 1)
+    checkFilterPredicate(Literal(2) > intAttr, classOf[Lt[_]], 1)
+    checkFilterPredicate(Literal(3) < intAttr, classOf[Gt[_]], 4)
+    checkFilterPredicate(Literal(1) >= intAttr, classOf[LtEq[_]], 1)
+    checkFilterPredicate(Literal(4) <= intAttr, classOf[GtEq[_]], 4)
+
+    checkFilterPredicate(!(intAttr < 4), classOf[GtEq[_]], 4)
+    checkFilterPredicate(intAttr < 2 || intAttr > 3, classOf[Operators.Or],
+      Seq(Row(1), Row(4)))

Review Comment:
   nit: shall we also test `In` ?
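   
   A hedged sketch of what such a check might look like, following the style of the surrounding suite (`Literal` is already used in the diff above; `In` is the Catalyst expression, assumed to be in scope in the suite; the rewrite of a small IN list into `Eq` predicates joined by `Or` before pushdown is also assumed):
   ```scala
   // Hypothetical addition: a small IN list over the same annotated int column.
   checkFilterPredicate(
     In(intAttr, Seq(2, 3).map(Literal.apply)),
     classOf[Operators.Or],
     Seq(Row(2), Row(3)))
   ```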





[GitHub] [spark] huaxingao commented on pull request #37747: [SPARK-40280][SQL] Add support for parquet push down for annotated int and long

Posted by GitBox <gi...@apache.org>.
huaxingao commented on PR #37747:
URL: https://github.com/apache/spark/pull/37747#issuecomment-1235931948

   The changes look good to me. 




[GitHub] [spark] revans2 commented on pull request #37747: [SPARK-40280][SQL] Add support for parquet push down for annotated int and long

Posted by GitBox <gi...@apache.org>.
revans2 commented on PR #37747:
URL: https://github.com/apache/spark/pull/37747#issuecomment-1238467818

   @huaxingao and @sadikovi I think I have addressed all of your review comments. Please take another look.

