Posted to reviews@spark.apache.org by "cashmand (via GitHub)" <gi...@apache.org> on 2024/03/25 14:35:41 UTC

[PR] [SPARK-47546] Improve validation and add tests. [spark]

cashmand opened a new pull request, #45703:
URL: https://github.com/apache/spark/pull/45703

   
   ### What changes were proposed in this pull request?
   
   The parquet schema converter for Variant assumes that the first two fields in a parquet struct are the value and metadata for Variant, respectively. This isn't very robust, and could break if other writers wrote them in a different order, or if a future extension added new fields to the Variant structure. This PR adds more careful validation of the Variant column based on field names, and fails if there are any unexpected columns.
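   For orientation, a minimal sketch of the name-based lookup described here (illustrative only; error handling is simplified, and the identifiers are those of the converter code quoted in the review comments below, where the actual change lives):
   
   ```
   // Sketch only: locate the Variant sub-columns by name rather than by position,
   // and reject schemas with missing or unexpected extra fields.
   val names = (0 until groupColumn.getChildrenCount).map(groupColumn.getChild(_).getName)
   val valid = names.size == 2 && names.contains("value") && names.contains("metadata")
   if (!valid) {
     // The real code distinguishes "unsupported" (extra fields) from "illegal" (missing fields).
     throw QueryCompilationErrors.parquetTypeUnsupportedYetError("variant")
   }
   ```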
   
   ### Why are the changes needed?
   
   Make Variant more robust to issues in the parquet schema.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No, Variant isn't released yet.
   
   ### How was this patch tested?
   
   Added a unit test.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.




Re: [PR] [SPARK-47546][SQL] Improve validation when reading Variant from Parquet [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on PR #45703:
URL: https://github.com/apache/spark/pull/45703#issuecomment-2024338004

   thanks, merging to master!




Re: [PR] [SPARK-47546] Improve validation and add tests. [spark]

Posted by "cashmand (via GitHub)" <gi...@apache.org>.
cashmand commented on code in PR #45703:
URL: https://github.com/apache/spark/pull/45703#discussion_r1537802607


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala:
##########
@@ -179,9 +179,21 @@ class ParquetToSparkSchemaConverter(
     field match {
       case primitiveColumn: PrimitiveColumnIO => convertPrimitiveField(primitiveColumn, targetType)
       case groupColumn: GroupColumnIO if targetType.contains(VariantType) =>
+        if (groupColumn.getChildrenCount != 2 ) {
+          // We may allow more than two children in the future, so consider this unsupported.
+          throw QueryCompilationErrors.
+            parquetTypeUnsupportedYetError("variant with more than two fields")
+        }
+        val valueIdx = (0 until groupColumn.getChildrenCount)
+            .find(groupColumn.getChild(_).getName == "value")
+        val metadataIdx = (0 until groupColumn.getChildrenCount)
+            .find(groupColumn.getChild(_).getName == "metadata")

Review Comment:
   I don't have a strong opinion. In the test, I specifically made a test case that allows either order. I can't think of a strong reason to allow or forbid arbitrary ordering.





Re: [PR] [SPARK-47546][SQL] Improve validation when reading Variant from Parquet [spark]

Posted by "cashmand (via GitHub)" <gi...@apache.org>.
cashmand commented on code in PR #45703:
URL: https://github.com/apache/spark/pull/45703#discussion_r1539595678


##########
sql/core/src/test/scala/org/apache/spark/sql/VariantSuite.scala:
##########
@@ -192,4 +192,48 @@ class VariantSuite extends QueryTest with SharedSparkSession {
       }
     }
   }
+
+  test("invalid variant binary") {
+    // Write a struct-of-binary that looks like a Variant, but with minor variations that may make
+    // it invalid to read.
+    // Test cases:
+    // 1) A binary that has the correct parquet structure for Variant.
+    // 2) A binary that has the correct parquet structure in unexpected order. This is valid.

Review Comment:
   Done.





Re: [PR] [SPARK-47546][SQL] Improve validation when reading Variant from Parquet [spark]

Posted by "cashmand (via GitHub)" <gi...@apache.org>.
cashmand commented on code in PR #45703:
URL: https://github.com/apache/spark/pull/45703#discussion_r1539495963


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala:
##########
@@ -404,6 +401,35 @@ class ParquetToSparkSchemaConverter(
     }
   }
 
+  private def convertVariantField(groupColumn: GroupColumnIO): ParquetColumn = {
+    if (groupColumn.getChildrenCount != 2 ) {
+      // We may allow more than two children in the future, so consider this unsupported.
+      throw QueryCompilationErrors.
+        parquetTypeUnsupportedYetError("variant with more than two fields")
+    }
+    // Find the binary columns, and validate that they have the correct type.
+    val valueAndMetadata = Seq("value", "metadata").map { colName =>
+      val idx = (0 until groupColumn.getChildrenCount)
+          .find(groupColumn.getChild(_).getName == colName)

Review Comment:
   I'm not sure I understand the concern, or how you'd like it changed. Can you clarify? I'd like to basically do the same search for "value" and "metadata" that is done for struct fields in `convertInternal`, and then do validation on each one. It's not clear to me what the easier approach would be.





Re: [PR] [SPARK-47546][SQL] Improve validation when reading Variant from Parquet [spark]

Posted by "cashmand (via GitHub)" <gi...@apache.org>.
cashmand commented on code in PR #45703:
URL: https://github.com/apache/spark/pull/45703#discussion_r1538074039


##########
sql/core/src/test/scala/org/apache/spark/sql/VariantSuite.scala:
##########
@@ -192,4 +193,48 @@ class VariantSuite extends QueryTest with SharedSparkSession {
       }
     }
   }
+
+  test("invalid variant binary") {
+    // Write a struct-of-binary that looks like a Variant, but with minor variations that may make
+    // it invalid to read.
+    // Test cases:
+    // 1) A binary that has the correct parquet structure for Variant.
+    // 2) A binary that has the correct parquet structure in unexpected order. This is valid.
+    // 3) A binary that is almost correct, but contains an extra field "paths"
+    // 4,5) A binary with incorrect field names
+    // 6) Incorrect data type
+    // 7,8) Nullable value or metadata
+
+    // Binary value of empty metadata
+    val m = "X'010000'"
+    // Binary value of a literal "false"
+    val v = "X'8'"
+    val cases = Seq(
+        (true, s"named_struct('value', $v, 'metadata', $m)"),
+        (true, s"named_struct('metadata', $m, 'value', $v)"),
+        (false, s"named_struct('value', $v, 'metadata', $m, 'paths', $v)"),
+        (false, s"named_struct('value', $v, 'dictionary', $m)"),
+        (false, s"named_struct('val', $v, 'metadata', $m)"),
+        (false, s"named_struct('value', 8, 'metadata', $m)"),
+        (false, s"named_struct('value', cast(null as binary), 'metadata', $m)"),
+        (false, s"named_struct('value', $v, 'metadata', cast(null as binary))")
+    )
+    cases.foreach { case (expectValid, structDef) =>
+      withTempDir { dir =>
+        val file = new File(dir, "dir").getCanonicalPath
+        val df = spark.sql(s"select $structDef as v from range(10)")
+        df.write.parquet(file)
+        val schema = StructType(Seq(StructField("v", VariantType)))
+        val result = spark.read.schema(schema).parquet(file)
+            .selectExpr("to_json(v)")
+        if (expectValid) {
+          checkAnswer(result, Seq.fill(10)(Row("false")))
+        } else {
+          val e = intercept[org.apache.spark.SparkException](result.collect())
+          assert(e.getCause.isInstanceOf[AnalysisException] ||

Review Comment:
   I just refactored a bit so that we only throw AnalysisException, not SchemaColumnConvertNotSupportedException. We'll either throw QueryCompilationErrors.illegalParquetTypeError or QueryCompilationErrors.parquetTypeUnsupportedYetError.
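   With that change, a minimal sketch of how the assertion in the quoted test could be narrowed (illustrative only, reusing the `result` DataFrame from the test above):
   
   ```
   // Sketch: after the refactor, only AnalysisException is expected as the cause.
   val e = intercept[org.apache.spark.SparkException](result.collect())
   assert(e.getCause.isInstanceOf[AnalysisException])
   ```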





Re: [PR] [SPARK-47546][SQL] Improve validation when reading Variant from Parquet [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on code in PR #45703:
URL: https://github.com/apache/spark/pull/45703#discussion_r1539561237


##########
sql/core/src/test/scala/org/apache/spark/sql/VariantSuite.scala:
##########
@@ -192,4 +192,48 @@ class VariantSuite extends QueryTest with SharedSparkSession {
       }
     }
   }
+
+  test("invalid variant binary") {
+    // Write a struct-of-binary that looks like a Variant, but with minor variations that may make
+    // it invalid to read.
+    // Test cases:
+    // 1) A binary that has the correct parquet structure for Variant.
+    // 2) A binary that has the correct parquet structure in unexpected order. This is valid.

Review Comment:
   To @cashmand, if you want to keep them, please make a new test case for that positive case.





Re: [PR] [SPARK-47546] Improve validation and add tests. [spark]

Posted by "cashmand (via GitHub)" <gi...@apache.org>.
cashmand commented on code in PR #45703:
URL: https://github.com/apache/spark/pull/45703#discussion_r1537853307


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala:
##########
@@ -179,9 +179,21 @@ class ParquetToSparkSchemaConverter(
     field match {
       case primitiveColumn: PrimitiveColumnIO => convertPrimitiveField(primitiveColumn, targetType)
       case groupColumn: GroupColumnIO if targetType.contains(VariantType) =>
+        if (groupColumn.getChildrenCount != 2 ) {
+          // We may allow more than two children in the future, so consider this unsupported.
+          throw QueryCompilationErrors.
+            parquetTypeUnsupportedYetError("variant with more than two fields")
+        }
+        val valueIdx = (0 until groupColumn.getChildrenCount)
+            .find(groupColumn.getChild(_).getName == "value")
+        val metadataIdx = (0 until groupColumn.getChildrenCount)
+            .find(groupColumn.getChild(_).getName == "metadata")

Review Comment:
   I was thinking of our storage model of Variant as being a `struct-of-binary`, and since struct fields are not sensitive to ordering, these fields shouldn't be either.





Re: [PR] [SPARK-47546][SQL] Improve validation when reading Variant from Parquet [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on code in PR #45703:
URL: https://github.com/apache/spark/pull/45703#discussion_r1539455046


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala:
##########
@@ -404,6 +401,35 @@ class ParquetToSparkSchemaConverter(
     }
   }
 
+  private def convertVariantField(groupColumn: GroupColumnIO): ParquetColumn = {
+    if (groupColumn.getChildrenCount != 2 ) {
+      // We may allow more than two children in the future, so consider this unsupported.
+      throw QueryCompilationErrors.
+        parquetTypeUnsupportedYetError("variant with more than two fields")
+    }
+    // Find the binary columns, and validate that they have the correct type.
+    val valueAndMetadata = Seq("value", "metadata").map { colName =>
+      val idx = (0 until groupColumn.getChildrenCount)
+          .find(groupColumn.getChild(_).getName == colName)
+      if (idx.isEmpty) {
+        throw QueryCompilationErrors.illegalParquetTypeError(s"variant missing $colName field")
+      }
+      val child = groupColumn.getChild(idx.get)
+      // The value and metadata cannot be individually null, only the full struct can.
+      if (child.getType.getRepetition != REQUIRED ||
+        !child.isInstanceOf[PrimitiveColumnIO] ||

Review Comment:
   nit. Indentation?





Re: [PR] [SPARK-47546][SQL] Improve validation when reading Variant from Parquet [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on code in PR #45703:
URL: https://github.com/apache/spark/pull/45703#discussion_r1539476481


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala:
##########
@@ -404,6 +401,35 @@ class ParquetToSparkSchemaConverter(
     }
   }
 
+  private def convertVariantField(groupColumn: GroupColumnIO): ParquetColumn = {
+    if (groupColumn.getChildrenCount != 2 ) {
+      // We may allow more than two children in the future, so consider this unsupported.
+      throw QueryCompilationErrors.
+        parquetTypeUnsupportedYetError("variant with more than two fields")
+    }
+    // Find the binary columns, and validate that they have the correct type.
+    val valueAndMetadata = Seq("value", "metadata").map { colName =>
+      val idx = (0 until groupColumn.getChildrenCount)
+          .find(groupColumn.getChild(_).getName == colName)

Review Comment:
   Just a question. Although this is supposed to be a lightweight procedure, do we need to use iteration loops like the above (line 411 ~ 413)? I'm just wondering if we can do this more simply.
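   For illustration, one possible simplification would be to index the children by name once and then look each field up (a sketch, not code from the PR):
   
   ```
   // Sketch: build a name -> child map once instead of scanning the children per field.
   val childrenByName = (0 until groupColumn.getChildrenCount)
     .map(groupColumn.getChild)
     .map(child => child.getName -> child)
     .toMap
   val valueAndMetadata = Seq("value", "metadata").map { colName =>
     childrenByName.getOrElse(colName,
       throw QueryCompilationErrors.illegalParquetTypeError(s"variant missing $colName field"))
   }
   ```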





Re: [PR] [SPARK-47546][SQL] Improve validation when reading Variant from Parquet [spark]

Posted by "cashmand (via GitHub)" <gi...@apache.org>.
cashmand commented on PR #45703:
URL: https://github.com/apache/spark/pull/45703#issuecomment-2018438251

   Hi @cloud-fan and @dongjoon-hyun,
   
   I updated the title, and made a few more changes:
   1) Refactored the Variant checks into a separate function, similar to what already exists for scalar and struct types.
   2) Added a few more test cases.
   3) Added a check that the value and metadata fields in parquet are non-nullable. This one is also somewhat up for debate: without the check, we would fail somewhere downstream with a null pointer exception if we encounter a null value, but it seems clearer to enforce it at the parquet metadata level.




Re: [PR] [SPARK-47546][SQL] Improve validation when reading Variant from Parquet [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on code in PR #45703:
URL: https://github.com/apache/spark/pull/45703#discussion_r1539448653


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala:
##########
@@ -404,6 +401,35 @@ class ParquetToSparkSchemaConverter(
     }
   }
 
+  private def convertVariantField(groupColumn: GroupColumnIO): ParquetColumn = {
+    if (groupColumn.getChildrenCount != 2 ) {

Review Comment:
   nit. `2 )` -> `2)`





Re: [PR] [SPARK-47546][SQL] Improve validation when reading Variant from Parquet [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on code in PR #45703:
URL: https://github.com/apache/spark/pull/45703#discussion_r1539468054


##########
sql/core/src/test/scala/org/apache/spark/sql/VariantSuite.scala:
##########
@@ -192,4 +192,48 @@ class VariantSuite extends QueryTest with SharedSparkSession {
       }
     }
   }
+
+  test("invalid variant binary") {
+    // Write a struct-of-binary that looks like a Variant, but with minor variations that may make
+    // it invalid to read.
+    // Test cases:
+    // 1) A binary that has the correct parquet structure for Variant.
+    // 2) A binary that has the correct parquet structure in unexpected order. This is valid.

Review Comment:
   Please remove (1) and (2) from this test case. They don't match the test case name or the description above.
   
   >    // Write a struct-of-binary that looks like a Variant, but with minor variations that may make
   >    // it invalid to read.







Re: [PR] [SPARK-47546][SQL] Improve validation when reading Variant from Parquet [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on code in PR #45703:
URL: https://github.com/apache/spark/pull/45703#discussion_r1539459628


##########
sql/core/src/test/scala/org/apache/spark/sql/VariantSuite.scala:
##########
@@ -192,4 +192,48 @@ class VariantSuite extends QueryTest with SharedSparkSession {
       }
     }
   }
+
+  test("invalid variant binary") {

Review Comment:
   nit.
   ```
   test("invalid variant binary") {
   test("SPARK-47546: invalid variant binary") {
   ```





Re: [PR] [SPARK-47546][SQL] Improve validation when reading Variant from Parquet [spark]

Posted by "cashmand (via GitHub)" <gi...@apache.org>.
cashmand commented on PR #45703:
URL: https://github.com/apache/spark/pull/45703#issuecomment-2024035501

   @cloud-fan @dongjoon-hyun 
   
   It looks like the checks I added are only effective for the vectorized reader, and I probably need to add similar checks for parquet-mr (and modify my unit test to test with and without parquet-mr). I think it's fine to merge this PR and make the parquet-mr fixes in a follow-up, but if you prefer, I can work on adding the checks for parquet-mr to this PR.
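
   A minimal sketch of how the test could cover both readers (assuming the usual vectorized-reader flag; not part of this PR):
   
   ```
   // Sketch: run the invalid-variant cases with the vectorized reader on and off.
   Seq("true", "false").foreach { vectorized =>
     withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key -> vectorized) {
       // ... run the write/read cases from the "invalid variant binary" test here ...
     }
   }
   ```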




Re: [PR] [SPARK-47546][SQL] Improve validation when reading Variant from Parquet [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan closed pull request #45703: [SPARK-47546][SQL] Improve validation when reading Variant from Parquet
URL: https://github.com/apache/spark/pull/45703




Re: [PR] [SPARK-47546] Improve validation and add tests. [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45703:
URL: https://github.com/apache/spark/pull/45703#discussion_r1537790257


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala:
##########
@@ -179,9 +179,21 @@ class ParquetToSparkSchemaConverter(
     field match {
       case primitiveColumn: PrimitiveColumnIO => convertPrimitiveField(primitiveColumn, targetType)
       case groupColumn: GroupColumnIO if targetType.contains(VariantType) =>
+        if (groupColumn.getChildrenCount != 2 ) {
+          // We may allow more than two children in the future, so consider this unsupported.
+          throw QueryCompilationErrors.
+            parquetTypeUnsupportedYetError("variant with more than two fields")
+        }
+        val valueIdx = (0 until groupColumn.getChildrenCount)
+            .find(groupColumn.getChild(_).getName == "value")
+        val metadataIdx = (0 until groupColumn.getChildrenCount)
+            .find(groupColumn.getChild(_).getName == "metadata")

Review Comment:
   do we need to require the ordering? e.g. value first, then metadata.





Re: [PR] [SPARK-47546][SQL] Improve validation when reading Variant from Parquet [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45703:
URL: https://github.com/apache/spark/pull/45703#discussion_r1537898738


##########
sql/core/src/test/scala/org/apache/spark/sql/VariantSuite.scala:
##########
@@ -192,4 +193,48 @@ class VariantSuite extends QueryTest with SharedSparkSession {
       }
     }
   }
+
+  test("invalid variant binary") {
+    // Write a struct-of-binary that looks like a Variant, but with minor variations that may make
+    // it invalid to read.
+    // Test cases:
+    // 1) A binary that has the correct parquet structure for Variant.
+    // 2) A binary that has the correct parquet structure in unexpected order. This is valid.
+    // 3) A binary that is almost correct, but contains an extra field "paths"
+    // 4,5) A binary with incorrect field names
+    // 6) Incorrect data type
+    // 7,8) Nullable value or metadata
+
+    // Binary value of empty metadata
+    val m = "X'010000'"
+    // Binary value of a literal "false"
+    val v = "X'8'"
+    val cases = Seq(
+        (true, s"named_struct('value', $v, 'metadata', $m)"),
+        (true, s"named_struct('metadata', $m, 'value', $v)"),
+        (false, s"named_struct('value', $v, 'metadata', $m, 'paths', $v)"),
+        (false, s"named_struct('value', $v, 'dictionary', $m)"),
+        (false, s"named_struct('val', $v, 'metadata', $m)"),
+        (false, s"named_struct('value', 8, 'metadata', $m)"),
+        (false, s"named_struct('value', cast(null as binary), 'metadata', $m)"),
+        (false, s"named_struct('value', $v, 'metadata', cast(null as binary))")
+    )
+    cases.foreach { case (expectValid, structDef) =>
+      withTempDir { dir =>
+        val file = new File(dir, "dir").getCanonicalPath
+        val df = spark.sql(s"select $structDef as v from range(10)")
+        df.write.parquet(file)
+        val schema = StructType(Seq(StructField("v", VariantType)))
+        val result = spark.read.schema(schema).parquet(file)
+            .selectExpr("to_json(v)")
+        if (expectValid) {
+          checkAnswer(result, Seq.fill(10)(Row("false")))
+        } else {
+          val e = intercept[org.apache.spark.SparkException](result.collect())
+          assert(e.getCause.isInstanceOf[AnalysisException] ||

Review Comment:
   when will we throw AnalysisException?





Re: [PR] [SPARK-47546][SQL] Improve validation when reading Variant from Parquet [spark]

Posted by "cashmand (via GitHub)" <gi...@apache.org>.
cashmand commented on code in PR #45703:
URL: https://github.com/apache/spark/pull/45703#discussion_r1539484303


##########
sql/core/src/test/scala/org/apache/spark/sql/VariantSuite.scala:
##########
@@ -192,4 +192,48 @@ class VariantSuite extends QueryTest with SharedSparkSession {
       }
     }
   }
+
+  test("invalid variant binary") {
+    // Write a struct-of-binary that looks like a Variant, but with minor variations that may make
+    // it invalid to read.
+    // Test cases:
+    // 1) A binary that has the correct parquet structure for Variant.
+    // 2) A binary that has the correct parquet structure in unexpected order. This is valid.

Review Comment:
   I'd like to leave them in some form, as a sanity check that we can successfully read the values when they're in the right format. Manually constructing these binary values isn't the normal way to create a Variant, and I want to make sure the other test cases fail because of their variations from this working case, not because the working case itself is broken for some reason.
   
   I could add a comment to clarify this, or move these two into a separate test. Let me know what you think.
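   
   For example, splitting them out could look roughly like this (a sketch reusing the setup from the existing test; the test name is illustrative):
   
   ```
   // Sketch: keep the two well-formed layouts as a separate positive test.
   test("SPARK-47546: valid variant binary layouts") {
     val m = "X'010000'"  // binary value of empty Variant metadata
     val v = "X'8'"       // binary value of a Variant literal "false"
     Seq(
       s"named_struct('value', $v, 'metadata', $m)",
       s"named_struct('metadata', $m, 'value', $v)"
     ).foreach { structDef =>
       withTempDir { dir =>
         val file = new File(dir, "dir").getCanonicalPath
         spark.sql(s"select $structDef as v from range(10)").write.parquet(file)
         val schema = StructType(Seq(StructField("v", VariantType)))
         val result = spark.read.schema(schema).parquet(file).selectExpr("to_json(v)")
         checkAnswer(result, Seq.fill(10)(Row("false")))
       }
     }
   }
   ```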


