Posted to reviews@spark.apache.org by "shujingyang-db (via GitHub)" <gi...@apache.org> on 2024/03/15 20:55:46 UTC

[PR] [SPARK-47309][SQL] XML: Add schema inference tests for value tags [spark]

shujingyang-db opened a new pull request, #45538:
URL: https://github.com/apache/spark/pull/45538

   
   ### What changes were proposed in this pull request?
   Add schema inference tests for corrupt records, null values, and value tags. For value tags, this PR adds tests for the following cases (a short sketch of the value-tag concept follows this list):
   1. Conflicts between primitive types
   2. Root-level value tags
   3. Empty value tags in some rows
   4. Arrays of value tags:
      1) values split across multiple lines
      2) value tags interspersed in nested structs, including empty and optional struct fields
      3) value tags interspersed in arrays, including empty and optional struct fields
      4) name conflicts
      5) CDATA sections and comments
      6) no spaces / some spaces / other whitespace between value tags and elements
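
   As a rough illustration of the "value tag" concept (a hypothetical sketch, not code from this PR): character data interleaved with child elements is collected under the value tag field, whose name defaults to `_VALUE`. Run in a Spark shell against the built-in XML source, something like the following shows the inferred shape:

   ```scala
   import org.apache.spark.sql.Encoders

   // Free text ("value1", "value2") is mixed with the <a> element inside the row tag.
   val xml = """<ROW>value1<a>1</a>value2</ROW>"""
   val ds = spark.createDataset(xml :: Nil)(Encoders.STRING)

   val df = spark.read.option("rowTag", "ROW").xml(ds)
   df.printSchema()
   // Expected, assuming the default valueTag name "_VALUE":
   // root
   //  |-- _VALUE: array (nullable = true)
   //  |    |-- element: string (containsNull = true)
   //  |-- a: long (nullable = true)
   ```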
   
   
   ### Why are the changes needed?
   This is a test-only change.
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   This is a test-only change.
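
   For reference, the new cases live in XmlInferSchemaSuite (quoted in the review comments below), so a targeted run should look roughly like this (the standard sbt workflow from the developer-tools page linked above; exact syntax may vary):

   ```
   build/sbt "sql/testOnly *XmlInferSchemaSuite"
   ```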
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No




Re: [PR] [SPARK-47309][SQL] XML: Add schema inference tests for value tags [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #45538:
URL: https://github.com/apache/spark/pull/45538#issuecomment-2008486564

   Merged to master.




Re: [PR] [SPARK-47309][SQL] XML: Add schema inference tests for value tags [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon closed pull request #45538: [SPARK-47309][SQL] XML: Add schema inference tests for value tags
URL: https://github.com/apache/spark/pull/45538




Re: [PR] [SPARK-47309][SQL] XML: Add schema inference tests for value tags [spark]

Posted by "sandip-db (via GitHub)" <gi...@apache.org>.
sandip-db commented on code in PR #45538:
URL: https://github.com/apache/spark/pull/45538#discussion_r1529515496


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/xml/XmlInferSchemaSuite.scala:
##########
@@ -293,4 +305,323 @@ class XmlInferSchemaSuite extends QueryTest with SharedSparkSession with TestXml
     assert(emptyDF.schema === expectedSchema)
   }
 
+  test("nulls in arrays") {
+    val expectedSchema = StructType(
+      StructField(
+        "field1",
+        ArrayType(
+          new StructType()
+            .add("array1", ArrayType(new StructType().add("array2", ArrayType(StringType))))
+        )
+      ) ::
+      StructField(
+        "field2",
+        ArrayType(
+          new StructType()
+            .add("array1", ArrayType(StructType(StructField("Test", LongType) :: Nil)))
+        )
+      ) :: Nil
+    )
+    val expectedAns = Seq(
+      Row(Seq(Row(Seq(Row(Seq("value1", "value2")), Row(null))), Row(null)), null),
+      Row(null, Seq(Row(null), Row(Seq(Row(1), Row(null))))),
+      Row(Seq(Row(null), Row(Seq(Row(null)))), Seq(Row(null)))
+    )
+    val xmlDF = readData(nullsInArrays)
+    assert(xmlDF.schema === expectedSchema)
+    checkAnswer(xmlDF, expectedAns)
+  }
+
+  test("corrupt records: fail fast mode") {
+    // Fail-fast mode is covered by the test case "DSL test for failing fast" in XmlSuite.
+    val schemaOne = StructType(
+      StructField("a", StringType, true) ::
+      StructField("b", StringType, true) ::
+      StructField("c", StringType, true) :: Nil
+    )
+    // `DROPMALFORMED` mode should skip corrupt records
+    val xmlDFOne = readData(corruptRecords, Map("mode" -> "DROPMALFORMED"))
+    checkAnswer(
+      xmlDFOne,
+      Row("1", "2", null) ::
+      Row("str_a_4", "str_b_4", "str_c_4") :: Nil
+    )
+    assert(xmlDFOne.schema === schemaOne)
+  }
+
+  test("turn non-nullable schema into a nullable schema") {
+    // XML field is missing.
+    val missingFieldInput = """<ROW><c1>1</c1></ROW>"""
+    val missingFieldInputDS =
+      spark.createDataset(spark.sparkContext.parallelize(missingFieldInput :: Nil))(Encoders.STRING)
+    // XML field is null.
+    val nullValueInput = """<ROW><c1>1</c1><c2/></ROW>"""
+    val nullValueInputDS =
+      spark.createDataset(spark.sparkContext.parallelize(nullValueInput :: Nil))(Encoders.STRING)
+
+    val schema = StructType(
+      Seq(
+        StructField("c1", IntegerType, nullable = false),
+        StructField("c2", IntegerType, nullable = false)
+      )
+    )
+    val expected = schema.asNullable
+
+    Seq(missingFieldInputDS, nullValueInputDS).foreach { xmlStringDS =>
+      Seq("DROPMALFORMED", "FAILFAST", "PERMISSIVE").foreach { mode =>
+        val df = spark.read
+          .option("mode", mode)
+          .option("rowTag", "ROW")
+          .schema(schema)
+          .xml(xmlStringDS)
+        assert(df.schema == expected)
+        checkAnswer(df, Row(1, null) :: Nil)
+      }
+      withSQLConf(SQLConf.LEGACY_RESPECT_NULLABILITY_IN_TEXT_DATASET_CONVERSION.key -> "true") {
+        checkAnswer(
+          spark.read
+            .schema(
+              StructType(
+                StructField("c1", LongType, nullable = false) ::
+                StructField("c2", LongType, nullable = false) :: Nil
+              )
+            )
+            .option("rowTag", "ROW")
+            .option("mode", "DROPMALFORMED")
+            .xml(xmlStringDS),
+          // This exercises the legacy configuration. It is technically a bug:
+          // the value should be `null`, but the non-nullable schema yields `0`.
+          Row(1, 0)
+        )
+      }
+    }
+  }
+
+  test("XML with partitions") {
+    def makePartition(rdd: RDD[String], parent: File, partName: String, partValue: Any): File = {
+      val p = new File(parent, s"$partName=${partValue.toString}")
+      rdd.saveAsTextFile(p.getCanonicalPath)
+      p
+    }
+
+    withTempPath(root => {
+      withTempView("test_myxml_with_part") {
+        val d1 = new File(root, "d1=1")
+        // root/dt=1/col1=abc
+        makePartition(
+          sparkContext.parallelize(2 to 5).map(i => s"""<ROW><a>1</a><b>str$i</b></ROW>"""),
+          d1,
+          "col1",
+          "abc"
+        )
+
+        // root/dt=1/col1=abd

Review Comment:
   nit:
   ```suggestion
           // root/d1=1/col1=abd
   ```



##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/xml/XmlInferSchemaSuite.scala:
##########
@@ -293,4 +305,323 @@ class XmlInferSchemaSuite extends QueryTest with SharedSparkSession with TestXml
   [95 lines of quoted context, duplicated from the first hunk above, omitted]
+  test("XML with partitions") {
+    def makePartition(rdd: RDD[String], parent: File, partName: String, partValue: Any): File = {
+      val p = new File(parent, s"$partName=${partValue.toString}")
+      rdd.saveAsTextFile(p.getCanonicalPath)
+      p
+    }
+
+    withTempPath(root => {
+      withTempView("test_myxml_with_part") {
+        val d1 = new File(root, "d1=1")
+        // root/dt=1/col1=abc
+        makePartition(
+          sparkContext.parallelize(2 to 5).map(i => s"""<ROW><a>1</a><b>str$i</b></ROW>"""),
+          d1,
+          "col1",
+          "abc"
+        )
+
+        // root/dt=1/col1=abd
+        makePartition(
+          sparkContext.parallelize(6 to 10).map(i => s"""<ROW><a>1</a><c>str$i</c></ROW>"""),
+          d1,
+          "col1",
+          "abd"
+        )
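+        // Partition discovery derives the d1 and col1 columns from the directory names.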
+        val expectedSchema = new StructType()
+          .add("a", LongType)
+          .add("b", StringType)
+          .add("c", StringType)
+          .add("d1", IntegerType)
+          .add("col1", StringType)
+
+        val df = spark.read.option("rowTag", "ROW").xml(root.getAbsolutePath)
+        assert(df.schema === expectedSchema)
+        assert(df.where(col("d1") === 1).where(col("col1") === "abc").select("a").count() == 4)
+        assert(df.where(col("d1") === 1).where(col("col1") === "abd").select("a").count() == 5)
+        assert(df.where(col("d1") === 1).select("a").count() == 9)
+      }
+    })
+  }
+
+  test("value tag - type conflict and root level value tags") {
+    val xmlDF = readData(valueTagsTypeConflict, ignoreSurroundingSpacesOptions)
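+    // Conflicting primitive value-tag types across rows (decimal, boolean, string)
+    // widen to StringType; purely integral value tags that fit in a long stay LongType.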
+    val expectedSchema = new StructType()
+      .add(valueTagName, ArrayType(StringType))
+      .add(
+        "a",
+        new StructType()
+          .add(valueTagName, LongType)
+          .add("b", new StructType().add(valueTagName, StringType).add("c", LongType))
+      )
+    assert(xmlDF.schema == expectedSchema)
+    val expectedAns = Seq(
+      Row(Seq("13.1", "string"), Row(11, Row("true", 1))),
+      Row(Seq("string", "true"), Row(21474836470L, Row("false", 2))),
+      Row(Seq("92233720368547758070"), Row(null, Row("12", 3)))
+    )
+    checkAnswer(xmlDF, expectedAns)
+  }
+
+  test("value tag - spaces and empty values") {
+    val expectedSchema = new StructType()
+      .add(valueTagName, ArrayType(StringType))
+      .add("a", new StructType().add(valueTagName, StringType).add("b", LongType))
+    // Even when the surrounding spaces of character data are not ignored,
+    // whitespace-only text is never captured as a value tag :)
+    val xmlDFWSpaces =
+      readData(emptyValueTags, notIgnoreSurroundingSpacesOptions)
+    val xmlDFWOSpaces = readData(emptyValueTags, ignoreSurroundingSpacesOptions)
+    assert(xmlDFWSpaces.schema == expectedSchema)
+    assert(xmlDFWOSpaces.schema == expectedSchema)
+
+    val expectedAnsWSpaces = Seq(
+      Row(Seq("\n    str1\n    ", "str2\n"), Row(null, 1)),
+      Row(null, Row(" value", null)),
+      Row(null, Row(null, 3)),
+      Row(Seq("\n    str3\n"), Row(null, 4))
+    )
+    checkAnswer(xmlDFWSpaces, expectedAnsWSpaces)
+    val expectedAnsWOSpaces = Seq(
+      Row(Seq("str1", "str2"), Row(null, 1)),
+      Row(null, Row("value", null)),
+      Row(null, Row(null, 3)),
+      Row(Seq("str3"), Row(null, 4))
+    )
+    checkAnswer(xmlDFWOSpaces, expectedAnsWOSpaces)
+  }
+
+  test("value tags - multiple lines") {
+    val xmlDF = readData(multilineValueTags, ignoreSurroundingSpacesOptions)
+    val expectedSchema =
+      new StructType().add(valueTagName, ArrayType(StringType)).add("a", LongType)
+    val expectedAns = Seq(
+      Row(Seq("value1", "value2"), 1),
+      Row(Seq("value3\n    value4"), 1)
+    )
+    assert(xmlDF.schema == expectedSchema)
+    checkAnswer(xmlDF, expectedAns)
+  }
+
+  test("value tags - around structs") {
+    val xmlDF = readData(valueTagsAroundStructs)
+    val expectedSchema = new StructType()
+      .add(valueTagName, ArrayType(StringType))
+      .add(
+        "a",
+        new StructType()
+          .add(valueTagName, ArrayType(StringType))
+          .add("b", new StructType().add(valueTagName, LongType).add("c", LongType))
+      )
+
+    assert(xmlDF.schema == expectedSchema)
+    val expectedAns = Seq(
+      Row(
+        Seq("value1", "value5"),
+        Row(Seq("value2", "value4"), Row(3, 1))
+      ),
+      Row(
+        Seq("value6"),
+        Row(Seq("value4", "value5"), Row(null, null))
+      ),
+      Row(
+        Seq("value1", "value5"),
+        Row(Seq("value2", "value4"), Row(3, null))
+      ),
+      Row(
+        Seq("value1"),
+        Row(Seq("value2", "value4"), Row(3, null))
+      )
+    )
+    checkAnswer(xmlDF, expectedAns)
+  }
+
+  test("value tags - around arrays") {
+    val xmlDF = readData(valueTagsAroundArrays)
+    val expectedSchema = new StructType()
+      .add(valueTagName, ArrayType(StringType))
+      .add(
+        "array1",
+        ArrayType(
+          new StructType()
+            .add(valueTagName, ArrayType(StringType))
+            .add(
+              "array2",
+              ArrayType(new StructType()
+                // The value tag is not of long type due to:
+                // When determining
+                .add(valueTagName, ArrayType(StringType))

Review Comment:
   Update the comment: why is it not LongType?



##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/xml/XmlInferSchemaSuite.scala:
##########
@@ -293,4 +305,323 @@ class XmlInferSchemaSuite extends QueryTest with SharedSparkSession with TestXml
   [95 lines of quoted context, duplicated from the first hunk above, omitted]
+  test("XML with partitions") {
+    def makePartition(rdd: RDD[String], parent: File, partName: String, partValue: Any): File = {
+      val p = new File(parent, s"$partName=${partValue.toString}")
+      rdd.saveAsTextFile(p.getCanonicalPath)
+      p
+    }
+
+    withTempPath(root => {
+      withTempView("test_myxml_with_part") {
+        val d1 = new File(root, "d1=1")
+        // root/dt=1/col1=abc

Review Comment:
   nit:
   ```suggestion
           // root/d1=1/col1=abc
   ```





Re: [PR] [SPARK-47309][SQL] XML: Add schema inference tests for value tags [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #45538:
URL: https://github.com/apache/spark/pull/45538#issuecomment-2008486755

   Merged to master.

