Posted to reviews@spark.apache.org by "shujingyang-db (via GitHub)" <gi...@apache.org> on 2024/01/03 08:01:31 UTC

[PR] [SPARK-46382][SQL] XML: Capture values interspersed between elements follow-up [spark]

shujingyang-db opened a new pull request, #44571:
URL: https://github.com/apache/spark/pull/44571

   
   ### What changes were proposed in this pull request?
   This follow-up simplifies the handling of value tags.
   
   Since value tags only exist in structured data, their handling is confined to the `inferObject` method, eliminating the need for processing in `inferField`. This means that when we encounter non-whitespace characters, we can invoke `inferObject`. Structures whose only field is a primitive value tag are simplified into that primitive type.
   The `inferAndCheckEndElement` function is updated to align with this approach.
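   
   A minimal sketch of the simplification (a hypothetical standalone helper, not the committed code; it assumes the default `_VALUE` value tag from `XmlOptions`):
   
   ```scala
   import org.apache.spark.sql.types._
   
   // Collapse a struct whose only field is a primitive value tag into that
   // primitive type, so e.g. struct<_VALUE: bigint> becomes LongType and
   // <a>123</a> infers as a plain long rather than a struct.
   def simplifyValueTagStruct(st: StructType, valueTag: String = "_VALUE"): DataType = {
     def isPrimitive(dt: DataType): Boolean = dt match {
       case _: StructType | _: ArrayType | _: MapType => false
       case _ => true
     }
     st.fields match {
       case Array(StructField(name, dt, _, _)) if name == valueTag && isPrimitive(dt) => dt
       case _ => st
     }
   }
   ```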
   
   ### Why are the changes needed?
   This follow-up simplifies the handling of value tags.
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   Unit tests
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No




Re: [PR] [SPARK-46382][SQL] XML: Refactor the handling of values interspersed between elements [spark]

Posted by "shujingyang-db (via GitHub)" <gi...@apache.org>.
shujingyang-db commented on code in PR #44571:
URL: https://github.com/apache/spark/pull/44571#discussion_r1443187199


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala:
##########
@@ -175,31 +174,26 @@ class XmlInferSchema(options: XmlOptions, caseSensitive: Boolean)
         parser.nextEvent()
         parser.peek match {
           case _: StartElement => inferObject(parser)
-          case _: EndElement if data.isEmpty => NullType
-          case _: EndElement if options.nullValue == "" => NullType
-          case _: EndElement => StringType
+          case _: EndElement if data.trim.isEmpty =>
+            StaxXmlParserUtils.consumeNextEndElement(parser)
+            NullType
+          case _: EndElement if options.nullValue == "" =>
+            StaxXmlParserUtils.consumeNextEndElement(parser)
+            NullType
+          case _: EndElement =>
+            StaxXmlParserUtils.consumeNextEndElement(parser)
+            StringType
           case _ => inferField(parser)
         }
       case c: Characters if !c.isWhiteSpace =>
-        val characterType = inferFrom(c.getData)
-        parser.nextEvent()
-        parser.peek match {
-          case _: StartElement =>
-            // Some more elements follow;
-            // This is a mix of values and other elements
-            val innerType = inferObject(parser).asInstanceOf[StructType]
-            addOrUpdateValueTagType(innerType, characterType)
-          case _ =>
-            val fieldType = inferField(parser)
-            fieldType match {
-              case st: StructType => addOrUpdateValueTagType(st, characterType)
-              case _: NullType => characterType
-              case _: DataType =>
-                // The field type couldn't be an array type
-                new StructType()
-                .add(options.valueTag, addOrUpdateType(Some(characterType), fieldType))
-
-            }
+        val structType = inferObject(parser).asInstanceOf[StructType]
+        structType match {
+          case simpleType
+              if structType.fields.length == 1
+              && isPrimitiveType(structType.fields.head.dataType)
+              && isValueTagField(structType.fields.head) =>
+            simpleType.fields.head.dataType
+          case _ => structType

Review Comment:
   👍 Added `case _ if structType.fields.isEmpty =>`
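
   With that case added, the match roughly takes this shape (a sketch with the helper predicates passed in explicitly; not the committed code):

   ```scala
   import org.apache.spark.sql.types._

   def collapse(
       structType: StructType,
       isPrimitiveType: DataType => Boolean,
       isValueTagField: StructField => Boolean): DataType = structType match {
     // Nothing was inferred inside the element: treat it as null.
     case _ if structType.fields.isEmpty => NullType
     // A lone primitive value-tag field collapses to its primitive type.
     case simpleType
         if structType.fields.length == 1 &&
            isPrimitiveType(structType.fields.head.dataType) &&
            isValueTagField(structType.fields.head) =>
       simpleType.fields.head.dataType
     case _ => structType
   }
   ```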





Re: [PR] [SPARK-46382][SQL] XML: Refactor the handling of values interspersed between elements [spark]

Posted by "sandip-db (via GitHub)" <gi...@apache.org>.
sandip-db commented on code in PR #44571:
URL: https://github.com/apache/spark/pull/44571#discussion_r1442428458


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala:
##########
@@ -208,56 +202,36 @@ class StaxXmlParser(
     (parser.peek, dataType) match {
       case (_: StartElement, dt: DataType) => convertComplicatedType(dt, attributes)
       case (_: EndElement, _: StringType) =>
+        StaxXmlParserUtils.consumeNextEndElement(parser)

Review Comment:
   ```suggestion
           parser.next
   ```



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala:
##########
@@ -208,56 +202,36 @@ class StaxXmlParser(
     (parser.peek, dataType) match {
       case (_: StartElement, dt: DataType) => convertComplicatedType(dt, attributes)
       case (_: EndElement, _: StringType) =>
+        StaxXmlParserUtils.consumeNextEndElement(parser)
         // Empty. It's null if "" is the null value
         if (options.nullValue == "") {
           null
         } else {
           UTF8String.fromString("")
         }
-      case (_: EndElement, _: DataType) => null
+      case (_: EndElement, _: DataType) =>
+        StaxXmlParserUtils.consumeNextEndElement(parser)
+        null
       case (c: Characters, ArrayType(st, _)) =>
         // For `ArrayType`, it needs to return the type of element. The values are merged later.
         parser.next
-        convertTo(c.getData, st)
-      case (c: Characters, st: StructType) =>
-        parser.next
-        parser.peek match {
-          case _: EndElement =>
-            // It couldn't be an array of value tags
-            // as the opening tag is immediately followed by a closing tag.
-            if (c.isWhiteSpace) {
-              return null
-            }
-            val indexOpt = getFieldNameToIndex(st).get(options.valueTag)
-            indexOpt match {
-              case Some(index) =>
-                convertTo(c.getData, st.fields(index).dataType)
-              case None => null
-            }
-          case _ =>
-            val row = convertObject(parser, st)
-            if (!c.isWhiteSpace) {
-              addOrUpdate(row.toSeq(st).toArray, st, options.valueTag, c.getData, addToTail = false)
-            } else {
-              row
-            }
-        }
+        val value = convertTo(c.getData, st)
+        StaxXmlParserUtils.consumeNextEndElement(parser)

Review Comment:
   The check in this function will fail if the next event is a Comment, CDATA, etc.



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala:
##########
@@ -423,19 +396,22 @@ class StaxXmlParser(
                 }
               } else {
                 StaxXmlParserUtils.skipChildren(parser)
+                StaxXmlParserUtils.consumeNextEndElement(parser)

Review Comment:
   Why not skip the `EndElement` in `skipChildren` itself?



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParserUtils.scala:
##########
@@ -151,6 +151,7 @@ object StaxXmlParserUtils {
         indent > 0
       case _ => true
     })
+    consumeNextEndElement(parser)

Review Comment:
   ```suggestion
           if (indent <= 0) parser.next
           indent > 0
         case _ => true
       })
   ```



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala:
##########
@@ -208,56 +202,36 @@ class StaxXmlParser(
     (parser.peek, dataType) match {
       case (_: StartElement, dt: DataType) => convertComplicatedType(dt, attributes)
       case (_: EndElement, _: StringType) =>
+        StaxXmlParserUtils.consumeNextEndElement(parser)
         // Empty. It's null if "" is the null value
         if (options.nullValue == "") {
           null
         } else {
           UTF8String.fromString("")
         }
-      case (_: EndElement, _: DataType) => null
+      case (_: EndElement, _: DataType) =>
+        StaxXmlParserUtils.consumeNextEndElement(parser)

Review Comment:
   ```suggestion
           parser.next
   ```



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala:
##########
@@ -208,56 +202,36 @@ class StaxXmlParser(
     (parser.peek, dataType) match {
       case (_: StartElement, dt: DataType) => convertComplicatedType(dt, attributes)
       case (_: EndElement, _: StringType) =>
+        StaxXmlParserUtils.consumeNextEndElement(parser)
         // Empty. It's null if "" is the null value
         if (options.nullValue == "") {
           null
         } else {
           UTF8String.fromString("")
         }
-      case (_: EndElement, _: DataType) => null
+      case (_: EndElement, _: DataType) =>
+        StaxXmlParserUtils.consumeNextEndElement(parser)
+        null
       case (c: Characters, ArrayType(st, _)) =>
         // For `ArrayType`, it needs to return the type of element. The values are merged later.
         parser.next
-        convertTo(c.getData, st)
-      case (c: Characters, st: StructType) =>
-        parser.next
-        parser.peek match {
-          case _: EndElement =>
-            // It couldn't be an array of value tags
-            // as the opening tag is immediately followed by a closing tag.
-            if (c.isWhiteSpace) {
-              return null
-            }
-            val indexOpt = getFieldNameToIndex(st).get(options.valueTag)
-            indexOpt match {
-              case Some(index) =>
-                convertTo(c.getData, st.fields(index).dataType)
-              case None => null
-            }
-          case _ =>
-            val row = convertObject(parser, st)
-            if (!c.isWhiteSpace) {
-              addOrUpdate(row.toSeq(st).toArray, st, options.valueTag, c.getData, addToTail = false)
-            } else {
-              row
-            }
-        }
+        val value = convertTo(c.getData, st)
+        StaxXmlParserUtils.consumeNextEndElement(parser)
+        value
+      case (_: Characters, st: StructType) =>
+        convertObject(parser, st)
       case (_: Characters, _: StringType) =>
         convertTo(StaxXmlParserUtils.currentStructureAsString(parser), StringType)
       case (c: Characters, _: DataType) if c.isWhiteSpace =>
         // When `Characters` is found, we need to look further to decide
         // if this is really data or space between other elements.
-        val data = c.getData
         parser.next
-        parser.peek match {
-          case _: StartElement => convertComplicatedType(dataType, attributes)
-          case _: EndElement if data.isEmpty => null
-          case _: EndElement => convertTo(data, dataType)
-          case _ => convertField(parser, dataType, attributes)
-        }
+        convertField(parser, dataType, attributes)
       case (c: Characters, dt: DataType) =>
+        val value = convertTo(c.getData, dt)
         parser.next
-        convertTo(c.getData, dt)
+        StaxXmlParserUtils.consumeNextEndElement(parser)

Review Comment:
   The check in this function will fail if the next event is a Comment, CDATA, etc., e.g.:
   `<ROW><foo attr="a">1<!--This is a comment--></foo></ROW>`



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala:
##########
@@ -423,19 +396,22 @@ class StaxXmlParser(
                 }
               } else {
                 StaxXmlParserUtils.skipChildren(parser)
+                StaxXmlParserUtils.consumeNextEndElement(parser)
               }
           }
         } catch {
           case e: SparkUpgradeException => throw e
           case NonFatal(e) =>
             badRecordException = badRecordException.orElse(Some(e))
+            StaxXmlParserUtils.skipChildren(parser)
+            StaxXmlParserUtils.consumeNextEndElement(parser)

Review Comment:
   This change is not needed.
   This may be a parsing exception; parsing any further will likely result in more exceptions, and `badRecordException` will be lost.







Re: [PR] [SPARK-46382][SQL] XML: Refactor the handling of values interspersed between elements [spark]

Posted by "sandip-db (via GitHub)" <gi...@apache.org>.
sandip-db commented on code in PR #44571:
URL: https://github.com/apache/spark/pull/44571#discussion_r1443381162


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala:
##########
@@ -288,16 +253,15 @@ class XmlInferSchema(options: XmlOptions, caseSensitive: Boolean)
             case dt: DataType => dt
           }
           // Add the field and datatypes so that we can check if this is ArrayType.
-          val field = StaxXmlParserUtils.getName(e.asStartElement.getName, options)
           addOrUpdateType(nameToDataType, field, inferredType)
 
         case c: Characters if !c.isWhiteSpace =>
-          // This can be an attribute-only object
+          // This can be a value tag

Review Comment:
   ```suggestion
             // This is a value tag
   ```



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala:
##########
@@ -603,4 +518,17 @@ class XmlInferSchema(options: XmlOptions, caseSensitive: Boolean)
         newType
     }
   }
+
+  private[xml] def isPrimitiveType(dataType: DataType): Boolean = {
+    dataType match {
+      case _: StructType => false
+      case _: ArrayType => false
+      case _: MapType => false
+      case _ => true
+    }
+  }
+
+  private[xml] def isValueTagField(structField: StructField): Boolean = {
+    structField.name.toLowerCase(Locale.ROOT) == options.valueTag.toLowerCase(Locale.ROOT)

Review Comment:
   Shouldn't this be based on the `caseSensitive` option?
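
   For instance, a `caseSensitive`-aware variant might look like this (a sketch; `caseSensitive` and `options.valueTag` as defined in `XmlInferSchema`):

   ```scala
   // Compare the field name against the value tag, honoring case sensitivity.
   def isValueTagField(name: String, valueTag: String, caseSensitive: Boolean): Boolean =
     if (caseSensitive) name == valueTag else name.equalsIgnoreCase(valueTag)
   ```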





Re: [PR] [SPARK-46382][SQL] XML: Refactor inference and parsing [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #44571:
URL: https://github.com/apache/spark/pull/44571#issuecomment-1879900195

   https://github.com/apache/spark/pull/44601 had a conflict with this PR. Would you mind resolving the conflicts, please?




Re: [PR] [SPARK-46382][SQL] XML: Refactor the handling of values interspersed between elements [spark]

Posted by "shujingyang-db (via GitHub)" <gi...@apache.org>.
shujingyang-db commented on code in PR #44571:
URL: https://github.com/apache/spark/pull/44571#discussion_r1443292272


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/xml/XmlSuite.scala:
##########
@@ -2561,10 +2575,10 @@ class XmlSuite extends QueryTest with SharedSparkSession {
          |            </array1>
          |            value10
          |            <array1>
-         |                <struct3>
+         |                <struct3><!--A comment within tags-->
          |                    <array2>3</array2>
          |                    value11
-         |                    <array2>4</array2>
+         |                    <array2>4</array2><!--A comment within tags-->
          |                </struct3>
          |                <string>string</string>
          |                value12

Review Comment:
   Added 👍





Re: [PR] [SPARK-46382][SQL] XML: Refactor inference and parsing [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #44571:
URL: https://github.com/apache/spark/pull/44571#issuecomment-1879899986

   Merged to master.




Re: [PR] [SPARK-46382][SQL] XML: Refactor the handling of values interspersed between elements [spark]

Posted by "shujingyang-db (via GitHub)" <gi...@apache.org>.
shujingyang-db commented on code in PR #44571:
URL: https://github.com/apache/spark/pull/44571#discussion_r1442456803


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala:
##########
@@ -208,56 +202,36 @@ class StaxXmlParser(
     (parser.peek, dataType) match {
       case (_: StartElement, dt: DataType) => convertComplicatedType(dt, attributes)
       case (_: EndElement, _: StringType) =>
+        StaxXmlParserUtils.consumeNextEndElement(parser)
         // Empty. It's null if "" is the null value
         if (options.nullValue == "") {
           null
         } else {
           UTF8String.fromString("")
         }
-      case (_: EndElement, _: DataType) => null
+      case (_: EndElement, _: DataType) =>
+        StaxXmlParserUtils.consumeNextEndElement(parser)
+        null
       case (c: Characters, ArrayType(st, _)) =>
         // For `ArrayType`, it needs to return the type of element. The values are merged later.
         parser.next
-        convertTo(c.getData, st)
-      case (c: Characters, st: StructType) =>
-        parser.next
-        parser.peek match {
-          case _: EndElement =>
-            // It couldn't be an array of value tags
-            // as the opening tag is immediately followed by a closing tag.
-            if (c.isWhiteSpace) {
-              return null
-            }
-            val indexOpt = getFieldNameToIndex(st).get(options.valueTag)
-            indexOpt match {
-              case Some(index) =>
-                convertTo(c.getData, st.fields(index).dataType)
-              case None => null
-            }
-          case _ =>
-            val row = convertObject(parser, st)
-            if (!c.isWhiteSpace) {
-              addOrUpdate(row.toSeq(st).toArray, st, options.valueTag, c.getData, addToTail = false)
-            } else {
-              row
-            }
-        }
+        val value = convertTo(c.getData, st)
+        StaxXmlParserUtils.consumeNextEndElement(parser)

Review Comment:
   Got it! If the next event is a Comment, CDATA, etc., we want to skip those events as well, and then the end element.





Re: [PR] [SPARK-46382][SQL] XML: Refactor the handling of values interspersed between elements [spark]

Posted by "shujingyang-db (via GitHub)" <gi...@apache.org>.
shujingyang-db commented on code in PR #44571:
URL: https://github.com/apache/spark/pull/44571#discussion_r1443324753


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParserUtils.scala:
##########
@@ -178,4 +179,11 @@ object StaxXmlParserUtils {
       }
     }
   }
+
+  def consumeNextEndElement(parser: XMLEventReader): Unit = {

Review Comment:
   Added an assertion on the `endElementName`.
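
   Something along these lines (a sketch of the assertion, not the exact committed helper; `expectedName` would come from the matching `StartElement`):

   ```scala
   import javax.xml.stream.XMLEventReader
   import javax.xml.stream.events.EndElement

   def consumeNextEndElement(parser: XMLEventReader, expectedName: String): Unit =
     parser.nextEvent() match {
       case e: EndElement if e.getName.getLocalPart == expectedName => // as expected
       case other =>
         throw new IllegalStateException(s"Expected </$expectedName> but found $other")
     }
   ```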





Re: [PR] [SPARK-46382][SQL] XML: Refactor the handling of values interspersed between elements [spark]

Posted by "sandip-db (via GitHub)" <gi...@apache.org>.
sandip-db commented on code in PR #44571:
URL: https://github.com/apache/spark/pull/44571#discussion_r1443212881


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParserUtils.scala:
##########
@@ -178,4 +179,11 @@ object StaxXmlParserUtils {
       }
     }
   }
+
+  def consumeNextEndElement(parser: XMLEventReader): Unit = {
+    parser.nextEvent() match {
+      case _: EndElement => // do nothing
+      case _ => throw new IllegalStateException("Invalid state")
+    }

Review Comment:
   You are right. We are filtering comments [here](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParserUtils.scala#L43). Do we need special handling for CDATA or EntityReferences?





Re: [PR] [SPARK-46382][SQL] XML: Refactor inference and parsing [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #44571:
URL: https://github.com/apache/spark/pull/44571#issuecomment-1880225723

   Merged to master.




Re: [PR] [SPARK-46382][SQL] XML: Refactor the handling of values interspersed between elements [spark]

Posted by "sandip-db (via GitHub)" <gi...@apache.org>.
sandip-db commented on code in PR #44571:
URL: https://github.com/apache/spark/pull/44571#discussion_r1440997050


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala:
##########
@@ -235,22 +227,6 @@ class XmlInferSchema(options: XmlOptions, caseSensitive: Boolean)
     val nameToDataType =
       collection.mutable.TreeMap.empty[String, DataType](caseSensitivityOrdering)
 

Review Comment:
   Define `firstEventIsStartElement` here:
   
   `val firstEventIsStartElement = parser.peek().isStartElement`



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala:
##########
@@ -288,16 +265,17 @@ class XmlInferSchema(options: XmlOptions, caseSensitive: Boolean)
             case dt: DataType => dt
           }
           // Add the field and datatypes so that we can check if this is ArrayType.
-          val field = StaxXmlParserUtils.getName(e.asStartElement.getName, options)
           addOrUpdateType(nameToDataType, field, inferredType)
 
         case c: Characters if !c.isWhiteSpace =>
-          // This can be an attribute-only object
+          // This can be a value tag
           val valueTagType = inferFrom(c.getData)
           addOrUpdateType(nameToDataType, options.valueTag, valueTagType)
 
-        case _: EndElement =>
-          shouldStop = inferAndCheckEndElement(parser)
+        case e: EndElement =>
+          // In case of corrupt records, we shouldn't read beyond the EndDocument
+          shouldStop = parser.peek().isInstanceOf[EndDocument] || StaxXmlParserUtils
+              .getName(e.getName, options) == startElementName

Review Comment:
   ```suggestion
             shouldStop = if (firstEventIsStartElement) {
                          StaxXmlParserUtils.checkEndElement(parser)
                         } else {
                           true
                         }
   ```



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala:
##########
@@ -211,6 +202,7 @@ class XmlInferSchema(options: XmlOptions, caseSensitive: Boolean)
    */
   private def inferObject(
       parser: XMLEventReader,
+      startElementName: String,

Review Comment:
   This won't be necessary after the change below.





Re: [PR] [SPARK-46382][SQL] XML: Refactor the handling of values interspersed between elements [spark]

Posted by "shujingyang-db (via GitHub)" <gi...@apache.org>.
shujingyang-db commented on code in PR #44571:
URL: https://github.com/apache/spark/pull/44571#discussion_r1442455921


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala:
##########
@@ -423,19 +396,22 @@ class StaxXmlParser(
                 }
               } else {
                 StaxXmlParserUtils.skipChildren(parser)
+                StaxXmlParserUtils.consumeNextEndElement(parser)

Review Comment:
   `skipChildren` is called recursively. We just want to skip the `EndElement` after skipping all of its children.
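
   In other words, the combined effect is roughly this depth-based skip (a sketch of the intent, not the actual `skipChildren` implementation):

   ```scala
   import javax.xml.stream.XMLEventReader
   import javax.xml.stream.events.{EndElement, StartElement}

   // Assumes the element's StartElement has just been consumed: skip all
   // children, then consume the element's own EndElement.
   def skipElementAndEnd(parser: XMLEventReader): Unit = {
     var depth = 1
     while (depth > 0) {
       parser.nextEvent() match {
         case _: StartElement => depth += 1
         case _: EndElement   => depth -= 1 // the last one is the element's own
         case _               => // characters, comments, etc. are skipped
       }
     }
   }
   ```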





Re: [PR] [SPARK-46382][SQL] XML: Refactor the handling of values interspersed between elements [spark]

Posted by "sandip-db (via GitHub)" <gi...@apache.org>.
sandip-db commented on code in PR #44571:
URL: https://github.com/apache/spark/pull/44571#discussion_r1443135604


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/xml/XmlSuite.scala:
##########
@@ -2561,10 +2575,10 @@ class XmlSuite extends QueryTest with SharedSparkSession {
          |            </array1>
          |            value10
          |            <array1>
-         |                <struct3>
+         |                <struct3><!--A comment within tags-->
          |                    <array2>3</array2>
          |                    value11
-         |                    <array2>4</array2>
+         |                    <array2>4</array2><!--A comment within tags-->

Review Comment:
   Adding multiple comments with spaces between them:
   ```suggestion
            |                    <array2>4</array2> <!--First comment--> <!--Second comment--> 
   ```



##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/xml/XmlSuite.scala:
##########
@@ -2561,10 +2575,10 @@ class XmlSuite extends QueryTest with SharedSparkSession {
          |            </array1>
          |            value10
          |            <array1>
-         |                <struct3>
+         |                <struct3><!--A comment within tags-->
          |                    <array2>3</array2>
          |                    value11
-         |                    <array2>4</array2>
+         |                    <array2>4</array2><!--A comment within tags-->
          |                </struct3>
          |                <string>string</string>
          |                value12

Review Comment:
   GitHub doesn't allow adding a review comment to an arbitrary line, so commenting here.
   
   Add a comment between `value16` and `</ROW>` with a space before and after the comment.



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParserUtils.scala:
##########
@@ -178,4 +179,11 @@ object StaxXmlParserUtils {
       }
     }
   }
+
+  def consumeNextEndElement(parser: XMLEventReader): Unit = {

Review Comment:
   For defensive purpose, consider passing the name of the `StartElement` and assert that it matches with the `EndElement`:
   
   ```suggestion
     def skipNextEndElement(parser: XMLEventReader): Unit = {
   ```



##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/xml/XmlSuite.scala:
##########
@@ -2550,7 +2564,7 @@ class XmlSuite extends QueryTest with SharedSparkSession {
          |                value4
          |                <struct3>
          |                    value5
-         |                    <array2>1</array2>
+         |                    <array2>1<!--A comment within tags--></array2>

Review Comment:
   Adding multiple comments with spaces between them:
   ```suggestion
            |                    <array2>1 <!--First comment--> <!--Second comment--> </array2>
   ```



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala:
##########
@@ -288,16 +253,15 @@ class XmlInferSchema(options: XmlOptions, caseSensitive: Boolean)
             case dt: DataType => dt
           }
           // Add the field and datatypes so that we can check if this is ArrayType.
-          val field = StaxXmlParserUtils.getName(e.asStartElement.getName, options)
           addOrUpdateType(nameToDataType, field, inferredType)
 
         case c: Characters if !c.isWhiteSpace =>
-          // This can be an attribute-only object
+          // This can be a value tag
           val valueTagType = inferFrom(c.getData)
           addOrUpdateType(nameToDataType, options.valueTag, valueTagType)
 
         case _: EndElement =>

Review Comment:
   Is `EndDocument` needed here as well?



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParserUtils.scala:
##########
@@ -178,4 +179,11 @@ object StaxXmlParserUtils {
       }
     }
   }
+
+  def consumeNextEndElement(parser: XMLEventReader): Unit = {
+    parser.nextEvent() match {
+      case _: EndElement => // do nothing
+      case _ => throw new IllegalStateException("Invalid state")
+    }

Review Comment:
   Extra events (comments, etc.) before EndElement need to be skipped.
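
   If such events can reach the reader at this point, one defensive option would be to drain them first (a sketch, not the committed fix):

   ```scala
   import javax.xml.stream.XMLEventReader
   import javax.xml.stream.events.EndElement

   // Drop comments, whitespace, etc. until the EndElement itself is consumed.
   def skipToEndElement(parser: XMLEventReader): Unit =
     while (!parser.nextEvent().isInstanceOf[EndElement]) {}
   ```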





Re: [PR] [SPARK-46382][SQL] XML: Refactor the handling of values interspersed between elements [spark]

Posted by "shujingyang-db (via GitHub)" <gi...@apache.org>.
shujingyang-db commented on code in PR #44571:
URL: https://github.com/apache/spark/pull/44571#discussion_r1442398779


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala:
##########
@@ -175,31 +174,26 @@ class XmlInferSchema(options: XmlOptions, caseSensitive: Boolean)
         parser.nextEvent()
         parser.peek match {
           case _: StartElement => inferObject(parser)
-          case _: EndElement if data.isEmpty => NullType
-          case _: EndElement if options.nullValue == "" => NullType
-          case _: EndElement => StringType
+          case _: EndElement if data.trim.isEmpty =>

Review Comment:
   `data.trim.isEmpty` isn't always true: we do a `parser.nextEvent()` in between, so `c.isWhiteSpace` was true for the last event, while the check here applies to the current event.





Re: [PR] [SPARK-46382][SQL] XML: Refactor the handling of values interspersed between elements [spark]

Posted by "shujingyang-db (via GitHub)" <gi...@apache.org>.
shujingyang-db commented on code in PR #44571:
URL: https://github.com/apache/spark/pull/44571#discussion_r1443186772


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParserUtils.scala:
##########
@@ -178,4 +179,11 @@ object StaxXmlParserUtils {
       }
     }
   }
+
+  def consumeNextEndElement(parser: XMLEventReader): Unit = {
+    parser.nextEvent() match {
+      case _: EndElement => // do nothing
+      case _ => throw new IllegalStateException("Invalid state")
+    }

Review Comment:
   I think comments are removed before they reach the parser. WDYT?





Re: [PR] [SPARK-46382][SQL] XML: Refactor inference and parsing [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon closed pull request #44571: [SPARK-46382][SQL] XML: Refactor inference and parsing
URL: https://github.com/apache/spark/pull/44571




Re: [PR] [SPARK-46382][SQL] XML: Refactor the handling of values interspersed between elements [spark]

Posted by "shujingyang-db (via GitHub)" <gi...@apache.org>.
shujingyang-db commented on code in PR #44571:
URL: https://github.com/apache/spark/pull/44571#discussion_r1441349590


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala:
##########
@@ -288,16 +265,17 @@ class XmlInferSchema(options: XmlOptions, caseSensitive: Boolean)
             case dt: DataType => dt
           }
           // Add the field and datatypes so that we can check if this is ArrayType.
-          val field = StaxXmlParserUtils.getName(e.asStartElement.getName, options)
           addOrUpdateType(nameToDataType, field, inferredType)
 
         case c: Characters if !c.isWhiteSpace =>
-          // This can be an attribute-only object
+          // This can be a value tag
           val valueTagType = inferFrom(c.getData)
           addOrUpdateType(nameToDataType, options.valueTag, valueTagType)
 
-        case _: EndElement =>
-          shouldStop = inferAndCheckEndElement(parser)
+        case e: EndElement =>
+          // In case of corrupt records, we shouldn't read beyond the EndDocument
+          shouldStop = parser.peek().isInstanceOf[EndDocument] || StaxXmlParserUtils
+              .getName(e.getName, options) == startElementName

Review Comment:
   We discussed offline and decided to simplify to `shouldStop = true`. With that, we will also make sure that the entire entry, including the start tag, value, and end tag, is fully consumed when parsing completes.





Re: [PR] [SPARK-46382][SQL] XML: Refactor the handling of values interspersed between elements [spark]

Posted by "shujingyang-db (via GitHub)" <gi...@apache.org>.
shujingyang-db commented on code in PR #44571:
URL: https://github.com/apache/spark/pull/44571#discussion_r1443324939


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/xml/XmlSuite.scala:
##########
@@ -2561,10 +2575,10 @@ class XmlSuite extends QueryTest with SharedSparkSession {
          |            </array1>
          |            value10
          |            <array1>
-         |                <struct3>
+         |                <struct3><!--A comment within tags-->
          |                    <array2>3</array2>
          |                    value11
-         |                    <array2>4</array2>
+         |                    <array2>4</array2><!--A comment within tags-->

Review Comment:
   Added 👍



##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/xml/XmlSuite.scala:
##########
@@ -2550,7 +2564,7 @@ class XmlSuite extends QueryTest with SharedSparkSession {
          |                value4
          |                <struct3>
          |                    value5
-         |                    <array2>1</array2>
+         |                    <array2>1<!--A comment within tags--></array2>

Review Comment:
   Added 👍





Re: [PR] [SPARK-46382][SQL] XML: Refactor the handling of values interspersed between elements [spark]

Posted by "shujingyang-db (via GitHub)" <gi...@apache.org>.
shujingyang-db commented on code in PR #44571:
URL: https://github.com/apache/spark/pull/44571#discussion_r1443324597


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParserUtils.scala:
##########
@@ -178,4 +179,11 @@ object StaxXmlParserUtils {
       }
     }
   }
+
+  def consumeNextEndElement(parser: XMLEventReader): Unit = {
+    parser.nextEvent() match {
+      case _: EndElement => // do nothing
+      case _ => throw new IllegalStateException("Invalid state")
+    }

Review Comment:
   Discussed offline. The current implementation should work fine with `CDATA`, and we will filter out `EntityReferences`.





Re: [PR] [SPARK-46382][SQL] XML: Refactor the handling of values interspersed between elements [spark]

Posted by "sandip-db (via GitHub)" <gi...@apache.org>.
sandip-db commented on code in PR #44571:
URL: https://github.com/apache/spark/pull/44571#discussion_r1442378103


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala:
##########
@@ -175,31 +174,26 @@ class XmlInferSchema(options: XmlOptions, caseSensitive: Boolean)
         parser.nextEvent()
         parser.peek match {

Review Comment:
   This case (`case c: Characters if c.isWhiteSpace`) can be combined with the next one, with a minor change suggested below.



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala:
##########
@@ -175,31 +174,26 @@ class XmlInferSchema(options: XmlOptions, caseSensitive: Boolean)
         parser.nextEvent()
         parser.peek match {
           case _: StartElement => inferObject(parser)
-          case _: EndElement if data.isEmpty => NullType
-          case _: EndElement if options.nullValue == "" => NullType
-          case _: EndElement => StringType
+          case _: EndElement if data.trim.isEmpty =>

Review Comment:
   Won't `data.trim.isEmpty` always be true, given that `c.isWhiteSpace` is true?



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala:
##########
@@ -175,31 +174,26 @@ class XmlInferSchema(options: XmlOptions, caseSensitive: Boolean)
         parser.nextEvent()
         parser.peek match {
           case _: StartElement => inferObject(parser)
-          case _: EndElement if data.isEmpty => NullType
-          case _: EndElement if options.nullValue == "" => NullType
-          case _: EndElement => StringType
+          case _: EndElement if data.trim.isEmpty =>
+            StaxXmlParserUtils.consumeNextEndElement(parser)
+            NullType
+          case _: EndElement if options.nullValue == "" =>

Review Comment:
   Given that `c.isWhiteSpace` is true, the previous case will match for `EndElement`, so this one will never be reached.



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala:
##########
@@ -175,31 +174,26 @@ class XmlInferSchema(options: XmlOptions, caseSensitive: Boolean)
         parser.nextEvent()
         parser.peek match {
           case _: StartElement => inferObject(parser)
-          case _: EndElement if data.isEmpty => NullType
-          case _: EndElement if options.nullValue == "" => NullType
-          case _: EndElement => StringType
+          case _: EndElement if data.trim.isEmpty =>
+            StaxXmlParserUtils.consumeNextEndElement(parser)
+            NullType
+          case _: EndElement if options.nullValue == "" =>
+            StaxXmlParserUtils.consumeNextEndElement(parser)
+            NullType
+          case _: EndElement =>
+            StaxXmlParserUtils.consumeNextEndElement(parser)
+            StringType
           case _ => inferField(parser)
         }
       case c: Characters if !c.isWhiteSpace =>
-        val characterType = inferFrom(c.getData)
-        parser.nextEvent()
-        parser.peek match {
-          case _: StartElement =>
-            // Some more elements follow;
-            // This is a mix of values and other elements
-            val innerType = inferObject(parser).asInstanceOf[StructType]
-            addOrUpdateValueTagType(innerType, characterType)
-          case _ =>
-            val fieldType = inferField(parser)
-            fieldType match {
-              case st: StructType => addOrUpdateValueTagType(st, characterType)
-              case _: NullType => characterType
-              case _: DataType =>
-                // The field type couldn't be an array type
-                new StructType()
-                .add(options.valueTag, addOrUpdateType(Some(characterType), fieldType))
-
-            }
+        val structType = inferObject(parser).asInstanceOf[StructType]
+        structType match {
+          case simpleType
+              if structType.fields.length == 1
+              && isPrimitiveType(structType.fields.head.dataType)
+              && isValueTagField(structType.fields.head) =>
+            simpleType.fields.head.dataType
+          case _ => structType

Review Comment:
   Remove the case (`case c: Characters if c.isWhiteSpace`) and add the following case here:
   ```suggestion
             case simpleType if structType.isEmpty => NullType
             case _ => structType
   ```





Re: [PR] [SPARK-46382][SQL] XML: Refactor inference and parsing [spark]

Posted by "shujingyang-db (via GitHub)" <gi...@apache.org>.
shujingyang-db commented on code in PR #44571:
URL: https://github.com/apache/spark/pull/44571#discussion_r1443624916


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala:
##########
@@ -423,19 +396,22 @@ class StaxXmlParser(
                 }
               } else {
                 StaxXmlParserUtils.skipChildren(parser)
+                StaxXmlParserUtils.consumeNextEndElement(parser)
               }
           }
         } catch {
           case e: SparkUpgradeException => throw e
           case NonFatal(e) =>
             badRecordException = badRecordException.orElse(Some(e))
+            StaxXmlParserUtils.skipChildren(parser)
+            StaxXmlParserUtils.consumeNextEndElement(parser)

Review Comment:
   Discussed offline. The current implementation doesn't support partial results when there's a bad-record exception, so I will remove these lines.


