Posted to reviews@spark.apache.org by "shujingyang-db (via GitHub)" <gi...@apache.org> on 2024/01/03 08:01:31 UTC

[PR] [SPARK-46382][SQL] XML: Capture values interspersed between elements follow-up [spark]

shujingyang-db opened a new pull request, #44571:
URL: https://github.com/apache/spark/pull/44571

   
   ### What changes were proposed in this pull request?
   This follow-up simplifies the handling of value tags.
   
   Since value tags only exist in structured data, their handling is confined to the `inferObject` method, eliminating the need for processing in `inferField`. This means that when we encounter non-whitespace characters, we can invoke `inferObject`. Structures whose only field is a primitive value tag are simplified into that primitive type.
   The `inferAndCheckEndElement` function is updated to align with this approach.
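   
   A minimal sketch of the simplification (a hypothetical standalone helper, not the committed code; it assumes the default `_VALUE` value tag from `XmlOptions`):
   
   ```scala
   import org.apache.spark.sql.types._
   
   // Collapse a struct whose only field is a primitive value tag into that
   // primitive type, so e.g. struct<_VALUE: bigint> becomes LongType and
   // <a>123</a> infers as a plain long rather than a struct.
   def simplifyValueTagStruct(st: StructType, valueTag: String = "_VALUE"): DataType = {
     def isPrimitive(dt: DataType): Boolean = dt match {
       case _: StructType | _: ArrayType | _: MapType => false
       case _ => true
     }
     st.fields match {
       case Array(StructField(name, dt, _, _)) if name == valueTag && isPrimitive(dt) => dt
       case _ => st
     }
   }
   ```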
   
   ### Why are the changes needed?
   This follow-up simplifies the handling of value tags.
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   Unit tests
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No




Re: [PR] [SPARK-46382][SQL] XML: Refactor the handling of values interspersed between elements [spark]

Posted by "shujingyang-db (via GitHub)" <gi...@apache.org>.
shujingyang-db commented on code in PR #44571:
URL: https://github.com/apache/spark/pull/44571#discussion_r1443187199


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala:
##########
@@ -175,31 +174,26 @@ class XmlInferSchema(options: XmlOptions, caseSensitive: Boolean)
         parser.nextEvent()
         parser.peek match {
           case _: StartElement => inferObject(parser)
-          case _: EndElement if data.isEmpty => NullType
-          case _: EndElement if options.nullValue == "" => NullType
-          case _: EndElement => StringType
+          case _: EndElement if data.trim.isEmpty =>
+            StaxXmlParserUtils.consumeNextEndElement(parser)
+            NullType
+          case _: EndElement if options.nullValue == "" =>
+            StaxXmlParserUtils.consumeNextEndElement(parser)
+            NullType
+          case _: EndElement =>
+            StaxXmlParserUtils.consumeNextEndElement(parser)
+            StringType
           case _ => inferField(parser)
         }
       case c: Characters if !c.isWhiteSpace =>
-        val characterType = inferFrom(c.getData)
-        parser.nextEvent()
-        parser.peek match {
-          case _: StartElement =>
-            // Some more elements follow;
-            // This is a mix of values and other elements
-            val innerType = inferObject(parser).asInstanceOf[StructType]
-            addOrUpdateValueTagType(innerType, characterType)
-          case _ =>
-            val fieldType = inferField(parser)
-            fieldType match {
-              case st: StructType => addOrUpdateValueTagType(st, characterType)
-              case _: NullType => characterType
-              case _: DataType =>
-                // The field type couldn't be an array type
-                new StructType()
-                .add(options.valueTag, addOrUpdateType(Some(characterType), fieldType))
-
-            }
+        val structType = inferObject(parser).asInstanceOf[StructType]
+        structType match {
+          case simpleType
+              if structType.fields.length == 1
+              && isPrimitiveType(structType.fields.head.dataType)
+              && isValueTagField(structType.fields.head) =>
+            simpleType.fields.head.dataType
+          case _ => structType

Review Comment:
   👍 Added `case _ if structType.fields.isEmpty =>`
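
   With that case added, the match roughly takes this shape (a sketch with the helper predicates passed in explicitly; not the committed code):

   ```scala
   import org.apache.spark.sql.types._

   def collapse(
       structType: StructType,
       isPrimitiveType: DataType => Boolean,
       isValueTagField: StructField => Boolean): DataType = structType match {
     // Nothing was inferred inside the element: treat it as null.
     case _ if structType.fields.isEmpty => NullType
     // A lone primitive value-tag field collapses to its primitive type.
     case simpleType
         if structType.fields.length == 1 &&
            isPrimitiveType(structType.fields.head.dataType) &&
            isValueTagField(structType.fields.head) =>
       simpleType.fields.head.dataType
     case _ => structType
   }
   ```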





Re: [PR] [SPARK-46382][SQL] XML: Refactor the handling of values interspersed between elements [spark]

Posted by "sandip-db (via GitHub)" <gi...@apache.org>.
sandip-db commented on code in PR #44571:
URL: https://github.com/apache/spark/pull/44571#discussion_r1442428458


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala:
##########
@@ -208,56 +202,36 @@ class StaxXmlParser(
     (parser.peek, dataType) match {
       case (_: StartElement, dt: DataType) => convertComplicatedType(dt, attributes)
       case (_: EndElement, _: StringType) =>
+        StaxXmlParserUtils.consumeNextEndElement(parser)

Review Comment:
   ```suggestion
           parser.next
   ```



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala:
##########
@@ -208,56 +202,36 @@ class StaxXmlParser(
     (parser.peek, dataType) match {
       case (_: StartElement, dt: DataType) => convertComplicatedType(dt, attributes)
       case (_: EndElement, _: StringType) =>
+        StaxXmlParserUtils.consumeNextEndElement(parser)
         // Empty. It's null if "" is the null value
         if (options.nullValue == "") {
           null
         } else {
           UTF8String.fromString("")
         }
-      case (_: EndElement, _: DataType) => null
+      case (_: EndElement, _: DataType) =>
+        StaxXmlParserUtils.consumeNextEndElement(parser)
+        null
       case (c: Characters, ArrayType(st, _)) =>
         // For `ArrayType`, it needs to return the type of element. The values are merged later.
         parser.next
-        convertTo(c.getData, st)
-      case (c: Characters, st: StructType) =>
-        parser.next
-        parser.peek match {
-          case _: EndElement =>
-            // It couldn't be an array of value tags
-            // as the opening tag is immediately followed by a closing tag.
-            if (c.isWhiteSpace) {
-              return null
-            }
-            val indexOpt = getFieldNameToIndex(st).get(options.valueTag)
-            indexOpt match {
-              case Some(index) =>
-                convertTo(c.getData, st.fields(index).dataType)
-              case None => null
-            }
-          case _ =>
-            val row = convertObject(parser, st)
-            if (!c.isWhiteSpace) {
-              addOrUpdate(row.toSeq(st).toArray, st, options.valueTag, c.getData, addToTail = false)
-            } else {
-              row
-            }
-        }
+        val value = convertTo(c.getData, st)
+        StaxXmlParserUtils.consumeNextEndElement(parser)

Review Comment:
   The check in this function will fail if the next event is a Comment, CDATA, etc.



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala:
##########
@@ -423,19 +396,22 @@ class StaxXmlParser(
                 }
               } else {
                 StaxXmlParserUtils.skipChildren(parser)
+                StaxXmlParserUtils.consumeNextEndElement(parser)

Review Comment:
   Why not skip the `EndElement` in `skipChildren` itself?



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParserUtils.scala:
##########
@@ -151,6 +151,7 @@ object StaxXmlParserUtils {
         indent > 0
       case _ => true
     })
+    consumeNextEndElement(parser)

Review Comment:
   ```suggestion
           if (indent <= 0) parser.next
           indent > 0
         case _ => true
       })
   ```



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala:
##########
@@ -208,56 +202,36 @@ class StaxXmlParser(
     (parser.peek, dataType) match {
       case (_: StartElement, dt: DataType) => convertComplicatedType(dt, attributes)
       case (_: EndElement, _: StringType) =>
+        StaxXmlParserUtils.consumeNextEndElement(parser)
         // Empty. It's null if "" is the null value
         if (options.nullValue == "") {
           null
         } else {
           UTF8String.fromString("")
         }
-      case (_: EndElement, _: DataType) => null
+      case (_: EndElement, _: DataType) =>
+        StaxXmlParserUtils.consumeNextEndElement(parser)

Review Comment:
   ```suggestion
           parser.next
   ```



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala:
##########
@@ -208,56 +202,36 @@ class StaxXmlParser(
     (parser.peek, dataType) match {
       case (_: StartElement, dt: DataType) => convertComplicatedType(dt, attributes)
       case (_: EndElement, _: StringType) =>
+        StaxXmlParserUtils.consumeNextEndElement(parser)
         // Empty. It's null if "" is the null value
         if (options.nullValue == "") {
           null
         } else {
           UTF8String.fromString("")
         }
-      case (_: EndElement, _: DataType) => null
+      case (_: EndElement, _: DataType) =>
+        StaxXmlParserUtils.consumeNextEndElement(parser)
+        null
       case (c: Characters, ArrayType(st, _)) =>
         // For `ArrayType`, it needs to return the type of element. The values are merged later.
         parser.next
-        convertTo(c.getData, st)
-      case (c: Characters, st: StructType) =>
-        parser.next
-        parser.peek match {
-          case _: EndElement =>
-            // It couldn't be an array of value tags
-            // as the opening tag is immediately followed by a closing tag.
-            if (c.isWhiteSpace) {
-              return null
-            }
-            val indexOpt = getFieldNameToIndex(st).get(options.valueTag)
-            indexOpt match {
-              case Some(index) =>
-                convertTo(c.getData, st.fields(index).dataType)
-              case None => null
-            }
-          case _ =>
-            val row = convertObject(parser, st)
-            if (!c.isWhiteSpace) {
-              addOrUpdate(row.toSeq(st).toArray, st, options.valueTag, c.getData, addToTail = false)
-            } else {
-              row
-            }
-        }
+        val value = convertTo(c.getData, st)
+        StaxXmlParserUtils.consumeNextEndElement(parser)
+        value
+      case (_: Characters, st: StructType) =>
+        convertObject(parser, st)
       case (_: Characters, _: StringType) =>
         convertTo(StaxXmlParserUtils.currentStructureAsString(parser), StringType)
       case (c: Characters, _: DataType) if c.isWhiteSpace =>
         // When `Characters` is found, we need to look further to decide
         // if this is really data or space between other elements.
-        val data = c.getData
         parser.next
-        parser.peek match {
-          case _: StartElement => convertComplicatedType(dataType, attributes)
-          case _: EndElement if data.isEmpty => null
-          case _: EndElement => convertTo(data, dataType)
-          case _ => convertField(parser, dataType, attributes)
-        }
+        convertField(parser, dataType, attributes)
       case (c: Characters, dt: DataType) =>
+        val value = convertTo(c.getData, dt)
         parser.next
-        convertTo(c.getData, dt)
+        StaxXmlParserUtils.consumeNextEndElement(parser)

Review Comment:
   The check in this function will fail if the next event is a Comment, CDATA, etc., e.g.:
   `<ROW><foo attr="a">1<!--This is a comment--></foo></ROW>`



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala:
##########
@@ -423,19 +396,22 @@ class StaxXmlParser(
                 }
               } else {
                 StaxXmlParserUtils.skipChildren(parser)
+                StaxXmlParserUtils.consumeNextEndElement(parser)
               }
           }
         } catch {
           case e: SparkUpgradeException => throw e
           case NonFatal(e) =>
             badRecordException = badRecordException.orElse(Some(e))
+            StaxXmlParserUtils.skipChildren(parser)
+            StaxXmlParserUtils.consumeNextEndElement(parser)

Review Comment:
   This change is not needed.
   This may be a parsing exception; parsing any further will likely result in more exceptions, and `badRecordException` will be lost.







Re: [PR] [SPARK-46382][SQL] XML: Refactor the handling of values interspersed between elements [spark]

Posted by "sandip-db (via GitHub)" <gi...@apache.org>.
sandip-db commented on code in PR #44571:
URL: https://github.com/apache/spark/pull/44571#discussion_r1443381162


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala:
##########
@@ -288,16 +253,15 @@ class XmlInferSchema(options: XmlOptions, caseSensitive: Boolean)
             case dt: DataType => dt
           }
           // Add the field and datatypes so that we can check if this is ArrayType.
-          val field = StaxXmlParserUtils.getName(e.asStartElement.getName, options)
           addOrUpdateType(nameToDataType, field, inferredType)
 
         case c: Characters if !c.isWhiteSpace =>
-          // This can be an attribute-only object
+          // This can be a value tag

Review Comment:
   ```suggestion
             // This is a value tag
   ```



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala:
##########
@@ -603,4 +518,17 @@ class XmlInferSchema(options: XmlOptions, caseSensitive: Boolean)
         newType
     }
   }
+
+  private[xml] def isPrimitiveType(dataType: DataType): Boolean = {
+    dataType match {
+      case _: StructType => false
+      case _: ArrayType => false
+      case _: MapType => false
+      case _ => true
+    }
+  }
+
+  private[xml] def isValueTagField(structField: StructField): Boolean = {
+    structField.name.toLowerCase(Locale.ROOT) == options.valueTag.toLowerCase(Locale.ROOT)

Review Comment:
   Shouldn't this be based on the `caseSensitive` option?
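
   For instance, a `caseSensitive`-aware variant might look like this (a sketch; `caseSensitive` and `options.valueTag` as defined in `XmlInferSchema`):

   ```scala
   // Compare the field name against the value tag, honoring case sensitivity.
   def isValueTagField(name: String, valueTag: String, caseSensitive: Boolean): Boolean =
     if (caseSensitive) name == valueTag else name.equalsIgnoreCase(valueTag)
   ```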





Re: [PR] [SPARK-46382][SQL] XML: Refactor inference and parsing [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #44571:
URL: https://github.com/apache/spark/pull/44571#issuecomment-1879900195

   https://github.com/apache/spark/pull/44601 had a conflict with this PR. Would you mind resolving the conflicts, please?




Re: [PR] [SPARK-46382][SQL] XML: Refactor the handling of values interspersed between elements [spark]

Posted by "shujingyang-db (via GitHub)" <gi...@apache.org>.
shujingyang-db commented on code in PR #44571:
URL: https://github.com/apache/spark/pull/44571#discussion_r1443292272


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/xml/XmlSuite.scala:
##########
@@ -2561,10 +2575,10 @@ class XmlSuite extends QueryTest with SharedSparkSession {
          |            </array1>
          |            value10
          |            <array1>
-         |                <struct3>
+         |                <struct3><!--A comment within tags-->
          |                    <array2>3</array2>
          |                    value11
-         |                    <array2>4</array2>
+         |                    <array2>4</array2><!--A comment within tags-->
          |                </struct3>
          |                <string>string</string>
          |                value12

Review Comment:
   Added 👍





Re: [PR] [SPARK-46382][SQL] XML: Refactor inference and parsing [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #44571:
URL: https://github.com/apache/spark/pull/44571#issuecomment-1879899986

   Merged to master.




Re: [PR] [SPARK-46382][SQL] XML: Refactor the handling of values interspersed between elements [spark]

Posted by "shujingyang-db (via GitHub)" <gi...@apache.org>.
shujingyang-db commented on code in PR #44571:
URL: https://github.com/apache/spark/pull/44571#discussion_r1442456803


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala:
##########
@@ -208,56 +202,36 @@ class StaxXmlParser(
     (parser.peek, dataType) match {
       case (_: StartElement, dt: DataType) => convertComplicatedType(dt, attributes)
       case (_: EndElement, _: StringType) =>
+        StaxXmlParserUtils.consumeNextEndElement(parser)
         // Empty. It's null if "" is the null value
         if (options.nullValue == "") {
           null
         } else {
           UTF8String.fromString("")
         }
-      case (_: EndElement, _: DataType) => null
+      case (_: EndElement, _: DataType) =>
+        StaxXmlParserUtils.consumeNextEndElement(parser)
+        null
       case (c: Characters, ArrayType(st, _)) =>
         // For `ArrayType`, it needs to return the type of element. The values are merged later.
         parser.next
-        convertTo(c.getData, st)
-      case (c: Characters, st: StructType) =>
-        parser.next
-        parser.peek match {
-          case _: EndElement =>
-            // It couldn't be an array of value tags
-            // as the opening tag is immediately followed by a closing tag.
-            if (c.isWhiteSpace) {
-              return null
-            }
-            val indexOpt = getFieldNameToIndex(st).get(options.valueTag)
-            indexOpt match {
-              case Some(index) =>
-                convertTo(c.getData, st.fields(index).dataType)
-              case None => null
-            }
-          case _ =>
-            val row = convertObject(parser, st)
-            if (!c.isWhiteSpace) {
-              addOrUpdate(row.toSeq(st).toArray, st, options.valueTag, c.getData, addToTail = false)
-            } else {
-              row
-            }
-        }
+        val value = convertTo(c.getData, st)
+        StaxXmlParserUtils.consumeNextEndElement(parser)

Review Comment:
   Got it! If the next event is a Comment, CDATA, etc., we want to skip those events as well, and then the end element.





Re: [PR] [SPARK-46382][SQL] XML: Refactor the handling of values interspersed between elements [spark]

Posted by "shujingyang-db (via GitHub)" <gi...@apache.org>.
shujingyang-db commented on code in PR #44571:
URL: https://github.com/apache/spark/pull/44571#discussion_r1443324753


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParserUtils.scala:
##########
@@ -178,4 +179,11 @@ object StaxXmlParserUtils {
       }
     }
   }
+
+  def consumeNextEndElement(parser: XMLEventReader): Unit = {

Review Comment:
   Added an assertion on the `endElementName`.
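
   Something along these lines (a sketch of the assertion, not the exact committed helper; `expectedName` would come from the matching `StartElement`):

   ```scala
   import javax.xml.stream.XMLEventReader
   import javax.xml.stream.events.EndElement

   def consumeNextEndElement(parser: XMLEventReader, expectedName: String): Unit =
     parser.nextEvent() match {
       case e: EndElement if e.getName.getLocalPart == expectedName => // as expected
       case other =>
         throw new IllegalStateException(s"Expected </$expectedName> but found $other")
     }
   ```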





Re: [PR] [SPARK-46382][SQL] XML: Refactor the handling of values interspersed between elements [spark]

Posted by "sandip-db (via GitHub)" <gi...@apache.org>.
sandip-db commented on code in PR #44571:
URL: https://github.com/apache/spark/pull/44571#discussion_r1443212881


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParserUtils.scala:
##########
@@ -178,4 +179,11 @@ object StaxXmlParserUtils {
       }
     }
   }
+
+  def consumeNextEndElement(parser: XMLEventReader): Unit = {
+    parser.nextEvent() match {
+      case _: EndElement => // do nothing
+      case _ => throw new IllegalStateException("Invalid state")
+    }

Review Comment:
   You are right. We are filtering comments [here](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParserUtils.scala#L43). Do we need special handling for CDATA or EntityReferences?





Re: [PR] [SPARK-46382][SQL] XML: Refactor inference and parsing [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #44571:
URL: https://github.com/apache/spark/pull/44571#issuecomment-1880225723

   Merged to master.




Re: [PR] [SPARK-46382][SQL] XML: Refactor the handling of values interspersed between elements [spark]

Posted by "sandip-db (via GitHub)" <gi...@apache.org>.
sandip-db commented on code in PR #44571:
URL: https://github.com/apache/spark/pull/44571#discussion_r1440997050


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala:
##########
@@ -235,22 +227,6 @@ class XmlInferSchema(options: XmlOptions, caseSensitive: Boolean)
     val nameToDataType =
       collection.mutable.TreeMap.empty[String, DataType](caseSensitivityOrdering)
 

Review Comment:
   Define `firstEventIsStartElement` here:
   
   `val firstEventIsStartElement = parser.peek().isStartElement`



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala:
##########
@@ -288,16 +265,17 @@ class XmlInferSchema(options: XmlOptions, caseSensitive: Boolean)
             case dt: DataType => dt
           }
           // Add the field and datatypes so that we can check if this is ArrayType.
-          val field = StaxXmlParserUtils.getName(e.asStartElement.getName, options)
           addOrUpdateType(nameToDataType, field, inferredType)
 
         case c: Characters if !c.isWhiteSpace =>
-          // This can be an attribute-only object
+          // This can be a value tag
           val valueTagType = inferFrom(c.getData)
           addOrUpdateType(nameToDataType, options.valueTag, valueTagType)
 
-        case _: EndElement =>
-          shouldStop = inferAndCheckEndElement(parser)
+        case e: EndElement =>
+          // In case of corrupt records, we shouldn't read beyond the EndDocument
+          shouldStop = parser.peek().isInstanceOf[EndDocument] || StaxXmlParserUtils
+              .getName(e.getName, options) == startElementName

Review Comment:
   ```suggestion
             shouldStop = if (firstEventIsStartElement) {
                          StaxXmlParserUtils.checkEndElement(parser)
                         } else {
                           true
                         }
   ```



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala:
##########
@@ -211,6 +202,7 @@ class XmlInferSchema(options: XmlOptions, caseSensitive: Boolean)
    */
   private def inferObject(
       parser: XMLEventReader,
+      startElementName: String,

Review Comment:
   This won't be necessary after the change below.





Re: [PR] [SPARK-46382][SQL] XML: Refactor the handling of values interspersed between elements [spark]

Posted by "shujingyang-db (via GitHub)" <gi...@apache.org>.
shujingyang-db commented on code in PR #44571:
URL: https://github.com/apache/spark/pull/44571#discussion_r1442455921


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala:
##########
@@ -423,19 +396,22 @@ class StaxXmlParser(
                 }
               } else {
                 StaxXmlParserUtils.skipChildren(parser)
+                StaxXmlParserUtils.consumeNextEndElement(parser)

Review Comment:
   `skipChildren` is called recursively. We just want to skip the `EndElement` after skipping all of its children.
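
   In other words, the combined effect is roughly this depth-based skip (a sketch of the intent, not the actual `skipChildren` implementation):

   ```scala
   import javax.xml.stream.XMLEventReader
   import javax.xml.stream.events.{EndElement, StartElement}

   // Assumes the element's StartElement has just been consumed: skip all
   // children, then consume the element's own EndElement.
   def skipElementAndEnd(parser: XMLEventReader): Unit = {
     var depth = 1
     while (depth > 0) {
       parser.nextEvent() match {
         case _: StartElement => depth += 1
         case _: EndElement   => depth -= 1 // the last one is the element's own
         case _               => // characters, comments, etc. are skipped
       }
     }
   }
   ```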





Re: [PR] [SPARK-46382][SQL] XML: Refactor the handling of values interspersed between elements [spark]

Posted by "sandip-db (via GitHub)" <gi...@apache.org>.
sandip-db commented on code in PR #44571:
URL: https://github.com/apache/spark/pull/44571#discussion_r1443135604


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/xml/XmlSuite.scala:
##########
@@ -2561,10 +2575,10 @@ class XmlSuite extends QueryTest with SharedSparkSession {
          |            </array1>
          |            value10
          |            <array1>
-         |                <struct3>
+         |                <struct3><!--A comment within tags-->
          |                    <array2>3</array2>
          |                    value11
-         |                    <array2>4</array2>
+         |                    <array2>4</array2><!--A comment within tags-->

Review Comment:
   Adding multiple comments with spaces between them:
   ```suggestion
            |                    <array2>4</array2> <!--First comment--> <!--Second comment--> 
   ```



##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/xml/XmlSuite.scala:
##########
@@ -2561,10 +2575,10 @@ class XmlSuite extends QueryTest with SharedSparkSession {
          |            </array1>
          |            value10
          |            <array1>
-         |                <struct3>
+         |                <struct3><!--A comment within tags-->
          |                    <array2>3</array2>
          |                    value11
-         |                    <array2>4</array2>
+         |                    <array2>4</array2><!--A comment within tags-->
          |                </struct3>
          |                <string>string</string>
          |                value12

Review Comment:
   GitHub doesn't allow adding a review comment to an arbitrary line, so commenting here.
   
   Add a comment between `value16` and `</ROW>` with a space before and after the comment.



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParserUtils.scala:
##########
@@ -178,4 +179,11 @@ object StaxXmlParserUtils {
       }
     }
   }
+
+  def consumeNextEndElement(parser: XMLEventReader): Unit = {

Review Comment:
   For defensive purpose, consider passing the name of the `StartElement` and assert that it matches with the `EndElement`:
   
   ```suggestion
     def skipNextEndElement(parser: XMLEventReader): Unit = {
   ```



##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/xml/XmlSuite.scala:
##########
@@ -2550,7 +2564,7 @@ class XmlSuite extends QueryTest with SharedSparkSession {
          |                value4
          |                <struct3>
          |                    value5
-         |                    <array2>1</array2>
+         |                    <array2>1<!--A comment within tags--></array2>

Review Comment:
   Adding multiple comments with spaces between them:
   ```suggestion
            |                    <array2>1 <!--First comment--> <!--Second comment--> </array2>
   ```



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala:
##########
@@ -288,16 +253,15 @@ class XmlInferSchema(options: XmlOptions, caseSensitive: Boolean)
             case dt: DataType => dt
           }
           // Add the field and datatypes so that we can check if this is ArrayType.
-          val field = StaxXmlParserUtils.getName(e.asStartElement.getName, options)
           addOrUpdateType(nameToDataType, field, inferredType)
 
         case c: Characters if !c.isWhiteSpace =>
-          // This can be an attribute-only object
+          // This can be a value tag
           val valueTagType = inferFrom(c.getData)
           addOrUpdateType(nameToDataType, options.valueTag, valueTagType)
 
         case _: EndElement =>

Review Comment:
   Is `EndDocument` needed here as well?



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParserUtils.scala:
##########
@@ -178,4 +179,11 @@ object StaxXmlParserUtils {
       }
     }
   }
+
+  def consumeNextEndElement(parser: XMLEventReader): Unit = {
+    parser.nextEvent() match {
+      case _: EndElement => // do nothing
+      case _ => throw new IllegalStateException("Invalid state")
+    }

Review Comment:
   Extra events (comments, etc.) before EndElement need to be skipped.
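
   If such events can reach the reader at this point, one defensive option would be to drain them first (a sketch, not the committed fix):

   ```scala
   import javax.xml.stream.XMLEventReader
   import javax.xml.stream.events.EndElement

   // Drop comments, whitespace, etc. until the EndElement itself is consumed.
   def skipToEndElement(parser: XMLEventReader): Unit =
     while (!parser.nextEvent().isInstanceOf[EndElement]) {}
   ```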





Re: [PR] [SPARK-46382][SQL] XML: Refactor the handling of values interspersed between elements [spark]

Posted by "shujingyang-db (via GitHub)" <gi...@apache.org>.
shujingyang-db commented on code in PR #44571:
URL: https://github.com/apache/spark/pull/44571#discussion_r1442398779


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala:
##########
@@ -175,31 +174,26 @@ class XmlInferSchema(options: XmlOptions, caseSensitive: Boolean)
         parser.nextEvent()
         parser.peek match {
           case _: StartElement => inferObject(parser)
-          case _: EndElement if data.isEmpty => NullType
-          case _: EndElement if options.nullValue == "" => NullType
-          case _: EndElement => StringType
+          case _: EndElement if data.trim.isEmpty =>

Review Comment:
   `data.trim.isEmpty` isn't always true: we do a `parser.nextEvent()` in between, so `c.isWhiteSpace` was true for the last event, while the check here applies to the current event.





Re: [PR] [SPARK-46382][SQL] XML: Refactor the handling of values interspersed between elements [spark]

Posted by "shujingyang-db (via GitHub)" <gi...@apache.org>.
shujingyang-db commented on code in PR #44571:
URL: https://github.com/apache/spark/pull/44571#discussion_r1443186772


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParserUtils.scala:
##########
@@ -178,4 +179,11 @@ object StaxXmlParserUtils {
       }
     }
   }
+
+  def consumeNextEndElement(parser: XMLEventReader): Unit = {
+    parser.nextEvent() match {
+      case _: EndElement => // do nothing
+      case _ => throw new IllegalStateException("Invalid state")
+    }

Review Comment:
   I think comments are removed before they reach the parser. WDYT?





Re: [PR] [SPARK-46382][SQL] XML: Refactor inference and parsing [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon closed pull request #44571: [SPARK-46382][SQL] XML: Refactor inference and parsing
URL: https://github.com/apache/spark/pull/44571




Re: [PR] [SPARK-46382][SQL] XML: Refactor the handling of values interspersed between elements [spark]

Posted by "shujingyang-db (via GitHub)" <gi...@apache.org>.
shujingyang-db commented on code in PR #44571:
URL: https://github.com/apache/spark/pull/44571#discussion_r1441349590


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala:
##########
@@ -288,16 +265,17 @@ class XmlInferSchema(options: XmlOptions, caseSensitive: Boolean)
             case dt: DataType => dt
           }
           // Add the field and datatypes so that we can check if this is ArrayType.
-          val field = StaxXmlParserUtils.getName(e.asStartElement.getName, options)
           addOrUpdateType(nameToDataType, field, inferredType)
 
         case c: Characters if !c.isWhiteSpace =>
-          // This can be an attribute-only object
+          // This can be a value tag
           val valueTagType = inferFrom(c.getData)
           addOrUpdateType(nameToDataType, options.valueTag, valueTagType)
 
-        case _: EndElement =>
-          shouldStop = inferAndCheckEndElement(parser)
+        case e: EndElement =>
+          // In case of corrupt records, we shouldn't read beyond the EndDocument
+          shouldStop = parser.peek().isInstanceOf[EndDocument] || StaxXmlParserUtils
+              .getName(e.getName, options) == startElementName

Review Comment:
   We discussed offline and decided to simplify to `shouldStop = true`. With that, we will also make sure that the entire entry, including the start tag, value, and end tag, is fully consumed when parsing completes.





Re: [PR] [SPARK-46382][SQL] XML: Refactor the handling of values interspersed between elements [spark]

Posted by "shujingyang-db (via GitHub)" <gi...@apache.org>.
shujingyang-db commented on code in PR #44571:
URL: https://github.com/apache/spark/pull/44571#discussion_r1443324939


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/xml/XmlSuite.scala:
##########
@@ -2561,10 +2575,10 @@ class XmlSuite extends QueryTest with SharedSparkSession {
          |            </array1>
          |            value10
          |            <array1>
-         |                <struct3>
+         |                <struct3><!--A comment within tags-->
          |                    <array2>3</array2>
          |                    value11
-         |                    <array2>4</array2>
+         |                    <array2>4</array2><!--A comment within tags-->

Review Comment:
   Added 👍



##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/xml/XmlSuite.scala:
##########
@@ -2550,7 +2564,7 @@ class XmlSuite extends QueryTest with SharedSparkSession {
          |                value4
          |                <struct3>
          |                    value5
-         |                    <array2>1</array2>
+         |                    <array2>1<!--A comment within tags--></array2>

Review Comment:
   Added 👍





Re: [PR] [SPARK-46382][SQL] XML: Refactor the handling of values interspersed between elements [spark]

Posted by "shujingyang-db (via GitHub)" <gi...@apache.org>.
shujingyang-db commented on code in PR #44571:
URL: https://github.com/apache/spark/pull/44571#discussion_r1443324597


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParserUtils.scala:
##########
@@ -178,4 +179,11 @@ object StaxXmlParserUtils {
       }
     }
   }
+
+  def consumeNextEndElement(parser: XMLEventReader): Unit = {
+    parser.nextEvent() match {
+      case _: EndElement => // do nothing
+      case _ => throw new IllegalStateException("Invalid state")
+    }

Review Comment:
   Discussed offline. The current implementation should work fine with `CDATA`, and we will filter out `EntityReferences`.





Re: [PR] [SPARK-46382][SQL] XML: Refactor the handling of values interspersed between elements [spark]

Posted by "sandip-db (via GitHub)" <gi...@apache.org>.
sandip-db commented on code in PR #44571:
URL: https://github.com/apache/spark/pull/44571#discussion_r1442378103


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala:
##########
@@ -175,31 +174,26 @@ class XmlInferSchema(options: XmlOptions, caseSensitive: Boolean)
         parser.nextEvent()
         parser.peek match {

Review Comment:
   This case (`case c: Characters if c.isWhiteSpace`) can be combined with the next one, with a minor change suggested below.



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala:
##########
@@ -175,31 +174,26 @@ class XmlInferSchema(options: XmlOptions, caseSensitive: Boolean)
         parser.nextEvent()
         parser.peek match {
           case _: StartElement => inferObject(parser)
-          case _: EndElement if data.isEmpty => NullType
-          case _: EndElement if options.nullValue == "" => NullType
-          case _: EndElement => StringType
+          case _: EndElement if data.trim.isEmpty =>

Review Comment:
   Won't `data.trim.isEmpty` always be true, given that `c.isWhiteSpace` is true?



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala:
##########
@@ -175,31 +174,26 @@ class XmlInferSchema(options: XmlOptions, caseSensitive: Boolean)
         parser.nextEvent()
         parser.peek match {
           case _: StartElement => inferObject(parser)
-          case _: EndElement if data.isEmpty => NullType
-          case _: EndElement if options.nullValue == "" => NullType
-          case _: EndElement => StringType
+          case _: EndElement if data.trim.isEmpty =>
+            StaxXmlParserUtils.consumeNextEndElement(parser)
+            NullType
+          case _: EndElement if options.nullValue == "" =>

Review Comment:
   Given that `c.isWhiteSpace` is true, the previous case will match for `EndElement`, so this one will never be reached.



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala:
##########
@@ -175,31 +174,26 @@ class XmlInferSchema(options: XmlOptions, caseSensitive: Boolean)
         parser.nextEvent()
         parser.peek match {
           case _: StartElement => inferObject(parser)
-          case _: EndElement if data.isEmpty => NullType
-          case _: EndElement if options.nullValue == "" => NullType
-          case _: EndElement => StringType
+          case _: EndElement if data.trim.isEmpty =>
+            StaxXmlParserUtils.consumeNextEndElement(parser)
+            NullType
+          case _: EndElement if options.nullValue == "" =>
+            StaxXmlParserUtils.consumeNextEndElement(parser)
+            NullType
+          case _: EndElement =>
+            StaxXmlParserUtils.consumeNextEndElement(parser)
+            StringType
           case _ => inferField(parser)
         }
       case c: Characters if !c.isWhiteSpace =>
-        val characterType = inferFrom(c.getData)
-        parser.nextEvent()
-        parser.peek match {
-          case _: StartElement =>
-            // Some more elements follow;
-            // This is a mix of values and other elements
-            val innerType = inferObject(parser).asInstanceOf[StructType]
-            addOrUpdateValueTagType(innerType, characterType)
-          case _ =>
-            val fieldType = inferField(parser)
-            fieldType match {
-              case st: StructType => addOrUpdateValueTagType(st, characterType)
-              case _: NullType => characterType
-              case _: DataType =>
-                // The field type couldn't be an array type
-                new StructType()
-                .add(options.valueTag, addOrUpdateType(Some(characterType), fieldType))
-
-            }
+        val structType = inferObject(parser).asInstanceOf[StructType]
+        structType match {
+          case simpleType
+              if structType.fields.length == 1
+              && isPrimitiveType(structType.fields.head.dataType)
+              && isValueTagField(structType.fields.head) =>
+            simpleType.fields.head.dataType
+          case _ => structType

Review Comment:
   Remove the case (`case c: Characters if c.isWhiteSpace`) and add the following case here:
   ```suggestion
             case simpleType if structType.isEmpty => NullType
             case _ => structType
   ```





Re: [PR] [SPARK-46382][SQL] XML: Refactor inference and parsing [spark]

Posted by "shujingyang-db (via GitHub)" <gi...@apache.org>.
shujingyang-db commented on code in PR #44571:
URL: https://github.com/apache/spark/pull/44571#discussion_r1443624916


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala:
##########
@@ -423,19 +396,22 @@ class StaxXmlParser(
                 }
               } else {
                 StaxXmlParserUtils.skipChildren(parser)
+                StaxXmlParserUtils.consumeNextEndElement(parser)
               }
           }
         } catch {
           case e: SparkUpgradeException => throw e
           case NonFatal(e) =>
             badRecordException = badRecordException.orElse(Some(e))
+            StaxXmlParserUtils.skipChildren(parser)
+            StaxXmlParserUtils.consumeNextEndElement(parser)

Review Comment:
   Discussed offline. The current implementation doesn't support partial results when there's a bad-record exception, so I will remove these lines.


