You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "beliefer (via GitHub)" <gi...@apache.org> on 2023/10/19 10:55:43 UTC

[PR] [SPARK-45562][SQL][FOLLOWUP] Restore test case not require `rowTag` option. [spark]

beliefer opened a new pull request, #43455:
URL: https://github.com/apache/spark/pull/43455

   ### What changes were proposed in this pull request?
   https://github.com/apache/spark/pull/43389 makes `rowTag` a required option. But the xml API (please see https://github.com/apache/spark/blob/7057952f6bc2c5cf97dd408effd1b18bee1cb8f4/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L579C1-L579C1) is unrelated to `rowTag`.
   
   This PR also improve some code and remove the unused code.
   
   
   ### Why are the changes needed?
   Restore test case not require `rowTag` option.
   
   
   ### Does this PR introduce _any_ user-facing change?
   'No'.
   
   
   ### How was this patch tested?
   Exists test cases.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   'No'.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45562][SQL][FOLLOWUP] Restore test case not require `rowTag` option. [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #43455:
URL: https://github.com/apache/spark/pull/43455#issuecomment-1771929794

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45562][SQL][FOLLOWUP] Restore test case not require `rowTag` option. [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon closed pull request #43455: [SPARK-45562][SQL][FOLLOWUP] Restore test case not require `rowTag` option.
URL: https://github.com/apache/spark/pull/43455


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45562][SQL][FOLLOWUP] Restore test case not require `rowTag` option. [spark]

Posted by "sandip-db (via GitHub)" <gi...@apache.org>.
sandip-db commented on code in PR #43455:
URL: https://github.com/apache/spark/pull/43455#discussion_r1366303706


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/xml/XmlSuite.scala:
##########
@@ -1209,10 +1209,9 @@ class XmlSuite extends QueryTest with SharedSparkSession {
       "<ROW><year>2015</year><make>Chevy</make><model>Volt</model><comment>No</comment></ROW>")
     val xmlRDD = spark.sparkContext.parallelize(data)
     val ds = spark.createDataset(xmlRDD)(Encoders.STRING)
-    assert(spark.read.option("rowTag", "ROW").xml(ds).collect().length === 3)

Review Comment:
   It is [assumed](https://github.com/apache/spark/blob/8fd915ffaba1cc99813cc8d6d2a28688d7fae39b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L576) that each XML string in the `Dataset` contains only one `Row`. So the parser doesn't look for `rowTag` as it iterates over all the strings in the Dataset.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45562][SQL][FOLLOWUP] Restore test case not require `rowTag` option. [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #43455:
URL: https://github.com/apache/spark/pull/43455#issuecomment-1771871187

   cc @sandip-db 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45562][SQL][FOLLOWUP] Restore test case not require `rowTag` option. [spark]

Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on PR #43455:
URL: https://github.com/apache/spark/pull/43455#issuecomment-1771939025

   @HyukjinKwon @sandip-db Thank you!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45562][SQL][FOLLOWUP] Restore test case not require `rowTag` option. [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #43455:
URL: https://github.com/apache/spark/pull/43455#discussion_r1366294114


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlOptions.scala:
##########
@@ -66,9 +66,9 @@ private[sql] class XmlOptions(
   }
 
   val compressionCodec = parameters.get(COMPRESSION).map(CompressionCodecs.getCodecClassName)
-  val rowTagOpt = parameters.get(XmlOptions.ROW_TAG)
+  val rowTagOpt = parameters.get(XmlOptions.ROW_TAG).map(_.trim)
   require(!rowTagRequired || rowTagOpt.isDefined, s"'${XmlOptions.ROW_TAG}' option is required.")

Review Comment:
   @sandip-db do you remember why we need `rowTagRequired` variable?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45562][SQL][FOLLOWUP] Restore test case not require `rowTag` option. [spark]

Posted by "sandip-db (via GitHub)" <gi...@apache.org>.
sandip-db commented on code in PR #43455:
URL: https://github.com/apache/spark/pull/43455#discussion_r1366297384


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlOptions.scala:
##########
@@ -66,9 +66,9 @@ private[sql] class XmlOptions(
   }
 
   val compressionCodec = parameters.get(COMPRESSION).map(CompressionCodecs.getCodecClassName)
-  val rowTagOpt = parameters.get(XmlOptions.ROW_TAG)
+  val rowTagOpt = parameters.get(XmlOptions.ROW_TAG).map(_.trim)
   require(!rowTagRequired || rowTagOpt.isDefined, s"'${XmlOptions.ROW_TAG}' option is required.")

Review Comment:
   I added it as part of my last [PR](https://github.com/apache/spark/pull/43416/files). This is because `rowTag` is required only for file read/write (`XmlFIleFormat`) and ignored by `from_xml`, `schema_of_xml` and `DataFrameReader.xml` that takes `Dataset` as argument. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45562][SQL][FOLLOWUP] Restore test case not require `rowTag` option. [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #43455:
URL: https://github.com/apache/spark/pull/43455#discussion_r1366292552


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/xml/XmlSuite.scala:
##########
@@ -1209,10 +1209,9 @@ class XmlSuite extends QueryTest with SharedSparkSession {
       "<ROW><year>2015</year><make>Chevy</make><model>Volt</model><comment>No</comment></ROW>")
     val xmlRDD = spark.sparkContext.parallelize(data)
     val ds = spark.createDataset(xmlRDD)(Encoders.STRING)
-    assert(spark.read.option("rowTag", "ROW").xml(ds).collect().length === 3)

Review Comment:
   Wait, we still need this to be required I guess? `ROW` can be different in the text Dataset?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org