You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "sandip-db (via GitHub)" <gi...@apache.org> on 2023/10/16 22:22:45 UTC

[PR] [SPARK-45562][SQL] XML: Make 'rowTag' a required option [spark]

sandip-db opened a new pull request, #43389:
URL: https://github.com/apache/spark/pull/43389

   ### What changes were proposed in this pull request?
   User can specify `rowTag` option that is the name of the XML element that maps to a `DataFrame Row`.  A non-existent `rowTag` will not infer any schema or generate any `DataFrame` rows. Currently, not specifying `rowTag` option results in picking up its default value of  `ROW`, which won't match a real XML element in most scenarios. This results in an empty dataframe and confuse new users. 
   
   This PR makes `rowTag` a required option.
   
   ### Why are the changes needed?
   See above
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   New unit tests
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45562][SQL] XML: Make 'rowTag' a required option [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #43389:
URL: https://github.com/apache/spark/pull/43389#discussion_r1361380449


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlOptions.scala:
##########
@@ -63,8 +63,10 @@ private[sql] class XmlOptions(
   }
 
   val compressionCodec = parameters.get(COMPRESSION).map(CompressionCodecs.getCodecClassName)
-  val rowTag = parameters.getOrElse(ROW_TAG, XmlOptions.DEFAULT_ROW_TAG)
-  require(rowTag.nonEmpty, s"'$ROW_TAG' option should not be empty string.")
+  val rowTagOpt = parameters.get(ROW_TAG)
+  require(rowTagOpt.isDefined, s"'$ROW_TAG' option is required.")

Review Comment:
   While I agree with the logic, I think we should probably add a conf to restore this behaviour back (at `SQLConf`) because we now change the default behaviour. Should also probably add a note at https://github.com/apache/spark/blob/master/docs/sql-migration-guide.md
   
   Otherwise LGTM



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45562][SQL] XML: Make 'rowTag' a required option [spark]

Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on code in PR #43389:
URL: https://github.com/apache/spark/pull/43389#discussion_r1363006078


##########
sql/core/src/test/java/test/org/apache/spark/sql/execution/datasources/xml/JavaXmlSuite.java:
##########
@@ -82,7 +82,7 @@ private Path getEmptyTempDir() throws IOException {
     public void testXmlParser() {
         Map<String, String> options = new HashMap<>();
         options.put("rowTag", booksFileTag);
-        Dataset<Row> df = spark.read().options(options).format("xml").load(booksFile);
+        Dataset<Row> df = spark.read().options(options).xml(booksFile);

Review Comment:
   Because this change is not related to the theme, it is not recommended to modify it. But it does simplify the code a bit and is also OK.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45562][SQL] XML: Make 'rowTag' a required option [spark]

Posted by "sandip-db (via GitHub)" <gi...@apache.org>.
sandip-db commented on code in PR #43389:
URL: https://github.com/apache/spark/pull/43389#discussion_r1362395290


##########
sql/core/src/test/java/test/org/apache/spark/sql/execution/datasources/xml/JavaXmlSuite.java:
##########
@@ -82,7 +82,7 @@ private Path getEmptyTempDir() throws IOException {
     public void testXmlParser() {
         Map<String, String> options = new HashMap<>();
         options.put("rowTag", booksFileTag);
-        Dataset<Row> df = spark.read().options(options).format("xml").load(booksFile);
+        Dataset<Row> df = spark.read().options(options).xml(booksFile);

Review Comment:
   Switched to shortened version for better code readability. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45562][SQL] XML: Make 'rowTag' a required option [spark]

Posted by "sandip-db (via GitHub)" <gi...@apache.org>.
sandip-db commented on code in PR #43389:
URL: https://github.com/apache/spark/pull/43389#discussion_r1363007527


##########
sql/core/src/test/java/test/org/apache/spark/sql/execution/datasources/xml/JavaXmlSuite.java:
##########
@@ -82,7 +82,7 @@ private Path getEmptyTempDir() throws IOException {
     public void testXmlParser() {
         Map<String, String> options = new HashMap<>();
         options.put("rowTag", booksFileTag);
-        Dataset<Row> df = spark.read().options(options).format("xml").load(booksFile);
+        Dataset<Row> df = spark.read().options(options).xml(booksFile);

Review Comment:
   Agreed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45562][SQL] XML: Make 'rowTag' a required option [spark]

Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on code in PR #43389:
URL: https://github.com/apache/spark/pull/43389#discussion_r1362021195


##########
sql/core/src/test/java/test/org/apache/spark/sql/execution/datasources/xml/JavaXmlSuite.java:
##########
@@ -82,7 +82,7 @@ private Path getEmptyTempDir() throws IOException {
     public void testXmlParser() {
         Map<String, String> options = new HashMap<>();
         options.put("rowTag", booksFileTag);
-        Dataset<Row> df = spark.read().options(options).format("xml").load(booksFile);
+        Dataset<Row> df = spark.read().options(options).xml(booksFile);

Review Comment:
   Why change this line?



##########
sql/core/src/test/java/test/org/apache/spark/sql/execution/datasources/xml/JavaXmlSuite.java:
##########
@@ -92,7 +92,7 @@ public void testXmlParser() {
     public void testLoad() {
         Map<String, String> options = new HashMap<>();
         options.put("rowTag", booksFileTag);
-        Dataset<Row> df = spark.read().options(options).format("xml").load(booksFile);
+        Dataset<Row> df = spark.read().options(options).xml(booksFile);

Review Comment:
   ditto.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45562][SQL] XML: Make 'rowTag' a required option [spark]

Posted by "sandip-db (via GitHub)" <gi...@apache.org>.
sandip-db commented on code in PR #43389:
URL: https://github.com/apache/spark/pull/43389#discussion_r1362395630


##########
sql/core/src/test/java/test/org/apache/spark/sql/execution/datasources/xml/JavaXmlSuite.java:
##########
@@ -92,7 +92,7 @@ public void testXmlParser() {
     public void testLoad() {
         Map<String, String> options = new HashMap<>();
         options.put("rowTag", booksFileTag);
-        Dataset<Row> df = spark.read().options(options).format("xml").load(booksFile);
+        Dataset<Row> df = spark.read().options(options).xml(booksFile);

Review Comment:
   same as above.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45562][SQL] XML: Make 'rowTag' a required option [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon closed pull request #43389: [SPARK-45562][SQL] XML: Make 'rowTag' a required option
URL: https://github.com/apache/spark/pull/43389


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45562][SQL] XML: Make 'rowTag' a required option [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #43389:
URL: https://github.com/apache/spark/pull/43389#discussion_r1361380449


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlOptions.scala:
##########
@@ -63,8 +63,10 @@ private[sql] class XmlOptions(
   }
 
   val compressionCodec = parameters.get(COMPRESSION).map(CompressionCodecs.getCodecClassName)
-  val rowTag = parameters.getOrElse(ROW_TAG, XmlOptions.DEFAULT_ROW_TAG)
-  require(rowTag.nonEmpty, s"'$ROW_TAG' option should not be empty string.")
+  val rowTagOpt = parameters.get(ROW_TAG)
+  require(rowTagOpt.isDefined, s"'$ROW_TAG' option is required.")

Review Comment:
   While I agree with the logic, I think we should probably add a conf to restore this behaviour back (at `SQLConf`) because we now change the default behaviour. Should also probably add a note at https://github.com/apache/spark/blob/master/docs/sql-migration-guide.md
   
   Otherwise LGTM



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45562][SQL] XML: Make 'rowTag' a required option [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #43389:
URL: https://github.com/apache/spark/pull/43389#issuecomment-1766234229

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org