You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "sandip-db (via GitHub)" <gi...@apache.org> on 2024/01/11 06:13:01 UTC

[PR] [SPARK-46667][SQL] XML: Throw error on multiple XML data source [spark]

sandip-db opened a new pull request, #44685:
URL: https://github.com/apache/spark/pull/44685

   ### What changes were proposed in this pull request?
   Spark-XML library users will notice some changes with built-in XML. Notably, some of the character data that was earlier dropped by spark-xml will be faithfully parsed by the built-in XML. Support for new data types like DecimalType, TimeStampNTZ, etc. have also been added. Few spark-xml users may still want to continue using the library. 
   
   Rather than implicitly switching to the built-in XML, this PR throws an error when an external data source is detected for XML format.
   
   Users can continue using the old spark-xml library by specifying the full package name. Alternatively, they can remove spark-xml from the classpath and switch to built-in XML.
   
   ### Why are the changes needed?
   Same as above
   
   ### Does this PR introduce _any_ user-facing change?
   Yes
   
   
   ### How was this patch tested?
   New unit test
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46667][SQL] XML: Throw error on multiple XML data source [spark]

Posted by "sandip-db (via GitHub)" <gi...@apache.org>.
sandip-db commented on code in PR #44685:
URL: https://github.com/apache/spark/pull/44685#discussion_r1449364116


##########
common/utils/src/main/resources/error/error-classes.json:
##########
@@ -2661,6 +2661,11 @@
     ],
     "sqlState" : "42K0E"
   },
+  "MULTIPLE_XML_DATA_SOURCE" : {

Review Comment:
   Sorry.. It's UserDefinedDataSource and not UDF. I read it wrong. 
   
   Nevertheless, I can't think of a way to combine the messaging in `FOUND_MULTIPLE_DATA_SOURCES` and `MULTIPLE_XML_DATA_SOURCE` as the later is drawing a distinction between internal and external data source and providing guidance on how to use one or the other.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46667][SQL] XML: Throw error on multiple XML data source [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #44685:
URL: https://github.com/apache/spark/pull/44685#discussion_r1449241661


##########
common/utils/src/main/resources/error/error-classes.json:
##########
@@ -2661,6 +2661,11 @@
     ],
     "sqlState" : "42K0E"
   },
+  "MULTIPLE_XML_DATA_SOURCE" : {

Review Comment:
   > FOUND_MULTIPLE_DATA_SOURCES is currently being used for user defined functions.
   
   Sorry, I didn't get how it is related to UDF. Could you elaborate this a little bit more.  As I can see the error comes from `DataSource` at https://github.com/apache/spark/blob/4b0ceaddde17a870289f20269526fba39242bbda/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L679-L682
   
   > This message won't be applicable to the UDF scenario for which FOUND_MULTIPLE_DATA_SOURCES is being used.
   
   Where is UDF there?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46667][SQL] XML: Throw error on multiple XML data source [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #44685:
URL: https://github.com/apache/spark/pull/44685#issuecomment-1888191813

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46667][SQL] XML: Throw error on multiple XML data source [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon closed pull request #44685: [SPARK-46667][SQL] XML: Throw error on multiple XML data source
URL: https://github.com/apache/spark/pull/44685


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46667][SQL] XML: Throw error on multiple XML data source [spark]

Posted by "sandip-db (via GitHub)" <gi...@apache.org>.
sandip-db commented on code in PR #44685:
URL: https://github.com/apache/spark/pull/44685#discussion_r1449173137


##########
common/utils/src/main/resources/error/error-classes.json:
##########
@@ -2661,6 +2661,11 @@
     ],
     "sqlState" : "42K0E"
   },
+  "MULTIPLE_XML_DATA_SOURCE" : {

Review Comment:
   @MaxGekk `FOUND_MULTIPLE_DATA_SOURCES` is currently being used for user defined functions.
   
   For csv and other built-in formats, spark implicitly chooses the built-in data source. While we expect that most of the spark-xml users would prefer to use the new built-in XML, a few of them may still want to continue using the external library. So we wanted to be explicit when an external xml data source is detected by throwing an error with a descriptive message that tells users to either  use the full classpath name or remove the external library from the classpath. This message won't be applicable to the UDF scenario for which `FOUND_MULTIPLE_DATA_SOURCES` is being used.
   
   `_LEGACY_ERROR_TEMP_1141` and `FOUND_MULTIPLE_DATA_SOURCES` seems duplicative though.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46667][SQL] XML: Throw error on multiple XML data source [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #44685:
URL: https://github.com/apache/spark/pull/44685#discussion_r1448391961


##########
common/utils/src/main/resources/error/error-classes.json:
##########
@@ -2661,6 +2661,11 @@
     ],
     "sqlState" : "42K0E"
   },
+  "MULTIPLE_XML_DATA_SOURCE" : {

Review Comment:
   Can't you re-use the existing error class:
   https://github.com/apache/spark/blob/48d429dd801441839625dc4917b5a64ef70ced18/common/utils/src/main/resources/error/error-classes.json#L1245
   If it is needed, you could extend/improve it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org