Posted to reviews@spark.apache.org by "laglangyue (via GitHub)" <gi...@apache.org> on 2023/10/12 08:50:06 UTC

[PR] [SPARK-44752] XML: Update Spark Docs [spark]

laglangyue opened a new pull request, #43350:
URL: https://github.com/apache/spark/pull/43350

   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
     2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
     3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
     4. Be sure to keep the PR description updated to reflect all changes.
     5. Please write your PR title to summarize what this PR proposes.
     6. If possible, provide a concise example to reproduce the issue for a faster review.
     7. If you want to add a new configuration, please read the guideline first for naming configurations in
        'core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala'.
     8. If you want to add or modify an error type or message, please read the guideline first in
        'core/src/main/resources/error/README.md'.
   -->
   
   ### What changes were proposed in this pull request?
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If you fix some SQL features, you can provide some references of other DBMSes.
     3. If there is design documentation, please add the link.
     4. If there is a discussion in the mailing list, please add the link.
   -->
   
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   
   
   ### Does this PR introduce _any_ user-facing change?
   <!--
   Note that it means *any* user-facing change including all aspects such as the documentation fix.
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
   If possible, please also clarify if this is a user-facing change compared to the released Spark versions or within the unreleased branches such as master.
   If no, write 'No'.
   -->
   
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   If benchmark tests were added, please run the benchmarks in GitHub Actions for the consistent environment, and the instructions could accord to: https://spark.apache.org/developer-tools.html#github-workflow-benchmarks.
   -->
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   <!--
   If generative AI tooling has been used in the process of authoring this patch, please include the
   phrase: 'Generated-by: ' followed by the name of the tool and its version.
   If no, write 'No'.
   Please refer to the [ASF Generative Tooling Guidance](https://www.apache.org/legal/generative-tooling.html) for details.
   -->
   




Re: [PR] [SPARK-44752][SQL] XML: Update Spark Docs [spark]

Posted by "laglangyue (via GitHub)" <gi...@apache.org>.
laglangyue commented on code in PR #43350:
URL: https://github.com/apache/spark/pull/43350#discussion_r1364771502


##########
docs/sql-data-sources-xml.md:
##########
@@ -0,0 +1,232 @@
+---
+layout: global
+title: XML Files
+displayTitle: XML Files
+license: |
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+  
+       http://www.apache.org/licenses/LICENSE-2.0
+  
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+---
+
+Spark SQL provides `spark.read().xml("file_1_path","file_2_path")` to read a file or directory of files in XML format into a Spark DataFrame, and `dataframe.write().xml("path")` to write to a xml file. When reading a XML file, the `rowTag` option must be specified to indicate the XML element that maps to a `DataFrame row`. The option() function can be used to customize the behavior of reading or writing, such as controlling behavior of the XML attributes, XSD validation, compression, and so on.
+
+<div class="codetabs">
+
+<div data-lang="python"  markdown="1">
+{% include_example xml_dataset python/sql/datasource.py %}
+</div>
+
+<div data-lang="scala"  markdown="1">
+{% include_example xml_dataset scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
+</div>
+
+<div data-lang="java"  markdown="1">
+{% include_example xml_dataset java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
+</div>
+
+</div>
+
+## Data Source Option
+
+Data source options of XML can be set via:
+
+* the `.option`/`.options` methods of
+    * `DataFrameReader`
+    * `DataFrameWriter`
+    * `DataStreamReader`
+    * `DataStreamWriter`
+* the built-in functions below
+    * `from_xml`
+    * `to_xml`
+    * `schema_of_xml`
+* `OPTIONS` clause at [CREATE TABLE USING DATA_SOURCE](sql-ref-syntax-ddl-create-table-datasource.html)
+
+<table class="table table-striped">
+  <thead><tr><th><b>Property Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th><th><b>Scope</b></th></tr></thead>
+  <tr>
+    <td><code>rowTag</code></td>
+    <td><code>ROW</code></td>

Review Comment:
   done





Re: [PR] [SPARK-44752][SQL] XML: Update Spark Docs [spark]

Posted by "sandip-db (via GitHub)" <gi...@apache.org>.
sandip-db commented on PR #43350:
URL: https://github.com/apache/spark/pull/43350#issuecomment-1771287294

   > It seems that XML is not yet supported in PySpark. I wrote an XML example modeled on the JSON one, but it failed when I ran it with PySpark. @sandip-db
   
   Yes, I am working on `DataFrameReader.xml`. For now, change `.xml` in your Python code to `.format("xml").load`.
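   A minimal, self-contained sketch of that workaround (the app name and output comments are assumptions; the file path is the `people.xml` sample added by this PR, and the inferred schema matches the Scala example quoted later in this thread):
   ```python
   from pyspark.sql import SparkSession

   spark = SparkSession.builder.appName("xml-workaround").getOrCreate()

   # DataFrameReader.xml() is not available in PySpark yet, so go through the
   # generic source API; rowTag still selects the XML element mapped to a row.
   df = (spark.read
         .format("xml")
         .option("rowTag", "person")
         .load("examples/src/main/resources/people.xml"))
   df.printSchema()
   # root
   #  |-- age: long (nullable = true)
   #  |-- name: string (nullable = true)
   ```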




Re: [PR] [SPARK-44752][SQL] XML: Update Spark Docs [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #43350:
URL: https://github.com/apache/spark/pull/43350#issuecomment-1776764308

   Merged to master.




Re: [PR] [WIP][SPARK-44752][SQL] XML: Update Spark Docs [spark]

Posted by "laglangyue (via GitHub)" <gi...@apache.org>.
laglangyue commented on PR #43350:
URL: https://github.com/apache/spark/pull/43350#issuecomment-1765720876

   Thank you very much for your meticulous and rigorous review. I have built the docs locally and run the Scala and Java examples; apart from some delays caused by Java 17 along the way, everything looks good. I did not run the Python example because I have not used PySpark yet. Additionally, I found that the license check for people.xml is failing in CI, and I don't know how to fix it. @HyukjinKwon @sandip-db 




Re: [PR] [SPARK-44752][SQL] XML: Update Spark Docs [spark]

Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on code in PR #43350:
URL: https://github.com/apache/spark/pull/43350#discussion_r1366363198


##########
docs/sql-data-sources-xml.md:
##########
@@ -0,0 +1,232 @@
+---
+layout: global
+title: XML Files
+displayTitle: XML Files
+license: |
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+  
+       http://www.apache.org/licenses/LICENSE-2.0
+  
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+---
+
+Spark SQL provides `spark.read().xml("file_1_path","file_2_path")` to read a file or directory of files in XML format into a Spark DataFrame, and `dataframe.write().xml("path")` to write to a xml file. When reading a XML file, the `rowTag` option must be specified to indicate the XML element that maps to a `DataFrame row`. The option() function can be used to customize the behavior of reading or writing, such as controlling behavior of the XML attributes, XSD validation, compression, and so on.

Review Comment:
   If we make the `rowTag` option required everywhere in the future, please ignore this comment.





Re: [PR] [SPARK-44752][SQL] XML: Update Spark Docs [spark]

Posted by "laglangyue (via GitHub)" <gi...@apache.org>.
laglangyue commented on PR #43350:
URL: https://github.com/apache/spark/pull/43350#issuecomment-1770418059

   It seems that XML is not yet supported in PySpark. I wrote an XML example modeled on the JSON one, but it failed when I ran it with PySpark.
   @sandip-db 




Re: [PR] [WIP][SPARK-44752][SQL] XML: Update Spark Docs [spark]

Posted by "sandip-db (via GitHub)" <gi...@apache.org>.
sandip-db commented on code in PR #43350:
URL: https://github.com/apache/spark/pull/43350#discussion_r1361509974


##########
docs/sql-data-sources-xml.md:
##########
@@ -0,0 +1,232 @@
+---
+layout: global
+title: XML Files
+displayTitle: XML Files
+license: |
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+  
+       http://www.apache.org/licenses/LICENSE-2.0
+  
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+---
+
+Spark SQL provides `spark.read().xml("file_1_path","file_2_path")` to read a file or directory of files in XML format into a Spark DataFrame, and `dataframe.write().xml("path")` to write to a xml file. When reading a XML file, the `rowTag` option need to be specified to indicate the XML element that maps to a `DataFrame row`. The option() function can be used to customize the behavior of reading or writing, such as controlling behavior of the XML attributes, XSD validation, compression, and so on.

Review Comment:
   ```suggestion
   Spark SQL provides `spark.read().xml("file_1_path","file_2_path")` to read a file or directory of files in XML format into a Spark DataFrame, and `dataframe.write().xml("path")` to write to a xml file. When reading a XML file, the `rowTag` option must be specified to indicate the XML element that maps to a `DataFrame row`. The option() function can be used to customize the behavior of reading or writing, such as controlling behavior of the XML attributes, XSD validation, compression, and so on.
   ```



##########
docs/sql-data-sources-xml.md:
##########
@@ -0,0 +1,232 @@
+---
+layout: global
+title: XML Files
+displayTitle: XML Files
+license: |
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+  
+       http://www.apache.org/licenses/LICENSE-2.0
+  
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+---
+
+Spark SQL provides `spark.read().xml("file_1_path","file_2_path")` to read a file or directory of files in XML format into a Spark DataFrame, and `dataframe.write().xml("path")` to write to a xml file. When reading a XML file, the `rowTag` option need to be specified to indicate the XML element that maps to a `DataFrame row`. The option() function can be used to customize the behavior of reading or writing, such as controlling behavior of the XML attributes, XSD validation, compression, and so on.
+
+<div class="codetabs">
+
+<div data-lang="python"  markdown="1">
+{% include_example xml_dataset python/sql/datasource.py %}
+</div>
+
+<div data-lang="scala"  markdown="1">
+{% include_example xml_dataset scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
+</div>
+
+<div data-lang="java"  markdown="1">
+{% include_example xml_dataset java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
+</div>
+
+</div>
+
+## Data Source Option
+
+Data source options of XML can be set via:
+
+* the `.option`/`.options` methods of
+    * `DataFrameReader`
+    * `DataFrameWriter`
+    * `DataStreamReader`
+    * `DataStreamWriter`
+* the built-in functions below
+    * `from_xml`
+    * `to_xml`
+    * `schema_of_xml`
+* `OPTIONS` clause at [CREATE TABLE USING DATA_SOURCE](sql-ref-syntax-ddl-create-table-datasource.html)
+
+<table class="table table-striped">
+  <thead><tr><th><b>Property Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th><th><b>Scope</b></th></tr></thead>
+  <tr>
+    <td><code>rowTag</code></td>
+    <td><code>ROW</code></td>
+    <td>The row tag of your xml files to treat as a row. For example, in this xml: 
+        <code><xmp><books><book></book>...</books></xmp></code>
+        the appropriate value would be book.

Review Comment:
   ```suggestion
           the appropriate value would be book. This is a required option.
   ```





Re: [PR] [WIP][SPARK-44752][SQL] XML: Update Spark Docs [spark]

Posted by "laglangyue (via GitHub)" <gi...@apache.org>.
laglangyue commented on code in PR #43350:
URL: https://github.com/apache/spark/pull/43350#discussion_r1357952176


##########
pom.xml:
##########
@@ -283,6 +283,7 @@
       Overridable test home. So that you can call individual pom files directly without
       things breaking.
     -->
+    <session.executionRootDirectory>/tmp</session.executionRootDirectory>

Review Comment:
   I added it during local testing and forgot to roll it back.





Re: [PR] [SPARK-44752][SQL] XML: Update Spark Docs [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon closed pull request #43350: [SPARK-44752][SQL] XML: Update Spark Docs
URL: https://github.com/apache/spark/pull/43350




Re: [PR] [SPARK-44752][SQL] XML: Update Spark Docs [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #43350:
URL: https://github.com/apache/spark/pull/43350#discussion_r1356646393


##########
docs/sql-data-sources-xml.md:
##########
@@ -0,0 +1,222 @@
+---
+layout: global
+title: XML Files
+displayTitle: XML Files
+license: |
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to You under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+---
+
+Spark SQL provides spark.read().xml("file_1_path","file_2_path") to read one or more xml files into a Spark DataFrame, and dataframe.write().xml("

Review Comment:
   ```suggestion
   Spark SQL provides `spark.read().xml("file_1_path","file_2_path")` to read one or more xml files into a Spark DataFrame, and dataframe.write().xml("
   ```
   
   I think you should format the code snippets as Markdown code blocks






Re: [PR] [SPARK-44752][SQL] XML: Update Spark Docs [spark]

Posted by "laglangyue (via GitHub)" <gi...@apache.org>.
laglangyue commented on PR #43350:
URL: https://github.com/apache/spark/pull/43350#issuecomment-1772896050

   `./build/mvn -pl :spark-sql_2.13 clean compile`
   ![image](https://github.com/apache/spark/assets/35491928/d2a20964-2514-4931-a697-05abfdb2c829)
   It seems the constructor of `XmlOptions` is ambiguous.
   @sandip-db 




Re: [PR] [WIP][SPARK-44752][SQL] XML: Update Spark Docs [spark]

Posted by "sandip-db (via GitHub)" <gi...@apache.org>.
sandip-db commented on code in PR #43350:
URL: https://github.com/apache/spark/pull/43350#discussion_r1358449841


##########
examples/src/main/resources/people.xml:
##########
@@ -0,0 +1,15 @@
+<?xml version="1.0"?>
+<people>
+    <person>
+        <name>Michael</name>
+        <age>29</age>
+    </person>
+    <person>
+        <name>Andy</name>
+        <age>30</age>
+    </person>
+    <person>
+        <name>Justin</name>
+        <age>19</age>
+    </person>
+</people>

Review Comment:
   ```suggestion
   </people>
   
   ```



##########
docs/sql-data-sources-xml.md:
##########
@@ -0,0 +1,224 @@
+---
+layout: global
+title: XML Files
+displayTitle: XML Files
+license: |
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to You under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+---
+
+Spark SQL provides `spark.read().xml("file_1_path","file_2_path")` to read a file or directory of files in XML format into a Spark DataFrame,
+and `dataframe.write().xml("path")` to write to a xml file.
+When reading a XML file, the `rowTag` option need to be specified to indicate the XML element that maps to a `DataFrame row`. The option() function
+can be used to customize the behavior of reading or writing, such as controlling behavior of the XML attributes, XSD validation, compression, and so
+on.
+
+<div class="codetabs">
+
+<div data-lang="scala"  markdown="1">
+{% include_example xml_dataset scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
+</div>
+
+<div data-lang="java"  markdown="1">
+{% include_example xml_dataset java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
+</div>
+
+</div>
+
+## Data Source Option
+
+Data source options of XML can be set via:
+
+* the `.option`/`.options` methods of
+    * `DataFrameReader`
+    * `DataFrameWriter`
+    * `DataStreamReader`
+    * `DataStreamWriter`
+* the built-in functions below
+    * `from_xml`
+    * `to_xml`
+    * `schema_of_xml`
+* `OPTIONS` clause at [CREATE TABLE USING DATA_SOURCE](sql-ref-syntax-ddl-create-table-datasource.html)
+
+<table class="table table-striped">
+  <thead><tr><th><b>Property Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th><th><b>Scope</b></th></tr></thead>
+  <tr>
+    <td><code>rowTag</code></td>
+    <td>ROW</td>
+    <td>The row tag of your xml files to treat as a row. For example, in this xml: <code><books> <book></book> ...</books></code> the appropriate value would be book.</td>
+    <td>read</td>
+  </tr>
+
+  <tr>
+    <td><code>samplingRatio</code></td>
+    <td><code>1.0</code></td>
+    <td>Defines fraction of rows used for schema inferring. XML built-in functions ignore this option.</td>
+    <td>read</td>
+  </tr>
+
+  <tr>
+    <td><code>excludeAttribute</code></td>
+    <td><code>false</code></td>
+    <td>Whether to exclude attributes in elements.</td>
+    <td>read</td>
+  </tr>
+
+  <tr>
+    <td><code>mode</code></td>
+    <td><code>PERMISSIVE</code></td>
+    <td>Allows a mode for dealing with corrupt records during parsing.<br>
+    <ul>
+      <li><code>PERMISSIVE</code>: when it meets a corrupted record, puts the malformed string into a field configured by <code>columnNameOfCorruptRecord</code>, and sets malformed fields to <code>null</code>. To keep corrupt records, an user can set a string type field named <code>columnNameOfCorruptRecord</code> in an user-defined schema. If a schema does not have the field, it drops corrupt records during parsing. When inferring a schema, it implicitly adds a <code>columnNameOfCorruptRecord</code> field in an output schema.</li>
+      <li><code>DROPMALFORMED</code>: ignores the whole corrupted records. This mode is unsupported in the JSON built-in functions.</li>
+      <li><code>FAILFAST</code>: throws an exception when it meets corrupted records.</li>
+    </ul>
+    </td>
+    <td>read</td>
+  </tr>
+
+  <tr>
+      <td><code>inferSchema</code></td>
+      <td>true</td>
+      <td>If true, attempts to infer an appropriate type for each resulting DataFrame column. If false, all resulting columns are of string type. Default is true. XML built-in functions ignore this option.</td>
+      <td>read</td>
+  </tr>
+
+  <tr>
+      <td><code>columnNameOfCorruptRecord</code></td>
+      <td><code>spark.sql.columnNameOfCorruptRecord</code></td>
+      <td>Allows renaming the new field having a malformed string created by PERMISSIVE mode.</td>
+      <td>read</td>
+  </tr>
+
+  <tr>
+    <td><code>attributePrefix</code></td>
+    <td>_</td>
+    <td>The prefix for attributes to differentiate attributes from elements. This will be the prefix for field names. Default is _. Can be empty for reading XML, but not for writing.</td>
+    <td>read/write</td>
+  </tr>
+
+  <tr>
+    <td><code>valueTag</code></td>
+    <td>_VALUE</td>
+    <td>The tag used for the value when there are attributes in the element having no child.</td>
+    <td>read/write</td>
+  </tr>
+
+  <tr>
+    <td><code>encoding</code></td>
+    <td><code>UTF-8</code></td>
+    <td>For reading, decodes the XML files by the given encoding type. For writing, specifies encoding (charset) of saved XML files. XML built-in functions ignore this option. </td>
+    <td>read/write</td>
+  </tr>
+
+  <tr>
+    <td><code>ignoreSurroundingSpaces</code></td>
+    <td>false</td>
+    <td>Defines whether surrounding whitespaces from values being read should be skipped.</td>
+    <td>read</td>
+  </tr>
+
+  <tr>
+      <td><code>rowValidationXSDPath</code></td>
+      <td>null</td>
+      <td>Path to an optional XSD file that is used to validate the XML for each row individually. Rows that fail to validate are treated like parse errors as above. The XSD does not otherwise affect the schema provided, or inferred.</td>
+      <td>read</td>
+  </tr>
+
+  <tr>
+      <td><code>ignoreNamespace</code></td>
+      <td>false</td>
+      <td>If true, namespaces prefixes on XML elements and attributes are ignored. Tags &lt;abc:author> and &lt;def:author> would, for example, be treated as if both are just &lt;author>. Note that, at the moment, namespaces cannot be ignored on the rowTag element, only its children. Note that XML parsing is in general not namespace-aware even if false.</td>
+      <td>read</td>
+  </tr>
+
+  <tr>
+    <td><code>timeZone</code></td>
+    <td>(value of <code>spark.sql.session.timeZone</code> configuration)</td>
+    <td>Sets the string that indicates a time zone ID to be used to format timestamps in the JSON datasources or partition values. The following formats of <code>timeZone</code> are supported:<br>
+    <ul>
+      <li>Region-based zone ID: It should have the form 'area/city', such as 'America/Los_Angeles'.</li>
+      <li>Zone offset: It should be in the format '(+|-)HH:mm', for example '-08:00' or '+01:00'. Also 'UTC' and 'Z' are supported as aliases of '+00:00'.</li>
+    </ul>
+    Other short names like 'CST' are not recommended to use because they can be ambiguous.
+    </td>
+    <td>read/write</td>
+  </tr>
+
+  <tr>
+    <td><code>timestampFormat</code></td>
+    <td><code>yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]</code></td>
+    <td>Custom timestamp format string that follows the datetime pattern format. This applies to timestamp type.</td>

Review Comment:
   ```suggestion
       <td>Sets the string that indicates a timestamp format. Custom date formats follow the formats at <a href="https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html"> datetime pattern</a>. This applies to timestamp type.</td>
   ```
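   For thread readers, a small sketch of passing these datetime options, reusing the `spark` session from the earlier Python sketch (the path is hypothetical; the pattern strings are the defaults from the table above, and both follow the Spark datetime pattern spec the suggestion links to):
   ```python
   # Custom datetime patterns applied on read.
   df = (spark.read
         .format("xml")
         .option("rowTag", "person")
         .option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]")
         .option("dateFormat", "yyyy-MM-dd")
         .load("path/to/data.xml"))  # hypothetical input path
   ```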



##########
docs/sql-data-sources-xml.md:
##########
@@ -0,0 +1,224 @@
+---
+layout: global
+title: XML Files
+displayTitle: XML Files
+license: |
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to You under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+---
+
+Spark SQL provides `spark.read().xml("file_1_path","file_2_path")` to read a file or directory of files in XML format into a Spark DataFrame,
+and `dataframe.write().xml("path")` to write to a xml file.
+When reading a XML file, the `rowTag` option need to be specified to indicate the XML element that maps to a `DataFrame row`. The option() function
+can be used to customize the behavior of reading or writing, such as controlling behavior of the XML attributes, XSD validation, compression, and so
+on.
+
+<div class="codetabs">
+
+<div data-lang="scala"  markdown="1">
+{% include_example xml_dataset scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
+</div>
+
+<div data-lang="java"  markdown="1">
+{% include_example xml_dataset java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
+</div>
+
+</div>
+
+## Data Source Option
+
+Data source options of XML can be set via:
+
+* the `.option`/`.options` methods of
+    * `DataFrameReader`
+    * `DataFrameWriter`
+    * `DataStreamReader`
+    * `DataStreamWriter`
+* the built-in functions below
+    * `from_xml`
+    * `to_xml`
+    * `schema_of_xml`
+* `OPTIONS` clause at [CREATE TABLE USING DATA_SOURCE](sql-ref-syntax-ddl-create-table-datasource.html)
+
+<table class="table table-striped">
+  <thead><tr><th><b>Property Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th><th><b>Scope</b></th></tr></thead>
+  <tr>
+    <td><code>rowTag</code></td>
+    <td>ROW</td>
+    <td>The row tag of your xml files to treat as a row. For example, in this xml: <code><books> <book></book> ...</books></code> the appropriate value would be book.</td>
+    <td>read</td>
+  </tr>
+
+  <tr>
+    <td><code>samplingRatio</code></td>
+    <td><code>1.0</code></td>
+    <td>Defines fraction of rows used for schema inferring. XML built-in functions ignore this option.</td>
+    <td>read</td>
+  </tr>
+
+  <tr>
+    <td><code>excludeAttribute</code></td>
+    <td><code>false</code></td>
+    <td>Whether to exclude attributes in elements.</td>
+    <td>read</td>
+  </tr>
+
+  <tr>
+    <td><code>mode</code></td>
+    <td><code>PERMISSIVE</code></td>
+    <td>Allows a mode for dealing with corrupt records during parsing.<br>
+    <ul>
+      <li><code>PERMISSIVE</code>: when it meets a corrupted record, puts the malformed string into a field configured by <code>columnNameOfCorruptRecord</code>, and sets malformed fields to <code>null</code>. To keep corrupt records, an user can set a string type field named <code>columnNameOfCorruptRecord</code> in an user-defined schema. If a schema does not have the field, it drops corrupt records during parsing. When inferring a schema, it implicitly adds a <code>columnNameOfCorruptRecord</code> field in an output schema.</li>
+      <li><code>DROPMALFORMED</code>: ignores the whole corrupted records. This mode is unsupported in the JSON built-in functions.</li>
+      <li><code>FAILFAST</code>: throws an exception when it meets corrupted records.</li>
+    </ul>
+    </td>
+    <td>read</td>
+  </tr>
+
+  <tr>
+      <td><code>inferSchema</code></td>
+      <td>true</td>
+      <td>If true, attempts to infer an appropriate type for each resulting DataFrame column. If false, all resulting columns are of string type. Default is true. XML built-in functions ignore this option.</td>
+      <td>read</td>
+  </tr>
+
+  <tr>
+      <td><code>columnNameOfCorruptRecord</code></td>
+      <td><code>spark.sql.columnNameOfCorruptRecord</code></td>
+      <td>Allows renaming the new field having a malformed string created by PERMISSIVE mode.</td>
+      <td>read</td>
+  </tr>
+
+  <tr>
+    <td><code>attributePrefix</code></td>
+    <td>_</td>
+    <td>The prefix for attributes to differentiate attributes from elements. This will be the prefix for field names. Default is _. Can be empty for reading XML, but not for writing.</td>
+    <td>read/write</td>
+  </tr>
+
+  <tr>
+    <td><code>valueTag</code></td>
+    <td>_VALUE</td>
+    <td>The tag used for the value when there are attributes in the element having no child.</td>
+    <td>read/write</td>
+  </tr>
+
+  <tr>
+    <td><code>encoding</code></td>
+    <td><code>UTF-8</code></td>
+    <td>For reading, decodes the XML files by the given encoding type. For writing, specifies encoding (charset) of saved XML files. XML built-in functions ignore this option. </td>
+    <td>read/write</td>
+  </tr>
+
+  <tr>
+    <td><code>ignoreSurroundingSpaces</code></td>
+    <td>false</td>
+    <td>Defines whether surrounding whitespaces from values being read should be skipped.</td>
+    <td>read</td>
+  </tr>
+
+  <tr>
+      <td><code>rowValidationXSDPath</code></td>
+      <td>null</td>
+      <td>Path to an optional XSD file that is used to validate the XML for each row individually. Rows that fail to validate are treated like parse errors as above. The XSD does not otherwise affect the schema provided, or inferred.</td>
+      <td>read</td>
+  </tr>
+
+  <tr>
+      <td><code>ignoreNamespace</code></td>
+      <td>false</td>
+      <td>If true, namespaces prefixes on XML elements and attributes are ignored. Tags &lt;abc:author> and &lt;def:author> would, for example, be treated as if both are just &lt;author>. Note that, at the moment, namespaces cannot be ignored on the rowTag element, only its children. Note that XML parsing is in general not namespace-aware even if false.</td>
+      <td>read</td>
+  </tr>
+
+  <tr>
+    <td><code>timeZone</code></td>
+    <td>(value of <code>spark.sql.session.timeZone</code> configuration)</td>
+    <td>Sets the string that indicates a time zone ID to be used to format timestamps in the JSON datasources or partition values. The following formats of <code>timeZone</code> are supported:<br>
+    <ul>
+      <li>Region-based zone ID: It should have the form 'area/city', such as 'America/Los_Angeles'.</li>
+      <li>Zone offset: It should be in the format '(+|-)HH:mm', for example '-08:00' or '+01:00'. Also 'UTC' and 'Z' are supported as aliases of '+00:00'.</li>
+    </ul>
+    Other short names like 'CST' are not recommended to use because they can be ambiguous.
+    </td>
+    <td>read/write</td>
+  </tr>
+
+  <tr>
+    <td><code>timestampFormat</code></td>
+    <td><code>yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]</code></td>
+    <td>Custom timestamp format string that follows the datetime pattern format. This applies to timestamp type.</td>
+    <td>read/write</td>
+  </tr>
+
+  <tr>
+    <td><code>dateFormat</code></td>
+    <td><code>yyyy-MM-dd</code></td>
+    <td>Custom date format string that follows the datetime pattern format. This applies to date type.</td>
+    <td>read/write</td>
+  </tr>
+
+  <tr>
+    <td><code>locale</code></td>
+    <td><code>en-US</code></td>
+    <td>Sets a locale as a language tag in IETF BCP 47 format. For instance, locale is used while parsing dates and timestamps. </td>
+    <td>read/write</td>
+  </tr>
+
+  <tr>
+      <td><code>rootTag</code></td>
+      <td>ROWS</td>
+      <td>Root tag of the xml files. For example, in <code><books> <book></book> ...</books></code>, the appropriate value would be books. It can include basic attributes by specifying a value like books foo="bar".</td>
+      <td>read</td>
+  </tr>
+
+  <tr>
+      <td><code>declaration</code></td>
+      <td><code>version="1.0" encoding="UTF-8" standalone="yes"</code></td>
+      <td>Content of XML declaration to write at the start of every output XML file, before the rootTag. For example, a value of foo causes <?xml foo?> to be written. Set to empty string to suppress</td>
+      <td>write</td>
+  </tr>
+
+  <tr>
+    <td><code>arrayElementName</code></td>
+    <td>item</td>
+    <td>Name of XML element that encloses each element of an array-valued column when writing.</td>
+    <td>write</td>
+  </tr>
+
+  <tr>
+    <td><code>nullValue</code></td>
+    <td>null</td>
+    <td>Sets the string representation of a null value. Default is string null. When this is null, it does not write attributes and elements for fields.</td>
+    <td>read</td>

Review Comment:
   ```suggestion
       <td>read/write</td>
   ```
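   A sketch of the read/write symmetry this fix documents, reusing `spark` and `df` from the sketches above (paths and the "N/A" sentinel are assumptions; the generic writer path is assumed to work since the doc above documents `dataframe.write().xml`):
   ```python
   # On read, fields equal to nullValue come back as null; on write, null
   # fields are emitted as nullValue.
   parsed = (spark.read.format("xml")
             .option("rowTag", "person")
             .option("nullValue", "N/A")
             .load("path/to/in.xml"))
   (df.write.format("xml")
      .option("rowTag", "person")
      .option("nullValue", "N/A")
      .save("path/to/out_xml"))
   ```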



##########
examples/src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java:
##########
@@ -101,14 +103,15 @@ public static void main(String[] args) {
       .config("spark.some.config.option", "some-value")
       .getOrCreate();
 
-    runBasicDataSourceExample(spark);
-    runGenericFileSourceOptionsExample(spark);
-    runBasicParquetExample(spark);
-    runParquetSchemaMergingExample(spark);
-    runJsonDatasetExample(spark);
-    runCsvDatasetExample(spark);
-    runTextDatasetExample(spark);
-    runJdbcDatasetExample(spark);
+//    runBasicDataSourceExample(spark);
+//    runGenericFileSourceOptionsExample(spark);
+//    runBasicParquetExample(spark);
+//    runParquetSchemaMergingExample(spark);
+//    runJsonDatasetExample(spark);
+//    runCsvDatasetExample(spark);
+//    runTextDatasetExample(spark);
+//    runJdbcDatasetExample(spark);

Review Comment:
   uncomment



##########
examples/src/main/scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala:
##########
@@ -418,4 +419,54 @@ object SQLDataSourceExample {
       .jdbc("jdbc:postgresql:dbserver", "schema.tablename", connectionProperties)
     // $example off:jdbc_dataset$
   }
+
+  private def runXmlDatasetExample(spark: SparkSession): Unit = {
+    // $example on:xml_dataset$
+    // Primitive types (Int, String, etc) and Product types (case classes) encoders are
+    // supported by importing this when creating a Dataset.
+    import spark.implicits._
+    // An XML dataset is pointed to by path.
+    // The path can be either a single xml file or more xml files
+    val path = "examples/src/main/resources/people.xml"
+    val peopleDF = spark.read.option("rowTag", "person").xml(path)
+
+    // The inferred schema can be visualized using the printSchema() method
+    peopleDF.printSchema()
+    // root
+    //  |-- age: long (nullable = true)
+    //  |-- name: string (nullable = true)
+
+    // Creates a temporary view using the DataFrame
+    peopleDF.createOrReplaceTempView("people")
+
+    // SQL statements can be run by using the sql methods provided by spark
+    val teenagerNamesDF = spark.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19")
+    teenagerNamesDF.show()
+    // +------+
+    // |  name|
+    // +------+
+    // |Justin|
+    // +------+
+
+    // Alternatively, a DataFrame can be created for a XML dataset represented by a Dataset[String]
+    val otherPeopleDataset = spark.createDataset(
+      """
+        |<person>
+        |    <name>laglangyue</name>
+        |    <job>Developer</job>
+        |    <age>28</age>
+        |</person>
+        |""".stripMargin :: Nil)
+    val otherPeople = spark.read
+      .option("rootTag", "people")

Review Comment:
   remove this line
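   For context, a sketch of the read after the suggested fix, with `rootTag` dropped since it is a write-side option (the path is hypothetical; the Scala example above parses an in-memory XML string instead):
   ```python
   other_people = (spark.read.format("xml")
                   .option("rowTag", "person")
                   .load("path/to/other_people.xml"))
   ```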



##########
docs/sql-data-sources-xml.md:
##########
@@ -0,0 +1,224 @@
+---
+layout: global
+title: XML Files
+displayTitle: XML Files
+license: |
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to You under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+---
+
+Spark SQL provides `spark.read().xml("file_1_path","file_2_path")` to read a file or directory of files in XML format into a Spark DataFrame,
+and `dataframe.write().xml("path")` to write to a xml file.
+When reading a XML file, the `rowTag` option need to be specified to indicate the XML element that maps to a `DataFrame row`. The option() function
+can be used to customize the behavior of reading or writing, such as controlling behavior of the XML attributes, XSD validation, compression, and so
+on.
+
+<div class="codetabs">
+
+<div data-lang="scala"  markdown="1">
+{% include_example xml_dataset scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
+</div>
+
+<div data-lang="java"  markdown="1">
+{% include_example xml_dataset java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
+</div>
+
+</div>
+
+## Data Source Option
+
+Data source options of XML can be set via:
+
+* the `.option`/`.options` methods of
+    * `DataFrameReader`
+    * `DataFrameWriter`
+    * `DataStreamReader`
+    * `DataStreamWriter`
+* the built-in functions below
+    * `from_xml`
+    * `to_xml`
+    * `schema_of_xml`
+* `OPTIONS` clause at [CREATE TABLE USING DATA_SOURCE](sql-ref-syntax-ddl-create-table-datasource.html)
+
+<table class="table table-striped">
+  <thead><tr><th><b>Property Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th><th><b>Scope</b></th></tr></thead>
+  <tr>
+    <td><code>rowTag</code></td>
+    <td>ROW</td>
+    <td>The row tag of your xml files to treat as a row. For example, in this xml: <code><books> <book></book> ...</books></code> the appropriate value would be book.</td>
+    <td>read</td>
+  </tr>
+
+  <tr>
+    <td><code>samplingRatio</code></td>
+    <td><code>1.0</code></td>
+    <td>Defines fraction of rows used for schema inferring. XML built-in functions ignore this option.</td>
+    <td>read</td>
+  </tr>
+
+  <tr>
+    <td><code>excludeAttribute</code></td>
+    <td><code>false</code></td>
+    <td>Whether to exclude attributes in elements.</td>
+    <td>read</td>
+  </tr>
+
+  <tr>
+    <td><code>mode</code></td>
+    <td><code>PERMISSIVE</code></td>
+    <td>Allows a mode for dealing with corrupt records during parsing.<br>
+    <ul>
+      <li><code>PERMISSIVE</code>: when it meets a corrupted record, puts the malformed string into a field configured by <code>columnNameOfCorruptRecord</code>, and sets malformed fields to <code>null</code>. To keep corrupt records, an user can set a string type field named <code>columnNameOfCorruptRecord</code> in an user-defined schema. If a schema does not have the field, it drops corrupt records during parsing. When inferring a schema, it implicitly adds a <code>columnNameOfCorruptRecord</code> field in an output schema.</li>
+      <li><code>DROPMALFORMED</code>: ignores the whole corrupted records. This mode is unsupported in the JSON built-in functions.</li>
+      <li><code>FAILFAST</code>: throws an exception when it meets corrupted records.</li>
+    </ul>
+    </td>
+    <td>read</td>
+  </tr>
+
+  <tr>
+      <td><code>inferSchema</code></td>
+      <td>true</td>
+      <td>If true, attempts to infer an appropriate type for each resulting DataFrame column. If false, all resulting columns are of string type. Default is true. XML built-in functions ignore this option.</td>
+      <td>read</td>
+  </tr>
+
+  <tr>
+      <td><code>columnNameOfCorruptRecord</code></td>
+      <td><code>spark.sql.columnNameOfCorruptRecord</code></td>
+      <td>Allows renaming the new field having a malformed string created by PERMISSIVE mode.</td>
+      <td>read</td>
+  </tr>
+
+  <tr>
+    <td><code>attributePrefix</code></td>
+    <td>_</td>
+    <td>The prefix for attributes to differentiate attributes from elements. This will be the prefix for field names. Default is _. Can be empty for reading XML, but not for writing.</td>
+    <td>read/write</td>
+  </tr>
+
+  <tr>
+    <td><code>valueTag</code></td>
+    <td>_VALUE</td>
+    <td>The tag used for the value when there are attributes in the element having no child.</td>
+    <td>read/write</td>
+  </tr>
+
+  <tr>
+    <td><code>encoding</code></td>
+    <td><code>UTF-8</code></td>
+    <td>For reading, decodes the XML files by the given encoding type. For writing, specifies encoding (charset) of saved XML files. XML built-in functions ignore this option. </td>
+    <td>read/write</td>
+  </tr>
+
+  <tr>
+    <td><code>ignoreSurroundingSpaces</code></td>
+    <td>false</td>
+    <td>Defines whether surrounding whitespaces from values being read should be skipped.</td>
+    <td>read</td>
+  </tr>
+
+  <tr>
+      <td><code>rowValidationXSDPath</code></td>
+      <td>null</td>
+      <td>Path to an optional XSD file that is used to validate the XML for each row individually. Rows that fail to validate are treated like parse errors as above. The XSD does not otherwise affect the schema provided, or inferred.</td>
+      <td>read</td>
+  </tr>
+
+  <tr>
+      <td><code>ignoreNamespace</code></td>
+      <td>false</td>
+      <td>If true, namespaces prefixes on XML elements and attributes are ignored. Tags &lt;abc:author> and &lt;def:author> would, for example, be treated as if both are just &lt;author>. Note that, at the moment, namespaces cannot be ignored on the rowTag element, only its children. Note that XML parsing is in general not namespace-aware even if false.</td>
+      <td>read</td>
+  </tr>
+
+  <tr>
+    <td><code>timeZone</code></td>
+    <td>(value of <code>spark.sql.session.timeZone</code> configuration)</td>
+    <td>Sets the string that indicates a time zone ID to be used to format timestamps in the JSON datasources or partition values. The following formats of <code>timeZone</code> are supported:<br>
+    <ul>
+      <li>Region-based zone ID: It should have the form 'area/city', such as 'America/Los_Angeles'.</li>
+      <li>Zone offset: It should be in the format '(+|-)HH:mm', for example '-08:00' or '+01:00'. Also 'UTC' and 'Z' are supported as aliases of '+00:00'.</li>
+    </ul>
+    Other short names like 'CST' are not recommended to use because they can be ambiguous.
+    </td>
+    <td>read/write</td>
+  </tr>
+
+  <tr>
+    <td><code>timestampFormat</code></td>
+    <td><code>yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]</code></td>
+    <td>Custom timestamp format string that follows the datetime pattern format. This applies to timestamp type.</td>
+    <td>read/write</td>
+  </tr>
+
+  <tr>
+    <td><code>dateFormat</code></td>
+    <td><code>yyyy-MM-dd</code></td>
+    <td>Custom date format string that follows the datetime pattern format. This applies to date type.</td>
+    <td>read/write</td>
+  </tr>
+
+  <tr>
+    <td><code>locale</code></td>
+    <td><code>en-US</code></td>
+    <td>Sets a locale as a language tag in IETF BCP 47 format. For instance, locale is used while parsing dates and timestamps. </td>
+    <td>read/write</td>
+  </tr>
+
+  <tr>
+      <td><code>rootTag</code></td>
+      <td>ROWS</td>
+      <td>Root tag of the xml files. For example, in <code><books> <book></book> ...</books></code>, the appropriate value would be books. It can include basic attributes by specifying a value like books foo="bar".</td>
+      <td>read</td>

Review Comment:
   ```suggestion
         <td>write</td>
   ```
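   A short sketch of where `rootTag` does apply, reusing `df` from the earlier sketches (output path is hypothetical; option values come from the table above):
   ```python
   # rootTag (like declaration) takes effect only when writing XML.
   (df.write.format("xml")
      .option("rowTag", "person")
      .option("rootTag", "people")
      .save("path/to/people_xml"))
   ```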



##########
docs/sql-data-sources-xml.md:
##########
@@ -0,0 +1,224 @@
+---
+layout: global
+title: XML Files
+displayTitle: XML Files
+license: |
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to You under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+---
+
+Spark SQL provides `spark.read().xml("file_1_path","file_2_path")` to read a file or directory of files in XML format into a Spark DataFrame,
+and `dataframe.write().xml("path")` to write to an XML file.
+When reading an XML file, the `rowTag` option needs to be specified to indicate the XML element that maps to a `DataFrame row`. The option() function
+can be used to customize the behavior of reading or writing, such as controlling behavior of the XML attributes, XSD validation, compression, and so
+on.
+
+<div class="codetabs">
+
+<div data-lang="scala"  markdown="1">
+{% include_example xml_dataset scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
+</div>
+
+<div data-lang="java"  markdown="1">
+{% include_example xml_dataset java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
+</div>
+
+</div>
+
+## Data Source Option
+
+Data source options of XML can be set via:
+
+* the `.option`/`.options` methods of
+    * `DataFrameReader`
+    * `DataFrameWriter`
+    * `DataStreamReader`
+    * `DataStreamWriter`
+* the built-in functions below
+    * `from_xml`
+    * `to_xml`
+    * `schema_of_xml`
+* `OPTIONS` clause at [CREATE TABLE USING DATA_SOURCE](sql-ref-syntax-ddl-create-table-datasource.html)
+
+<table class="table table-striped">
+  <thead><tr><th><b>Property Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th><th><b>Scope</b></th></tr></thead>
+  <tr>
+    <td><code>rowTag</code></td>
+    <td>ROW</td>
+    <td>The row tag of your XML files to treat as a row. For example, in this xml: <code>&lt;books&gt; &lt;book&gt;&lt;/book&gt; ...&lt;/books&gt;</code> the appropriate value would be book.</td>
+    <td>read</td>
+  </tr>
+
+  <tr>
+    <td><code>samplingRatio</code></td>
+    <td><code>1.0</code></td>
+    <td>Defines fraction of rows used for schema inferring. XML built-in functions ignore this option.</td>
+    <td>read</td>
+  </tr>
+
+  <tr>
+    <td><code>excludeAttribute</code></td>
+    <td><code>false</code></td>
+    <td>Whether to exclude attributes in elements.</td>
+    <td>read</td>
+  </tr>
+
+  <tr>
+    <td><code>mode</code></td>
+    <td><code>PERMISSIVE</code></td>
+    <td>Allows a mode for dealing with corrupt records during parsing.<br>
+    <ul>
+      <li><code>PERMISSIVE</code>: when it meets a corrupted record, puts the malformed string into a field configured by <code>columnNameOfCorruptRecord</code>, and sets malformed fields to <code>null</code>. To keep corrupt records, a user can set a string type field named <code>columnNameOfCorruptRecord</code> in a user-defined schema. If a schema does not have the field, it drops corrupt records during parsing. When inferring a schema, it implicitly adds a <code>columnNameOfCorruptRecord</code> field in an output schema.</li>
+      <li><code>DROPMALFORMED</code>: ignores the whole corrupted records. This mode is unsupported in the XML built-in functions.</li>
+      <li><code>FAILFAST</code>: throws an exception when it meets corrupted records.</li>
+    </ul>
+    </td>
+    <td>read</td>
+  </tr>
+
+  <tr>
+      <td><code>inferSchema</code></td>
+      <td>true</td>
+      <td>If true, attempts to infer an appropriate type for each resulting DataFrame column. If false, all resulting columns are of string type. Default is true. XML built-in functions ignore this option.</td>
+      <td>read</td>
+  </tr>
+
+  <tr>
+      <td><code>columnNameOfCorruptRecord</code></td>
+      <td><code>spark.sql.columnNameOfCorruptRecord</code></td>
+      <td>Allows renaming the new field having a malformed string created by PERMISSIVE mode.</td>
+      <td>read</td>
+  </tr>
+
+  <tr>
+    <td><code>attributePrefix</code></td>
+    <td>_</td>
+    <td>The prefix for attributes to differentiate attributes from elements. This will be the prefix for field names. Default is _. Can be empty for reading XML, but not for writing.</td>
+    <td>read/write</td>
+  </tr>
+
+  <tr>
+    <td><code>valueTag</code></td>
+    <td>_VALUE</td>
+    <td>The tag used for the value when there are attributes in the element having no child.</td>
+    <td>read/write</td>
+  </tr>
+
+  <tr>
+    <td><code>encoding</code></td>
+    <td><code>UTF-8</code></td>
+    <td>For reading, decodes the XML files by the given encoding type. For writing, specifies encoding (charset) of saved XML files. XML built-in functions ignore this option. </td>
+    <td>read/write</td>
+  </tr>
+
+  <tr>
+    <td><code>ignoreSurroundingSpaces</code></td>
+    <td>false</td>
+    <td>Defines whether surrounding whitespaces from values being read should be skipped.</td>
+    <td>read</td>
+  </tr>
+
+  <tr>
+      <td><code>rowValidationXSDPath</code></td>
+      <td>null</td>
+      <td>Path to an optional XSD file that is used to validate the XML for each row individually. Rows that fail to validate are treated like parse errors as above. The XSD does not otherwise affect the schema provided, or inferred.</td>
+      <td>read</td>
+  </tr>
+
+  <tr>
+      <td><code>ignoreNamespace</code></td>
+      <td>false</td>
+      <td>If true, namespace prefixes on XML elements and attributes are ignored. Tags &lt;abc:author> and &lt;def:author> would, for example, be treated as if both are just &lt;author>. Note that, at the moment, namespaces cannot be ignored on the rowTag element, only its children. Note that XML parsing is in general not namespace-aware even if false.</td>
+      <td>read</td>
+  </tr>
+
+  <tr>
+    <td><code>timeZone</code></td>
+    <td>(value of <code>spark.sql.session.timeZone</code> configuration)</td>
+    <td>Sets the string that indicates a time zone ID to be used to format timestamps in the XML datasources or partition values. The following formats of <code>timeZone</code> are supported:<br>
+    <ul>
+      <li>Region-based zone ID: It should have the form 'area/city', such as 'America/Los_Angeles'.</li>
+      <li>Zone offset: It should be in the format '(+|-)HH:mm', for example '-08:00' or '+01:00'. Also 'UTC' and 'Z' are supported as aliases of '+00:00'.</li>
+    </ul>
+    Other short names like 'CST' are not recommended because they can be ambiguous.
+    </td>
+    <td>read/write</td>
+  </tr>
+
+  <tr>
+    <td><code>timestampFormat</code></td>
+    <td><code>yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]</code></td>
+    <td>Custom timestamp format string that follows the datetime pattern format. This applies to timestamp type.</td>
+    <td>read/write</td>
+  </tr>
+
+  <tr>
+    <td><code>dateFormat</code></td>
+    <td><code>yyyy-MM-dd</code></td>
+    <td>Custom date format string that follows the datetime pattern format. This applies to date type.</td>

Review Comment:
   ```suggestion
       <td>Sets the string that indicates a date format. Custom date formats follow the formats at <a href="https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html"> datetime pattern</a>. This applies to date type.</td>
   ```
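
   To illustrate the suggested wording, a small read-side sketch of both pattern options (path and element names are hypothetical; pattern letters follow the linked datetime pattern page):

   ```scala
   import org.apache.spark.sql.SparkSession

   val spark = SparkSession.builder().appName("XmlDatetimeSketch").getOrCreate()

   // dateFormat / timestampFormat control how string values are parsed into
   // date- and timestamp-typed fields.
   val events = spark.read
     .option("rowTag", "event")
     .option("dateFormat", "dd/MM/yyyy")               // e.g. <day>31/12/2023</day>
     .option("timestampFormat", "yyyy-MM-dd HH:mm:ss") // e.g. <at>2023-12-31 23:59:59</at>
     .xml("/path/to/events.xml")
   events.printSchema()
   ```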



##########
docs/sql-data-sources-xml.md:
##########
@@ -0,0 +1,224 @@
+---
+layout: global
+title: XML Files
+displayTitle: XML Files
+license: |
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to You under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+---
+
+Spark SQL provides `spark.read().xml("file_1_path","file_2_path")` to read a file or directory of files in XML format into a Spark DataFrame,
+and `dataframe.write().xml("path")` to write to an XML file.
+When reading an XML file, the `rowTag` option needs to be specified to indicate the XML element that maps to a `DataFrame row`. The option() function
+can be used to customize the behavior of reading or writing, such as controlling behavior of the XML attributes, XSD validation, compression, and so
+on.
+
+<div class="codetabs">
+
+<div data-lang="scala"  markdown="1">
+{% include_example xml_dataset scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
+</div>
+
+<div data-lang="java"  markdown="1">
+{% include_example xml_dataset java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
+</div>
+
+</div>
+
+## Data Source Option
+
+Data source options of XML can be set via:
+
+* the `.option`/`.options` methods of
+    * `DataFrameReader`
+    * `DataFrameWriter`
+    * `DataStreamReader`
+    * `DataStreamWriter`
+* the built-in functions below
+    * `from_xml`
+    * `to_xml`
+    * `schema_of_xml`
+* `OPTIONS` clause at [CREATE TABLE USING DATA_SOURCE](sql-ref-syntax-ddl-create-table-datasource.html)
+
+<table class="table table-striped">
+  <thead><tr><th><b>Property Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th><th><b>Scope</b></th></tr></thead>
+  <tr>
+    <td><code>rowTag</code></td>
+    <td>ROW</td>
+    <td>The row tag of your XML files to treat as a row. For example, in this xml: <code>&lt;books&gt; &lt;book&gt;&lt;/book&gt; ...&lt;/books&gt;</code> the appropriate value would be book.</td>
+    <td>read</td>
+  </tr>
+
+  <tr>
+    <td><code>samplingRatio</code></td>
+    <td><code>1.0</code></td>
+    <td>Defines fraction of rows used for schema inferring. XML built-in functions ignore this option.</td>
+    <td>read</td>
+  </tr>
+
+  <tr>
+    <td><code>excludeAttribute</code></td>
+    <td><code>false</code></td>
+    <td>Whether to exclude attributes in elements.</td>
+    <td>read</td>
+  </tr>
+
+  <tr>
+    <td><code>mode</code></td>
+    <td><code>PERMISSIVE</code></td>
+    <td>Allows a mode for dealing with corrupt records during parsing.<br>
+    <ul>
+      <li><code>PERMISSIVE</code>: when it meets a corrupted record, puts the malformed string into a field configured by <code>columnNameOfCorruptRecord</code>, and sets malformed fields to <code>null</code>. To keep corrupt records, a user can set a string type field named <code>columnNameOfCorruptRecord</code> in a user-defined schema. If a schema does not have the field, it drops corrupt records during parsing. When inferring a schema, it implicitly adds a <code>columnNameOfCorruptRecord</code> field in an output schema.</li>
+      <li><code>DROPMALFORMED</code>: ignores the whole corrupted records. This mode is unsupported in the XML built-in functions.</li>
+      <li><code>FAILFAST</code>: throws an exception when it meets corrupted records.</li>
+    </ul>
+    </td>
+    <td>read</td>
+  </tr>
+
+  <tr>
+      <td><code>inferSchema</code></td>
+      <td>true</td>
+      <td>If true, attempts to infer an appropriate type for each resulting DataFrame column. If false, all resulting columns are of string type. Default is true. XML built-in functions ignore this option.</td>
+      <td>read</td>
+  </tr>
+
+  <tr>
+      <td><code>columnNameOfCorruptRecord</code></td>
+      <td><code>spark.sql.columnNameOfCorruptRecord</code></td>
+      <td>Allows renaming the new field having a malformed string created by PERMISSIVE mode.</td>
+      <td>read</td>
+  </tr>
+
+  <tr>
+    <td><code>attributePrefix</code></td>
+    <td>_</td>
+    <td>The prefix for attributes to differentiate attributes from elements. This will be the prefix for field names. Default is _. Can be empty for reading XML, but not for writing.</td>
+    <td>read/write</td>
+  </tr>
+
+  <tr>
+    <td><code>valueTag</code></td>
+    <td>_VALUE</td>
+    <td>The tag used for the value when there are attributes in the element having no child.</td>
+    <td>read/write</td>
+  </tr>
+
+  <tr>
+    <td><code>encoding</code></td>
+    <td><code>UTF-8</code></td>
+    <td>For reading, decodes the XML files by the given encoding type. For writing, specifies encoding (charset) of saved XML files. XML built-in functions ignore this option. </td>
+    <td>read/write</td>
+  </tr>
+
+  <tr>
+    <td><code>ignoreSurroundingSpaces</code></td>
+    <td>false</td>
+    <td>Defines whether surrounding whitespaces from values being read should be skipped.</td>
+    <td>read</td>
+  </tr>
+
+  <tr>
+      <td><code>rowValidationXSDPath</code></td>
+      <td>null</td>
+      <td>Path to an optional XSD file that is used to validate the XML for each row individually. Rows that fail to validate are treated like parse errors as above. The XSD does not otherwise affect the schema provided, or inferred.</td>
+      <td>read</td>
+  </tr>
+
+  <tr>
+      <td><code>ignoreNamespace</code></td>
+      <td>false</td>
+      <td>If true, namespace prefixes on XML elements and attributes are ignored. Tags &lt;abc:author> and &lt;def:author> would, for example, be treated as if both are just &lt;author>. Note that, at the moment, namespaces cannot be ignored on the rowTag element, only its children. Note that XML parsing is in general not namespace-aware even if false.</td>
+      <td>read</td>
+  </tr>
+
+  <tr>
+    <td><code>timeZone</code></td>
+    <td>(value of <code>spark.sql.session.timeZone</code> configuration)</td>
+    <td>Sets the string that indicates a time zone ID to be used to format timestamps in the XML datasources or partition values. The following formats of <code>timeZone</code> are supported:<br>
+    <ul>
+      <li>Region-based zone ID: It should have the form 'area/city', such as 'America/Los_Angeles'.</li>
+      <li>Zone offset: It should be in the format '(+|-)HH:mm', for example '-08:00' or '+01:00'. Also 'UTC' and 'Z' are supported as aliases of '+00:00'.</li>
+    </ul>
+    Other short names like 'CST' are not recommended because they can be ambiguous.
+    </td>
+    <td>read/write</td>
+  </tr>
+
+  <tr>
+    <td><code>timestampFormat</code></td>
+    <td><code>yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]</code></td>
+    <td>Custom timestamp format string that follows the datetime pattern format. This applies to timestamp type.</td>
+    <td>read/write</td>
+  </tr>
+
+  <tr>
+    <td><code>dateFormat</code></td>
+    <td><code>yyyy-MM-dd</code></td>
+    <td>Custom date format string that follows the datetime pattern format. This applies to date type.</td>
+    <td>read/write</td>
+  </tr>
+
+  <tr>
+    <td><code>locale</code></td>
+    <td><code>en-US</code></td>
+    <td>Sets a locale as a language tag in IETF BCP 47 format. For instance, locale is used while parsing dates and timestamps. </td>
+    <td>read/write</td>
+  </tr>
+
+  <tr>
+      <td><code>rootTag</code></td>
+      <td>ROWS</td>
+      <td>Root tag of the XML files. For example, in <code>&lt;books&gt; &lt;book&gt;&lt;/book&gt; ...&lt;/books&gt;</code>, the appropriate value would be books. It can include basic attributes by specifying a value like books foo="bar".</td>
+      <td>read</td>
+  </tr>
+
+  <tr>
+      <td><code>declaration</code></td>
+      <td><code>version="1.0" encoding="UTF-8" standalone="yes"</code></td>
+      <td>Content of XML declaration to write at the start of every output XML file, before the rootTag. For example, a value of foo causes <code>&lt;?xml foo?&gt;</code> to be written. Set to empty string to suppress.</td>
+      <td>write</td>
+  </tr>
+
+  <tr>
+    <td><code>arrayElementName</code></td>
+    <td>item</td>
+    <td>Name of XML element that encloses each element of an array-valued column when writing.</td>
+    <td>write</td>
+  </tr>
+
+  <tr>
+    <td><code>nullValue</code></td>
+    <td>null</td>
+    <td>Sets the string representation of a null value. Default is string null. When this is null, it does not write attributes and elements for fields.</td>
+    <td>read</td>
+  </tr>
+
+  <tr>
+    <td><code>wildcardColName</code></td>
+    <td>xs_any</td>
+    <td>Name of a column existing in the provided schema which is interpreted as a 'wildcard'. It must have type string or array of strings. It will match any XML child element that is not otherwise matched by the schema. The XML of the child becomes the string value of the column. If an array, then all unmatched elements will be returned as an array of strings. As its name implies, it is meant to emulate XSD's xs:any type.</td>
+    <td>read</td>
+  </tr>
+
+  <tr>
+    <td><code>compression</code></td>
+    <td>none</td>
+    <td>Compression codec to use when saving to file. This can be one of the known case-insensitive shortened names (none, bzip2, gzip, lz4, snappy and deflate). XML built-in functions ignore this option.</td>
+    <td>read</td>

Review Comment:
   ```suggestion
       <td>write</td>
   ```
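
   A minimal sketch of the corrected write-only scope (sample data and output directory are hypothetical):

   ```scala
   import org.apache.spark.sql.SparkSession

   val spark = SparkSession.builder().appName("XmlCompressionSketch").getOrCreate()
   import spark.implicits._

   val people = Seq(("Justin", 19), ("Andy", 30)).toDF("name", "age")

   // compression only takes effect on the write path; on read, the codec is
   // detected from the file extension rather than from an option.
   people.write
     .option("rowTag", "person")
     .option("compression", "gzip") // none, bzip2, gzip, lz4, snappy or deflate
     .xml("/tmp/people-xml-gz")
   ```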




Re: [PR] [SPARK-44752][SQL] XML: Update Spark Docs [spark]

Posted by "sandip-db (via GitHub)" <gi...@apache.org>.
sandip-db commented on code in PR #43350:
URL: https://github.com/apache/spark/pull/43350#discussion_r1364623545


##########
docs/sql-data-sources-xml.md:
##########
@@ -0,0 +1,232 @@
+---
+layout: global
+title: XML Files
+displayTitle: XML Files
+license: |
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+  
+       http://www.apache.org/licenses/LICENSE-2.0
+  
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+---
+
+Spark SQL provides `spark.read().xml("file_1_path","file_2_path")` to read a file or directory of files in XML format into a Spark DataFrame, and `dataframe.write().xml("path")` to write to an XML file. When reading an XML file, the `rowTag` option must be specified to indicate the XML element that maps to a `DataFrame row`. The option() function can be used to customize the behavior of reading or writing, such as controlling behavior of the XML attributes, XSD validation, compression, and so on.
+
+<div class="codetabs">
+
+<div data-lang="python"  markdown="1">
+{% include_example xml_dataset python/sql/datasource.py %}
+</div>
+
+<div data-lang="scala"  markdown="1">
+{% include_example xml_dataset scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
+</div>
+
+<div data-lang="java"  markdown="1">
+{% include_example xml_dataset java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
+</div>
+
+</div>
+
+## Data Source Option
+
+Data source options of XML can be set via:
+
+* the `.option`/`.options` methods of
+    * `DataFrameReader`
+    * `DataFrameWriter`
+    * `DataStreamReader`
+    * `DataStreamWriter`
+* the built-in functions below
+    * `from_xml`
+    * `to_xml`
+    * `schema_of_xml`
+* `OPTIONS` clause at [CREATE TABLE USING DATA_SOURCE](sql-ref-syntax-ddl-create-table-datasource.html)
+
+<table class="table table-striped">
+  <thead><tr><th><b>Property Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th><th><b>Scope</b></th></tr></thead>
+  <tr>
+    <td><code>rowTag</code></td>
+    <td><code>ROW</code></td>

Review Comment:
   Remove the default. `rowTag` is a required option now.
   ```suggestion
       <td></td>
   ```
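
   A sketch of what the now-required option looks like at the call site, reusing the people.xml resource added in this PR:

   ```scala
   import org.apache.spark.sql.SparkSession

   val spark = SparkSession.builder().appName("XmlRowTagSketch").getOrCreate()

   // With no implicit ROW default to fall back on, every file-based XML read
   // names the repeating element explicitly.
   val people = spark.read
     .option("rowTag", "person") // each <person>...</person> becomes one Row
     .xml("examples/src/main/resources/people.xml")
   people.printSchema()
   ```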




Re: [PR] [SPARK-44752][SQL] XML: Update Spark Docs [spark]

Posted by "sandip-db (via GitHub)" <gi...@apache.org>.
sandip-db commented on code in PR #43350:
URL: https://github.com/apache/spark/pull/43350#discussion_r1365794343


##########
docs/sql-data-sources-xml.md:
##########
@@ -0,0 +1,232 @@
+---
+layout: global
+title: XML Files
+displayTitle: XML Files
+license: |
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+  
+       http://www.apache.org/licenses/LICENSE-2.0
+  
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+---
+
+Spark SQL provides `spark.read().xml("file_1_path","file_2_path")` to read a file or directory of files in XML format into a Spark DataFrame, and `dataframe.write().xml("path")` to write to an XML file. When reading an XML file, the `rowTag` option must be specified to indicate the XML element that maps to a `DataFrame row`. The option() function can be used to customize the behavior of reading or writing, such as controlling behavior of the XML attributes, XSD validation, compression, and so on.

Review Comment:
   @beliefer `rowTag` is ignored by `from_xml`, `schema_of_xml` and `xml(xmlDataset: Dataset[String])`. Each of these APIs assumes a single XML record that maps to a single `Row`.
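
   A sketch of that single-record behavior, assuming the Scala `from_xml`/`schema_of_xml` variants mirror their JSON counterparts:

   ```scala
   import org.apache.spark.sql.SparkSession
   import org.apache.spark.sql.functions.{col, from_xml, schema_of_xml}

   val spark = SparkSession.builder().appName("XmlFunctionsSketch").getOrCreate()
   import spark.implicits._

   // Each value is already one complete XML record, so there is no rowTag to
   // locate: the functions map one string to one struct.
   val records = Seq("<person><name>Justin</name><age>19</age></person>").toDF("payload")

   val schema = schema_of_xml("<person><name>Justin</name><age>19</age></person>")
   records.select(from_xml(col("payload"), schema).alias("parsed")).show(truncate = false)
   ```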




Re: [PR] [WIP][SPARK-44752][SQL] XML: Update Spark Docs [spark]

Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on PR #43350:
URL: https://github.com/apache/spark/pull/43350#issuecomment-1761452245

   > Thank you very much for your help
   
   Please refer https://github.com/apache/spark/blob/master/docs/README.md



Re: [PR] [SPARK-44752][SQL] XML: Update Spark Docs [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #43350:
URL: https://github.com/apache/spark/pull/43350#issuecomment-1759383934

   You might need to check after building the docs as described in https://github.com/apache/spark/tree/master/docs



Re: [PR] [SPARK-44752][SQL] XML: Update Spark Docs [spark]

Posted by "srowen (via GitHub)" <gi...@apache.org>.
srowen commented on code in PR #43350:
URL: https://github.com/apache/spark/pull/43350#discussion_r1357031486


##########
pom.xml:
##########
@@ -283,6 +283,7 @@
       Overridable test home. So that you can call individual pom files directly without
       things breaking.
     -->
+    <session.executionRootDirectory>/tmp</session.executionRootDirectory>

Review Comment:
   Out of curiosity, why did we need this?




Re: [PR] [WIP][SPARK-44752][SQL] XML: Update Spark Docs [spark]

Posted by "laglangyue (via GitHub)" <gi...@apache.org>.
laglangyue commented on PR #43350:
URL: https://github.com/apache/spark/pull/43350#issuecomment-1761212472

   I don't know how to build docs locally so that I can preview HTML
   @HyukjinKwon @sandip-db 
   Thank you very much for your help



Re: [PR] [WIP][SPARK-44752][SQL] XML: Update Spark Docs [spark]

Posted by "sandip-db (via GitHub)" <gi...@apache.org>.
sandip-db commented on code in PR #43350:
URL: https://github.com/apache/spark/pull/43350#discussion_r1361510319


##########
docs/sql-data-sources-xml.md:
##########
@@ -0,0 +1,232 @@
+---
+layout: global
+title: XML Files
+displayTitle: XML Files
+license: |
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+  
+       http://www.apache.org/licenses/LICENSE-2.0
+  
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+---
+
+Spark SQL provides `spark.read().xml("file_1_path","file_2_path")` to read a file or directory of files in XML format into a Spark DataFrame, and `dataframe.write().xml("path")` to write to an XML file. When reading an XML file, the `rowTag` option needs to be specified to indicate the XML element that maps to a `DataFrame row`. The option() function can be used to customize the behavior of reading or writing, such as controlling behavior of the XML attributes, XSD validation, compression, and so on.
+
+<div class="codetabs">
+
+<div data-lang="python"  markdown="1">
+{% include_example xml_dataset python/sql/datasource.py %}
+</div>
+
+<div data-lang="scala"  markdown="1">
+{% include_example xml_dataset scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
+</div>
+
+<div data-lang="java"  markdown="1">
+{% include_example xml_dataset java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
+</div>
+
+</div>
+
+## Data Source Option
+
+Data source options of XML can be set via:
+
+* the `.option`/`.options` methods of
+    * `DataFrameReader`
+    * `DataFrameWriter`
+    * `DataStreamReader`
+    * `DataStreamWriter`
+* the built-in functions below
+    * `from_xml`
+    * `to_xml`
+    * `schema_of_xml`
+* `OPTIONS` clause at [CREATE TABLE USING DATA_SOURCE](sql-ref-syntax-ddl-create-table-datasource.html)
+
+<table class="table table-striped">
+  <thead><tr><th><b>Property Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th><th><b>Scope</b></th></tr></thead>
+  <tr>
+    <td><code>rowTag</code></td>
+    <td><code>ROW</code></td>
+    <td>The row tag of your xml files to treat as a row. For example, in this xml: 
+        <code><xmp><books><book></book>...</books></xmp></code>
+        the appropriate value would be book.

Review Comment:
   ```suggestion
           the appropriate value would be book. This is a required option for both read and write.
           XML built-in functions ignore this option.
   ```




Re: [PR] [SPARK-44752][SQL] XML: Update Spark Docs [spark]

Posted by "laglangyue (via GitHub)" <gi...@apache.org>.
laglangyue commented on PR #43350:
URL: https://github.com/apache/spark/pull/43350#issuecomment-1760650718

   > You might need to check after building the docs as described in https://github.com/apache/spark/tree/master/docs
   
   Yeah, thanks. I need this. I searched for this for a long time before, but couldn't find how to build and preview locally.



Re: [PR] [SPARK-44752][SQL] XML: Update Spark Docs [spark]

Posted by "sandip-db (via GitHub)" <gi...@apache.org>.
sandip-db commented on code in PR #43350:
URL: https://github.com/apache/spark/pull/43350#discussion_r1357498046


##########
docs/sql-data-sources-xml.md:
##########
@@ -0,0 +1,222 @@
+---
+layout: global
+title: XML Files
+displayTitle: XML Files
+license: |
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to You under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+---
+
+Spark SQL provides spark.read().xml("file_1_path","file_2_path") to read one or more xml files into a Spark DataFrame, and dataframe.write().xml("
+path") to write to a xml file.
+When reading a text file, each line becomes each row that has string “value” column by default. The line separator can be changed as shown in the
+example below. The option() function can be used to customize the behavior of reading or writing, such as controlling behavior of the line separator,
+compression, and so on.
+
+<div class="codetabs">
+

Review Comment:
   Add python example



##########
examples/src/main/resources/people.xml:
##########
@@ -0,0 +1,15 @@
+<?xml version="1.0"?>
+<ROWSET>

Review Comment:
   ```suggestion
   <people>
   ```



##########
docs/sql-data-sources-xml.md:
##########
@@ -0,0 +1,222 @@
+---
+layout: global
+title: XML Files
+displayTitle: XML Files
+license: |
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to You under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+---
+
+Spark SQL provides spark.read().xml("file_1_path","file_2_path") to read one or more xml files into a Spark DataFrame, and dataframe.write().xml("
+path") to write to a xml file.
+When reading a text file, each line becomes each row that has string “value” column by default. The line separator can be changed as shown in the
+example below. The option() function can be used to customize the behavior of reading or writing, such as controlling behavior of the line separator,
+compression, and so on.

Review Comment:
   ```suggestion
   When reading a XML file, the `rowTag` option need to be specified to indicate the XML element that maps to a `DataFrame row`. The option() function can be used to customize the behavior of reading or writing, such as controlling behavior of the XML attributes, XSD validation, compression, and so on.
   ```



##########
examples/src/main/resources/people.xml:
##########
@@ -0,0 +1,15 @@
+<?xml version="1.0"?>
+<ROWSET>
+    <ROW>

Review Comment:
   ```suggestion
       <person>
   ```



##########
docs/sql-data-sources-xml.md:
##########
@@ -0,0 +1,222 @@
+---
+layout: global
+title: XML Files
+displayTitle: XML Files
+license: |
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to You under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+---
+
+Spark SQL provides spark.read().xml("file_1_path","file_2_path") to read one or more xml files into a Spark DataFrame, and dataframe.write().xml("
+path") to write to a xml file.
+When reading a text file, each line becomes each row that has string “value” column by default. The line separator can be changed as shown in the
+example below. The option() function can be used to customize the behavior of reading or writing, such as controlling behavior of the line separator,
+compression, and so on.
+
+<div class="codetabs">
+
+<div data-lang="scala"  markdown="1">
+{% include_example xml_dataset scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
+</div>
+
+<div data-lang="java"  markdown="1">
+{% include_example xml_dataset java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
+</div>
+
+</div>
+
+## Data Source Option
+
+Data source options of JSON can be set via:

Review Comment:
   ```suggestion
   Data source options of XML can be set via:
   ```



##########
docs/sql-data-sources-xml.md:
##########
@@ -0,0 +1,222 @@
+---
+layout: global
+title: XML Files
+displayTitle: XML Files
+license: |
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to You under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+---
+
+Spark SQL provides spark.read().xml("file_1_path","file_2_path") to read one or more xml files into a Spark DataFrame, and dataframe.write().xml("

Review Comment:
   ```suggestion
   Spark SQL provides spark.read().xml("file_1_path","file_2_path") to read a file or directory of files in XML format into a Spark DataFrame, and dataframe.write().xml("
   ```



##########
examples/src/main/scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala:
##########
@@ -418,4 +419,53 @@ object SQLDataSourceExample {
       .jdbc("jdbc:postgresql:dbserver", "schema.tablename", connectionProperties)
     // $example off:jdbc_dataset$
   }
+
+  private def runXmlDatasetExample(spark: SparkSession): Unit = {
+    // $example on:xml_dataset$
+    // Primitive types (Int, String, etc) and Product types (case classes) encoders are
+    // supported by importing this when creating a Dataset.
+    import spark.implicits._
+    // An XML dataset is pointed to by path.
+    // The path can be either a single XML file or multiple XML files
+    val path = "examples/src/main/resources/people.xml"
+    val peopleDF = spark.read.xml(path)

Review Comment:
   ```suggestion
       val peopleDF = spark.read.option("rowTag", "person").xml(path)
   ```



##########
examples/src/main/scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala:
##########
@@ -418,4 +419,53 @@ object SQLDataSourceExample {
       .jdbc("jdbc:postgresql:dbserver", "schema.tablename", connectionProperties)
     // $example off:jdbc_dataset$
   }
+
+  private def runXmlDatasetExample(spark: SparkSession): Unit = {
+    // $example on:xml_dataset$
+    // Primitive types (Int, String, etc) and Product types (case classes) encoders are
+    // supported by importing this when creating a Dataset.
+    import spark.implicits._
+    // An XML dataset is pointed to by path.
+    // The path can be either a single XML file or multiple XML files
+    val path = "examples/src/main/resources/people.xml"
+    val peopleDF = spark.read.xml(path)
+
+    // The inferred schema can be visualized using the printSchema() method
+    peopleDF.printSchema()
+    // root
+    //  |-- age: long (nullable = true)
+    //  |-- name: string (nullable = true)
+
+    // Creates a temporary view using the DataFrame
+    peopleDF.createOrReplaceTempView("people")
+
+    // SQL statements can be run by using the sql methods provided by spark
+    val teenagerNamesDF = spark.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19")
+    teenagerNamesDF.show()
+    // +------+
+    // |  name|
+    // +------+
+    // |Justin|
+    // +------+
+
+    // Alternatively, a DataFrame can be created for an XML dataset represented by a Dataset[String]
+    val otherPeopleDataset = spark.createDataset(
+      """
+        |<ROW>

Review Comment:
   ```suggestion
           |<person>
   ```



##########
docs/sql-data-sources-xml.md:
##########
@@ -0,0 +1,222 @@
+---
+layout: global
+title: XML Files
+displayTitle: XML Files
+license: |
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to You under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+---
+
+Spark SQL provides spark.read().xml("file_1_path","file_2_path") to read one or more xml files into a Spark DataFrame, and dataframe.write().xml("
+path") to write to a xml file.
+When reading a text file, each line becomes each row that has string “value” column by default. The line separator can be changed as shown in the
+example below. The option() function can be used to customize the behavior of reading or writing, such as controlling behavior of the line separator,
+compression, and so on.
+
+<div class="codetabs">
+
+<div data-lang="scala"  markdown="1">
+{% include_example xml_dataset scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
+</div>
+
+<div data-lang="java"  markdown="1">
+{% include_example xml_dataset java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
+</div>
+
+</div>
+
+## Data Source Option
+
+Data source options of JSON can be set via:
+
+* the `.option`/`.options` methods of
+    * `DataFrameReader`
+    * `DataFrameWriter`
+    * `DataStreamReader`
+    * `DataStreamWriter`
+* the built-in functions below
+    * `from_json`
+    * `to_json`
+    * `schema_of_json`

Review Comment:
   ```suggestion
       * `from_xml`
       * `to_xml`
       * `schema_of_xml`
   ```



##########
docs/sql-data-sources-xml.md:
##########
@@ -0,0 +1,222 @@
+---
+layout: global
+title: XML Files
+displayTitle: XML Files
+license: |
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to You under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+---
+
+Spark SQL provides spark.read().xml("file_1_path","file_2_path") to read one or more xml files into a Spark DataFrame, and dataframe.write().xml("
+path") to write to a xml file.
+When reading a text file, each line becomes each row that has string “value” column by default. The line separator can be changed as shown in the
+example below. The option() function can be used to customize the behavior of reading or writing, such as controlling behavior of the line separator,
+compression, and so on.
+
+<div class="codetabs">
+
+<div data-lang="scala"  markdown="1">
+{% include_example xml_dataset scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
+</div>
+
+<div data-lang="java"  markdown="1">
+{% include_example xml_dataset java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
+</div>
+
+</div>
+
+## Data Source Option
+
+Data source options of JSON can be set via:
+
+* the `.option`/`.options` methods of
+    * `DataFrameReader`
+    * `DataFrameWriter`
+    * `DataStreamReader`
+    * `DataStreamWriter`
+* the built-in functions below
+    * `from_json`
+    * `to_json`
+    * `schema_of_json`
+* `OPTIONS` clause at [CREATE TABLE USING DATA_SOURCE](sql-ref-syntax-ddl-create-table-datasource.html)
+
+<table class="table table-striped">

Review Comment:
   Please update the table according to the information provided here:
   
   Option | Description | Scope
   --- |--- | ---
   rowTag | The row tag of your xml files to treat as a row. For example, in this xml: `<books> <book></book> ...</books>` the appropriate value would be book. Default: ROW | read
   samplingRatio | Defines fraction of rows used for schema inferring. XML built-in functions ignore this option. Default is 1.0. | read
   excludeAttribute | Whether to exclude attributes in elements. Default: false | read
   mode | Allows a mode for dealing with corrupt records during parsing.<br>`PERMISSIVE`: when it meets a corrupted record, puts the malformed string into a field configured by `columnNameOfCorruptRecord`, and sets malformed fields to null. To keep corrupt records, a user can set a string type field named `columnNameOfCorruptRecord` in a user-defined schema. If a schema does not have the field, it drops corrupt records during parsing. When inferring a schema, it implicitly adds a `columnNameOfCorruptRecord` field in an output schema.<br>`DROPMALFORMED`: ignores the whole corrupted records. This mode is unsupported in the XML built-in functions.<br>`FAILFAST`: throws an exception when it meets corrupted records. | read
   inferSchema | If `true`, attempts to infer an appropriate type for each resulting DataFrame column. If `false`, all resulting columns are of string type. Default is `true`. XML built-in functions ignore this option. | read
   columnNameOfCorruptRecord | Allows renaming the new field having a malformed string created by `PERMISSIVE` mode. Default: `spark.sql.columnNameOfCorruptRecord` | read
   attributePrefix | The prefix for attributes to differentiate attributes from elements. This will be the prefix for field names. Default is `_`. Can be empty for reading XML, but not for writing. | read / write
   valueTag | The tag used for the value when there are attributes in the element having no child. Default is `_VALUE`. | read / write
   encoding | For reading, decodes the XML files by the given encoding type. For writing, specifies encoding (charset) of saved XML files. XML built-in functions ignore this option. Default is `UTF-8` | read / write
   ignoreSurroundingSpaces | Defines whether surrounding whitespaces from values being read should be skipped. Default is `false`. | read
   rowValidationXSDPath | Path to an optional XSD file that is used to validate the XML for each row individually. Rows that fail to validate are treated like parse errors as above. The XSD does not otherwise affect the schema provided, or inferred. | read
   ignoreNamespace | If true, namespace prefixes on XML elements and attributes are ignored. Tags `<abc:author>` and `<def:author>` would, for example, be treated as if both are just `<author>`. Note that, at the moment, namespaces cannot be ignored on the rowTag element, only its children. Note that XML parsing is in general not namespace-aware even if false. Defaults to `false`. | read
   timeZone |(Defaults to `spark.sql.session.timeZone` configuration)<br>Sets the string that indicates a time zone ID to be used to format timestamps in the XML datasources or partition values. The following formats of `timeZone` are supported:<br>    <ul>      <li>Region-based zone ID: It should have the form 'area/city', such as 'America/Los_Angeles'.</li>      <li>Zone offset: It should be in the format '(+\|-)HH:mm', for example '-08:00' or '+01:00'. Also 'UTC' and 'Z' are supported as aliases of '+00:00'.</li>    </ul>    Other short names like 'CST' are not recommended because they can be ambiguous.  | read / write
   timestampFormat | Custom timestamp format string that follows the datetime pattern format. This applies to timestamp type. Default: `yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]` | read / write
   dateFormat | Custom date format string that follows the datetime pattern format. This applies to date type. Default: `yyyy-MM-dd` | read / write
   locale | Sets a locale as a language tag in IETF BCP 47 format. For instance, locale is used while parsing dates and timestamps. Default: `en-US` | read
   rootTag | Root tag of the xml files. For example, in `<books> <book></book> ...</books>`, the appropriate value would be `books`. It can include basic attributes by specifying a value like `books foo="bar"`. Default is `ROWS`. | write
   declaration | Content of XML declaration to write at the start of every output XML file, before the rootTag. For example, a value of foo causes `<?xml foo?>` to be written. Set to empty string to suppress. Defaults to `version="1.0" encoding="UTF-8" standalone="yes"`. | write
   arrayElementName | Name of XML element that encloses each element of an array-valued column when writing. Default is `item` | write
   nullValue | Sets the string representation of a `null` value. Default is string `null`. When this is `null`, it does not write attributes and elements for fields. | read/ write
   wildcardColName | Name of a column existing in the provided schema which is interpreted as a 'wildcard'. It must have type string or array of strings. It will match any XML child element that is not otherwise matched by the schema. The XML of the child becomes the string value of the column. If an array, then all unmatched elements will be returned as an array of strings. As its name implies, it is meant to emulate XSD's `xs:any` type. Default is `xs_any`. | read
   compression | Compression codec to use when saving to file. This can be one of the known case-insensitive shortened names (none, `bzip2`, `gzip`, `lz4`, `snappy` and `deflate`). XML built-in functions ignore this option. Default: `none` | write
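
   As a quick cross-check of several rows above, a hedged read-side sketch combining a few of them (input path and tag names are hypothetical):

   ```scala
   import org.apache.spark.sql.SparkSession

   val spark = SparkSession.builder().appName("XmlOptionsSketch").getOrCreate()

   val books = spark.read
     .option("rowTag", "book")
     .option("attributePrefix", "@")    // <book id="1"> surfaces as a field named "@id"
     .option("valueTag", "#text")       // element text sitting next to attributes lands here
     .option("ignoreSurroundingSpaces", "true")
     .option("mode", "PERMISSIVE")
     .option("columnNameOfCorruptRecord", "_malformed")
     .xml("/path/to/books.xml")
   books.printSchema()
   ```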



##########
examples/src/main/scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala:
##########
@@ -418,4 +419,53 @@ object SQLDataSourceExample {
       .jdbc("jdbc:postgresql:dbserver", "schema.tablename", connectionProperties)
     // $example off:jdbc_dataset$
   }
+
+  private def runXmlDatasetExample(spark: SparkSession): Unit = {
+    // $example on:xml_dataset$
+    // Primitive types (Int, String, etc) and Product types (case classes) encoders are
+    // supported by importing this when creating a Dataset.
+    import spark.implicits._
+    // An XML dataset is pointed to by path.
+    // The path can be either a single xml file or more xml files
+    val path = "examples/src/main/resources/people.xml"
+    val peopleDF = spark.read.xml(path)
+
+    // The inferred schema can be visualized using the printSchema() method
+    peopleDF.printSchema()
+    // root
+    //  |-- age: long (nullable = true)
+    //  |-- name: string (nullable = true)
+
+    // Creates a temporary view using the DataFrame
+    peopleDF.createOrReplaceTempView("people")
+
+    // SQL statements can be run by using the sql methods provided by spark
+    val teenagerNamesDF = spark.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19")
+    teenagerNamesDF.show()
+    // +------+
+    // |  name|
+    // +------+
+    // |Justin|
+    // +------+
+
+    // Alternatively, a DataFrame can be created for an XML dataset represented by a Dataset[String]
+    val otherPeopleDataset = spark.createDataset(
+      """
+        |<ROW>
+        |    <name>laglangyue</name>
+        |    <job>Developer</job>
+        |    <age>28</age>
+        |</ROW>
+        |""".stripMargin :: Nil)
+    val otherPeople = spark.read
+      .option("rowTag", "ROW")

Review Comment:
   ```suggestion
         .option("rowTag", "person")
   ```
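
   Complementing the read-side example under review, a small sketch of the inverse direction, assuming the Scala `to_xml` function mirrors `to_json`:

   ```scala
   import org.apache.spark.sql.SparkSession
   import org.apache.spark.sql.functions.{struct, to_xml}

   val spark = SparkSession.builder().appName("ToXmlSketch").getOrCreate()
   import spark.implicits._

   // to_xml renders a struct column as one XML string per row.
   val people = Seq(("laglangyue", "Developer", 28)).toDF("name", "job", "age")
   people
     .select(to_xml(struct($"name", $"job", $"age")).alias("xml"))
     .show(truncate = false)
   ```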




Re: [PR] [SPARK-44752][SQL] XML: Update Spark Docs [spark]

Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on code in PR #43350:
URL: https://github.com/apache/spark/pull/43350#discussion_r1365386968


##########
docs/sql-data-sources-xml.md:
##########
@@ -0,0 +1,232 @@
+---
+layout: global
+title: XML Files
+displayTitle: XML Files
+license: |
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+  
+       http://www.apache.org/licenses/LICENSE-2.0
+  
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+---
+
+Spark SQL provides `spark.read().xml("file_1_path","file_2_path")` to read a file or directory of files in XML format into a Spark DataFrame, and `dataframe.write().xml("path")` to write to an XML file. When reading an XML file, the `rowTag` option must be specified to indicate the XML element that maps to a `DataFrame row`. The option() function can be used to customize the behavior of reading or writing, such as controlling behavior of the XML attributes, XSD validation, compression, and so on.

Review Comment:
   It seems not all the XML read APIs need the `rowTag` option.
   Please refer https://github.com/apache/spark/blob/7057952f6bc2c5cf97dd408effd1b18bee1cb8f4/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L579C1-L579C1




Re: [PR] [SPARK-44752][SQL] XML: Update Spark Docs [spark]

Posted by "laglangyue (via GitHub)" <gi...@apache.org>.
laglangyue commented on code in PR #43350:
URL: https://github.com/apache/spark/pull/43350#discussion_r1365551314


##########
docs/sql-data-sources-xml.md:
##########
@@ -0,0 +1,232 @@
+---
+layout: global
+title: XML Files
+displayTitle: XML Files
+license: |
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+  
+       http://www.apache.org/licenses/LICENSE-2.0
+  
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+---
+
+Spark SQL provides `spark.read().xml("file_1_path","file_2_path")` to read a file or directory of files in XML format into a Spark DataFrame, and `dataframe.write().xml("path")` to write to an XML file. When reading an XML file, the `rowTag` option must be specified to indicate the XML element that maps to a `DataFrame` row. The `option()` function can be used to customize the behavior of reading or writing, such as controlling the behavior of XML attributes, XSD validation, compression, and so on.

Review Comment:
   Initially, per org.apache.spark.sql.catalyst.xml.XmlOptions, DEFAULT_ROW_TAG is `ROW`.
   Per @sandip-db, the option will be made required in the future; see the JIRA:
   https://issues.apache.org/jira/browse/SPARK-45562
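
   A small sketch of the fallback described above (the file path is
   hypothetical; behavior as of this thread, since SPARK-45562 may later make
   the option required):

   ```scala
   // With the option omitted, XmlOptions falls back to DEFAULT_ROW_TAG = "ROW".
   val withDefault = spark.read.xml("path/to/rows.xml")

   // Equivalent explicit form for records wrapped in <ROW> elements.
   val explicit = spark.read.option("rowTag", "ROW").xml("path/to/rows.xml")
   ```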




Re: [PR] [WIP][SPARK-44752][SQL] XML: Update Spark Docs [spark]

Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on PR #43350:
URL: https://github.com/apache/spark/pull/43350#issuecomment-1765808314

   For people.xml, maybe you can reference https://github.com/apache/spark/pull/40249
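
   A sketch of what such an example might look like (entirely illustrative;
   the resource path and the actual people.xml in the referenced PR may
   differ):

   ```scala
   // Read a hypothetical people.xml where each record is a <person> element.
   val people = spark.read
     .option("rowTag", "person")
     .xml("examples/src/main/resources/people.xml")
   people.printSchema()
   ```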



Re: [PR] [SPARK-44752][SQL] XML: Update Spark Docs [spark]

Posted by "sandip-db (via GitHub)" <gi...@apache.org>.
sandip-db commented on PR #43350:
URL: https://github.com/apache/spark/pull/43350#issuecomment-1773004049

   > ./build/mvn -pl :spark-sql_2.13 clean compile...  it seems the construction method of XmlOptions is ambiguous @sandip-db

   I just completed a successful run of `./build/mvn -DskipTests clean package`.
   It looks like `mvn` in your case is picking up stale dependencies.
   
   

