Posted to reviews@spark.apache.org by "shujingyang-db (via GitHub)" <gi...@apache.org> on 2023/12/12 19:04:48 UTC

[PR] [SPARK-46382] XML: Capture values interspersed between elements [spark]

shujingyang-db opened a new pull request, #44318:
URL: https://github.com/apache/spark/pull/44318

   
   ### What changes were proposed in this pull request?
   In XML, elements typically consist of a name and a value, with the value enclosed between the opening and closing tags. But XML also allows arbitrary character data to appear interspersed between these elements. To capture these values, we provide an option named `valueTag`, which is enabled by default. Consider the following example:
   
   
   ```
   <ROW>
       <a>1</a>
     value1
     <b>
       value2
       <c>2</c>
       value3
     </b>
   </ROW>
   ```
   In this example, `<a>`, `<b>`, and `<c>` are named elements with their respective values enclosed within tags. The arbitrary values `value1`, `value2`, and `value3` are interspersed between them. Note that a single element can contain multiple such values (e.g. `value2` and `value3` inside `<b>`).

   We parse the values between tags into the `valueTag` field. If a single element contains multiple such values, the field is inferred as an array type.
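
   As a rough illustration (a sketch, not code or output from this PR; the file name and the exact inferred types are assumptions), reading the example above with the XML data source would behave along these lines:

   ```scala
   // Hypothetical usage sketch: "example.xml" is assumed to contain the <ROW>
   // record above, and "_VALUE" is the default name of the valueTag field.
   val df = spark.read
     .format("xml")
     .option("rowTag", "ROW")
     .load("example.xml")

   df.printSchema()
   // Expected shape (sketch):
   // root
   //  |-- _VALUE: string                  <- captures "value1"
   //  |-- a: long
   //  |-- b: struct
   //  |    |-- _VALUE: array<string>      <- captures "value2" and "value3"
   //  |    |-- c: long
   ```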
   
   ### Why are the changes needed?
   We should parse these values; otherwise they would be silently dropped, resulting in data loss.
   
   ### Does this PR introduce _any_ user-facing change?
   Yes. Values interspersed between elements, which were previously ignored, are now captured in the `valueTag` field by default.
   
   ### How was this patch tested?
   Unit tests.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No


Re: [PR] [SPARK-46382][SQL] XML: Capture values interspersed between elements [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon closed pull request #44318: [SPARK-46382][SQL] XML: Capture values interspersed between elements
URL: https://github.com/apache/spark/pull/44318


Re: [PR] [SPARK-46382][SQL] XML: Capture values interspersed between elements [spark]

Posted by "shujingyang-db (via GitHub)" <gi...@apache.org>.
shujingyang-db commented on code in PR #44318:
URL: https://github.com/apache/spark/pull/44318#discussion_r1434534192


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala:
##########
@@ -243,6 +244,22 @@ class XmlInferSchema(options: XmlOptions, caseSensitive: Boolean)
       }
     }
 
+    @tailrec
+    def inferAndCheckEndElement(parser: XMLEventReader): Boolean = {
+      parser.peek match {
+        case _: EndElement | _: EndDocument => true
+        case _: StartElement => false
+        case c: Characters if !c.isWhiteSpace =>
+          val characterType = inferFrom(c.getData)
+          parser.nextEvent()
+          addOrUpdateType(options.valueTag, characterType)

Review Comment:
   A value tag located after the closing tag of an inner element and before the closing tag of the outer element covers this scenario.
   ```
   <a>
       value2
       <b>1</b>
       value3
   </a>
   ```
   We cover this case in most of our test cases.
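
   For the snippet above, a hedged sketch of the inferred type of `a` (an expectation based on the PR description, assuming the default valueTag name `_VALUE`, not an assertion copied from a test):

   ```scala
   import org.apache.spark.sql.types._

   // value2 and value3 are both captured under _VALUE, so the field
   // becomes an array; <b>1</b> infers as a long.
   val expectedTypeOfA = StructType(Seq(
     StructField("_VALUE", ArrayType(StringType)),
     StructField("b", LongType)
   ))
   ```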



Re: [PR] [SPARK-46382][SQL] XML: Capture values interspersed between elements [spark]

Posted by "shujingyang-db (via GitHub)" <gi...@apache.org>.
shujingyang-db commented on code in PR #44318:
URL: https://github.com/apache/spark/pull/44318#discussion_r1430945581


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala:
##########
@@ -194,27 +210,30 @@ class StaxXmlParser(
       case (_: EndElement, _: DataType) => null
       case (c: Characters, ArrayType(st, _)) =>
         // For `ArrayType`, it needs to return the type of element. The values are merged later.
+        parser.next
         convertTo(c.getData, st)
       case (c: Characters, st: StructType) =>
-        // If a value tag is present, this can be an attribute-only element whose values is in that
-        // value tag field. Or, it can be a mixed-type element with both some character elements
-        // and other complex structure. Character elements are ignored.
-        val attributesOnly = st.fields.forall { f =>
-          f.name == options.valueTag || f.name.startsWith(options.attributePrefix)
-        }
-        if (attributesOnly) {
-          // If everything else is an attribute column, there's no complex structure.
-          // Just return the value of the character element, or null if we don't have a value tag
-          st.find(_.name == options.valueTag).map(
-            valueTag => convertTo(c.getData, valueTag.dataType)).orNull
-        } else {
-          // Otherwise, ignore this character element, and continue parsing the following complex
-          // structure
-          parser.next
-          parser.peek match {
-            case _: EndElement => null // no struct here at all; done
-            case _ => convertObject(parser, st)
-          }
+        parser.next
+        parser.peek match {
+          case _: EndElement =>
+            // It couldn't be an array of value tags
+            // as the opening tag is immediately followed by a closing tag.
+            if (isEmptyString(c)) {
+              return null
+            }
+            val indexOpt = getFieldNameToIndex(st).get(options.valueTag)
+            indexOpt match {
+              case Some(index) =>
+                convertTo(c.getData, st.fields(index).dataType)
+              case None => null
+            }
+          case _ =>
+            val row = convertObject(parser, st)
+            if (!isEmptyString(c)) {
+              addOrUpdate(row.toSeq(st).toArray, st, options.valueTag, c.getData, addToTail = false)
+            } else {
+              row
+            }
         }
       case (_: Characters, _: StringType) =>

Review Comment:
   We don't need to move to the next event; `currentStructureAsString` will advance the parser pointer.



Re: [PR] [SPARK-46382][SQL] XML: Capture values interspersed between elements [spark]

Posted by "shujingyang-db (via GitHub)" <gi...@apache.org>.
shujingyang-db commented on code in PR #44318:
URL: https://github.com/apache/spark/pull/44318#discussion_r1431029561


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/xml/XmlSuite.scala:
##########
@@ -1729,9 +1729,15 @@ class XmlSuite extends QueryTest with SharedSparkSession {
     val TAG_NAME = "tag"
     val VALUETAG_NAME = "_VALUE"
     val schema = buildSchema(
+      field(VALUETAG_NAME),

Review Comment:
   We sort the field names in ascending order; `_VALUE` comes before `_attr`.
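
   A minimal sketch of why `_VALUE` sorts first, using the field names from the test above (plain ASCII ordering):

   ```scala
   // 'V' (0x56) precedes 'a' (0x61) in ASCII, so "_VALUE" < "_attr" < "tag".
   val sorted = Seq("tag", "_attr", "_VALUE").sorted
   assert(sorted == Seq("_VALUE", "_attr", "tag"))
   ```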



Re: [PR] [SPARK-46382][SQL] XML: Capture values interspersed between elements [spark]

Posted by "shujingyang-db (via GitHub)" <gi...@apache.org>.
shujingyang-db commented on code in PR #44318:
URL: https://github.com/apache/spark/pull/44318#discussion_r1430944306


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala:
##########
@@ -171,16 +171,18 @@ class XmlInferSchema(options: XmlOptions, caseSensitive: Boolean)
           case _: EndElement => StringType
           case _ => inferField(parser)
         }
-      case c: Characters if !c.isWhiteSpace =>
+      // what about new line character
+      case c: Characters if !isEmptyString(c) =>

Review Comment:
   IMHO `inferObject` can't do this. This branch handles both primitive types and nested objects. If we return `inferObject(parser)`, primitive types will be inferred as struct fields of valueTag.



Re: [PR] [SPARK-46382][SQL] XML: Capture values interspersed between elements [spark]

Posted by "shujingyang-db (via GitHub)" <gi...@apache.org>.
shujingyang-db commented on code in PR #44318:
URL: https://github.com/apache/spark/pull/44318#discussion_r1434433839


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala:
##########
@@ -534,4 +541,55 @@ class XmlInferSchema(options: XmlOptions, caseSensitive: Boolean)
       }
     }
   }
+
+  /**
+   * This helper function merges the data type of value tags and inner elements.
+   * It could only be structure data. Consider the following case,
+   * <a>
+   *   value1
+   *   <b>1</b>
+   *   value2
+   * </a>
+   * Input: ''a struct<b int, _VALUE string>'' and ''_VALUE string''
+   * Return: ''a struct<b int, _VALUE array<string>>''
+   * @param objectType inner elements' type
+   * @param valueTagType value tag's type
+   */
+  private[xml] def addOrUpdateValueTagType(
+      objectType: DataType,
+      valueTagType: DataType): DataType = {
+    (objectType, valueTagType) match {
+      case (st: StructType, _) =>
+        // TODO(shujing): case sensitive?
+        val valueTagIndexOpt = st.getFieldIndex(options.valueTag)
+
+        valueTagIndexOpt match {
+          // If the field name exists in the inner elements,
+          // merge the type and infer the combined field as an array type if necessary
+          case Some(index) if !st(index).dataType.isInstanceOf[ArrayType] =>
+            updateStructField(
+              st,
+              index,
+              ArrayType(compatibleType(st(index).dataType, valueTagType)))
+          case Some(index) =>
+            updateStructField(st, index, compatibleType(st(index).dataType, valueTagType))

Review Comment:
   Yes, this branch handles the array-type case. If it's already an array, we merge the element types.



Re: [PR] [SPARK-46382][SQL] XML: Capture values interspersed between elements [spark]

Posted by "shujingyang-db (via GitHub)" <gi...@apache.org>.
shujingyang-db commented on code in PR #44318:
URL: https://github.com/apache/spark/pull/44318#discussion_r1437327958


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala:
##########
@@ -201,27 +217,30 @@ class StaxXmlParser(
       case (_: EndElement, _: DataType) => null
       case (c: Characters, ArrayType(st, _)) =>
         // For `ArrayType`, it needs to return the type of element. The values are merged later.
+        parser.next
         convertTo(c.getData, st)
       case (c: Characters, st: StructType) =>
-        // If a value tag is present, this can be an attribute-only element whose values is in that
-        // value tag field. Or, it can be a mixed-type element with both some character elements
-        // and other complex structure. Character elements are ignored.
-        val attributesOnly = st.fields.forall { f =>
-          f.name == options.valueTag || f.name.startsWith(options.attributePrefix)
-        }
-        if (attributesOnly) {
-          // If everything else is an attribute column, there's no complex structure.
-          // Just return the value of the character element, or null if we don't have a value tag
-          st.find(_.name == options.valueTag).map(
-            valueTag => convertTo(c.getData, valueTag.dataType)).orNull
-        } else {
-          // Otherwise, ignore this character element, and continue parsing the following complex
-          // structure
-          parser.next
-          parser.peek match {
-            case _: EndElement => null // no struct here at all; done
-            case _ => convertObject(parser, st)
-          }
+        parser.next
+        parser.peek match {
+          case _: EndElement =>
+            // It couldn't be an array of value tags
+            // as the opening tag is immediately followed by a closing tag.
+            if (c.isWhiteSpace) {
+              return null
+            }
+            val indexOpt = getFieldNameToIndex(st).get(options.valueTag)
+            indexOpt match {
+              case Some(index) =>
+                convertTo(c.getData, st.fields(index).dataType)
+              case None => null
+            }
+          case _ =>
+            val row = convertObject(parser, st)

Review Comment:
   Thanks for bringing this up! I added some test cases for comments. We still need this branch, as `convertObject` cannot handle value tags.



Re: [PR] [SPARK-46382][SQL] XML: Capture values interspersed between elements [spark]

Posted by "shujingyang-db (via GitHub)" <gi...@apache.org>.
shujingyang-db commented on code in PR #44318:
URL: https://github.com/apache/spark/pull/44318#discussion_r1431028302


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParserUtils.scala:
##########
@@ -70,7 +69,6 @@ object StaxXmlParserUtils {
   /**
    * Checks if current event points the EndElement.
    */
-  @tailrec

Review Comment:
   Good catch! Adding it back.



Re: [PR] [SPARK-46382][SQL] XML: Capture values interspersed between elements [spark]

Posted by "sandip-db (via GitHub)" <gi...@apache.org>.
sandip-db commented on code in PR #44318:
URL: https://github.com/apache/spark/pull/44318#discussion_r1434410674


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala:
##########
@@ -243,6 +244,22 @@ class XmlInferSchema(options: XmlOptions, caseSensitive: Boolean)
       }
     }
 
+    @tailrec
+    def inferAndCheckEndElement(parser: XMLEventReader): Boolean = {
+      parser.peek match {
+        case _: EndElement | _: EndDocument => true
+        case _: StartElement => false
+        case c: Characters if !c.isWhiteSpace =>
+          val characterType = inferFrom(c.getData)
+          parser.nextEvent()
+          addOrUpdateType(options.valueTag, characterType)

Review Comment:
   Is there a test case for this scenario?



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala:
##########
@@ -201,27 +217,30 @@ class StaxXmlParser(
       case (_: EndElement, _: DataType) => null
       case (c: Characters, ArrayType(st, _)) =>
         // For `ArrayType`, it needs to return the type of element. The values are merged later.
+        parser.next
         convertTo(c.getData, st)
       case (c: Characters, st: StructType) =>
-        // If a value tag is present, this can be an attribute-only element whose values is in that
-        // value tag field. Or, it can be a mixed-type element with both some character elements
-        // and other complex structure. Character elements are ignored.
-        val attributesOnly = st.fields.forall { f =>
-          f.name == options.valueTag || f.name.startsWith(options.attributePrefix)
-        }
-        if (attributesOnly) {
-          // If everything else is an attribute column, there's no complex structure.
-          // Just return the value of the character element, or null if we don't have a value tag
-          st.find(_.name == options.valueTag).map(
-            valueTag => convertTo(c.getData, valueTag.dataType)).orNull
-        } else {
-          // Otherwise, ignore this character element, and continue parsing the following complex
-          // structure
-          parser.next
-          parser.peek match {
-            case _: EndElement => null // no struct here at all; done
-            case _ => convertObject(parser, st)
-          }
+        parser.next
+        parser.peek match {
+          case _: EndElement =>
+            // It couldn't be an array of value tags
+            // as the opening tag is immediately followed by a closing tag.
+            if (c.isWhiteSpace) {
+              return null
+            }
+            val indexOpt = getFieldNameToIndex(st).get(options.valueTag)
+            indexOpt match {
+              case Some(index) =>
+                convertTo(c.getData, st.fields(index).dataType)
+              case None => null
+            }
+          case _ =>
+            val row = convertObject(parser, st)
+            if (!c.isWhiteSpace) {

Review Comment:
   We need to document the whitespace behavior for valueTag. Also consider the following scenarios, which contain whitespace within quotes:
   ```
   <ROW><a>" "</a></ROW>
   <ROW><b>" "<c>1</c></b></ROW>
   <ROW><d><e attr=" "></e></d></ROW>
   ```



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala:
##########
@@ -201,27 +217,30 @@ class StaxXmlParser(
       case (_: EndElement, _: DataType) => null
       case (c: Characters, ArrayType(st, _)) =>
         // For `ArrayType`, it needs to return the type of element. The values are merged later.
+        parser.next
         convertTo(c.getData, st)
       case (c: Characters, st: StructType) =>
-        // If a value tag is present, this can be an attribute-only element whose values is in that
-        // value tag field. Or, it can be a mixed-type element with both some character elements
-        // and other complex structure. Character elements are ignored.
-        val attributesOnly = st.fields.forall { f =>
-          f.name == options.valueTag || f.name.startsWith(options.attributePrefix)
-        }
-        if (attributesOnly) {
-          // If everything else is an attribute column, there's no complex structure.
-          // Just return the value of the character element, or null if we don't have a value tag
-          st.find(_.name == options.valueTag).map(
-            valueTag => convertTo(c.getData, valueTag.dataType)).orNull
-        } else {
-          // Otherwise, ignore this character element, and continue parsing the following complex
-          // structure
-          parser.next
-          parser.peek match {
-            case _: EndElement => null // no struct here at all; done
-            case _ => convertObject(parser, st)
-          }
+        parser.next
+        parser.peek match {
+          case _: EndElement =>
+            // It couldn't be an array of value tags
+            // as the opening tag is immediately followed by a closing tag.
+            if (c.isWhiteSpace) {
+              return null
+            }
+            val indexOpt = getFieldNameToIndex(st).get(options.valueTag)
+            indexOpt match {
+              case Some(index) =>
+                convertTo(c.getData, st.fields(index).dataType)
+              case None => null
+            }
+          case _ =>
+            val row = convertObject(parser, st)

Review Comment:
   Will this handle values separated by a comment or CDATA? If so, we don't need the `case _: EndElement` above.
   ```
   <ROW>
     <a> 1 <!--this is a comment--> 2 </a>
   </ROW>
   ```



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala:
##########
@@ -534,4 +541,55 @@ class XmlInferSchema(options: XmlOptions, caseSensitive: Boolean)
       }
     }
   }
+
+  /**
+   * This helper function merges the data type of value tags and inner elements.
+   * It could only be structure data. Consider the following case,
+   * <a>
+   *   value1
+   *   <b>1</b>
+   *   value2
+   * </a>
+   * Input: ''a struct<b int, _VALUE string>'' and ''_VALUE string''
+   * Return: ''a struct<b int, _VALUE array<string>>''
+   * @param objectType inner elements' type
+   * @param valueTagType value tag's type
+   */
+  private[xml] def addOrUpdateValueTagType(
+      objectType: DataType,
+      valueTagType: DataType): DataType = {
+    (objectType, valueTagType) match {
+      case (st: StructType, _) =>
+        // TODO(shujing): case sensitive?

Review Comment:
   While the case of valueTag is unlikely to change, it's better to add case-sensitivity logic to keep it consistent with other fields. This can be a separate PR; not a high priority.



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala:
##########
@@ -534,4 +541,55 @@ class XmlInferSchema(options: XmlOptions, caseSensitive: Boolean)
       }
     }
   }
+
+  /**
+   * This helper function merges the data type of value tags and inner elements.
+   * It could only be structure data. Consider the following case,
+   * <a>
+   *   value1
+   *   <b>1</b>
+   *   value2
+   * </a>
+   * Input: ''a struct<b int, _VALUE string>'' and ''_VALUE string''
+   * Return: ''a struct<b int, _VALUE array<string>>''
+   * @param objectType inner elements' type
+   * @param valueTagType value tag's type
+   */
+  private[xml] def addOrUpdateValueTagType(
+      objectType: DataType,
+      valueTagType: DataType): DataType = {
+    (objectType, valueTagType) match {
+      case (st: StructType, _) =>
+        // TODO(shujing): case sensitive?
+        val valueTagIndexOpt = st.getFieldIndex(options.valueTag)
+
+        valueTagIndexOpt match {
+          // If the field name exists in the inner elements,
+          // merge the type and infer the combined field as an array type if necessary
+          case Some(index) if !st(index).dataType.isInstanceOf[ArrayType] =>
+            updateStructField(
+              st,
+              index,
+              ArrayType(compatibleType(st(index).dataType, valueTagType)))
+          case Some(index) =>
+            updateStructField(st, index, compatibleType(st(index).dataType, valueTagType))

Review Comment:
   Won't `st(index).dataType` be of `ArrayType`?



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala:
##########
@@ -534,4 +541,55 @@ class XmlInferSchema(options: XmlOptions, caseSensitive: Boolean)
       }
     }
   }
+
+  /**
+   * This helper function merges the data type of value tags and inner elements.
+   * It could only be structure data. Consider the following case,
+   * <a>
+   *   value1
+   *   <b>1</b>
+   *   value2
+   * </a>
+   * Input: ''a struct<b int, _VALUE string>'' and ''_VALUE string''
+   * Return: ''a struct<b int, _VALUE array<string>>''
+   * @param objectType inner elements' type
+   * @param valueTagType value tag's type
+   */
+  private[xml] def addOrUpdateValueTagType(
+      objectType: DataType,
+      valueTagType: DataType): DataType = {
+    (objectType, valueTagType) match {
+      case (st: StructType, _) =>
+        // TODO(shujing): case sensitive?
+        val valueTagIndexOpt = st.getFieldIndex(options.valueTag)
+
+        valueTagIndexOpt match {
+          // If the field name exists in the inner elements,
+          // merge the type and infer the combined field as an array type if necessary
+          case Some(index) if !st(index).dataType.isInstanceOf[ArrayType] =>
+            updateStructField(
+              st,
+              index,
+              ArrayType(compatibleType(st(index).dataType, valueTagType)))
+          case Some(index) =>
+            updateStructField(st, index, compatibleType(st(index).dataType, valueTagType))

Review Comment:
   Let's add a test case for the scenario where `Array<LongType>` is updated to `Array<DoubleType>`:
   ```
   <ROW>
     <a>
       1
       <b>2</b>
       3
       <b>4</b>
       5.0
     </a>
   </ROW>
   ```
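
   For reference, a sketch of the schema this row should infer to (an expectation assuming the default valueTag name `_VALUE` and the usual long-to-double widening, not an assertion copied from the PR's tests):

   ```scala
   import org.apache.spark.sql.types._

   // The interspersed values 1, 3 and 5.0 widen to double, and the
   // repeated <b> elements form an array of long.
   val expected = StructType(Seq(
     StructField("a", StructType(Seq(
       StructField("_VALUE", ArrayType(DoubleType)),
       StructField("b", ArrayType(LongType))
     )))
   ))
   ```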



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala:
##########
@@ -201,27 +217,30 @@ class StaxXmlParser(
       case (_: EndElement, _: DataType) => null
       case (c: Characters, ArrayType(st, _)) =>
         // For `ArrayType`, it needs to return the type of element. The values are merged later.
+        parser.next
         convertTo(c.getData, st)
       case (c: Characters, st: StructType) =>
-        // If a value tag is present, this can be an attribute-only element whose values is in that
-        // value tag field. Or, it can be a mixed-type element with both some character elements
-        // and other complex structure. Character elements are ignored.
-        val attributesOnly = st.fields.forall { f =>
-          f.name == options.valueTag || f.name.startsWith(options.attributePrefix)
-        }
-        if (attributesOnly) {
-          // If everything else is an attribute column, there's no complex structure.
-          // Just return the value of the character element, or null if we don't have a value tag
-          st.find(_.name == options.valueTag).map(
-            valueTag => convertTo(c.getData, valueTag.dataType)).orNull
-        } else {
-          // Otherwise, ignore this character element, and continue parsing the following complex
-          // structure
-          parser.next
-          parser.peek match {
-            case _: EndElement => null // no struct here at all; done
-            case _ => convertObject(parser, st)
-          }
+        parser.next
+        parser.peek match {
+          case _: EndElement =>
+            // It couldn't be an array of value tags
+            // as the opening tag is immediately followed by a closing tag.
+            if (c.isWhiteSpace) {
+              return null
+            }
+            val indexOpt = getFieldNameToIndex(st).get(options.valueTag)
+            indexOpt match {
+              case Some(index) =>
+                convertTo(c.getData, st.fields(index).dataType)
+              case None => null
+            }
+          case _ =>
+            val row = convertObject(parser, st)
+            if (!c.isWhiteSpace) {
+              addOrUpdate(row.toSeq(st).toArray, st, options.valueTag, c.getData, addToTail = false)

Review Comment:
   Why is `addToTail` false here?



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala:
##########
@@ -194,27 +210,30 @@ class StaxXmlParser(
       case (_: EndElement, _: DataType) => null
       case (c: Characters, ArrayType(st, _)) =>
         // For `ArrayType`, it needs to return the type of element. The values are merged later.
+        parser.next
         convertTo(c.getData, st)
       case (c: Characters, st: StructType) =>
-        // If a value tag is present, this can be an attribute-only element whose values is in that
-        // value tag field. Or, it can be a mixed-type element with both some character elements
-        // and other complex structure. Character elements are ignored.
-        val attributesOnly = st.fields.forall { f =>
-          f.name == options.valueTag || f.name.startsWith(options.attributePrefix)
-        }
-        if (attributesOnly) {
-          // If everything else is an attribute column, there's no complex structure.
-          // Just return the value of the character element, or null if we don't have a value tag
-          st.find(_.name == options.valueTag).map(
-            valueTag => convertTo(c.getData, valueTag.dataType)).orNull
-        } else {
-          // Otherwise, ignore this character element, and continue parsing the following complex
-          // structure
-          parser.next
-          parser.peek match {
-            case _: EndElement => null // no struct here at all; done
-            case _ => convertObject(parser, st)
-          }
+        parser.next
+        parser.peek match {
+          case _: EndElement =>
+            // It couldn't be an array of value tags
+            // as the opening tag is immediately followed by a closing tag.
+            if (isEmptyString(c)) {
+              return null
+            }
+            val indexOpt = getFieldNameToIndex(st).get(options.valueTag)
+            indexOpt match {
+              case Some(index) =>
+                convertTo(c.getData, st.fields(index).dataType)

Review Comment:
   Yes, I get that. It looks like the assumption is that `convertField` will return either a `Row` or a singleton valueTag value.



Re: [PR] [SPARK-46382][SQL] XML: Capture values interspersed between elements [spark]

Posted by "shujingyang-db (via GitHub)" <gi...@apache.org>.
shujingyang-db commented on code in PR #44318:
URL: https://github.com/apache/spark/pull/44318#discussion_r1437189334


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala:
##########
@@ -534,4 +541,55 @@ class XmlInferSchema(options: XmlOptions, caseSensitive: Boolean)
       }
     }
   }
+
+  /**
+   * This helper function merges the data type of value tags and inner elements.
+   * It could only be structure data. Consider the following case,
+   * <a>
+   *   value1
+   *   <b>1</b>
+   *   value2
+   * </a>
+   * Input: ''a struct<b int, _VALUE string>'' and ''_VALUE string''
+   * Return: ''a struct<b int, _VALUE array<string>>''
+   * @param objectType inner elements' type
+   * @param valueTagType value tag's type
+   */
+  private[xml] def addOrUpdateValueTagType(
+      objectType: DataType,
+      valueTagType: DataType): DataType = {
+    (objectType, valueTagType) match {
+      case (st: StructType, _) =>
+        // TODO(shujing): case sensitive?

Review Comment:
   Thanks for answering this question! I created a JIRA ticket for it.



Re: [PR] [SPARK-46382][SQL] XML: Capture values interspersed between elements [spark]

Posted by "shujingyang-db (via GitHub)" <gi...@apache.org>.
shujingyang-db commented on code in PR #44318:
URL: https://github.com/apache/spark/pull/44318#discussion_r1434536452


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala:
##########
@@ -201,27 +217,30 @@ class StaxXmlParser(
       case (_: EndElement, _: DataType) => null
       case (c: Characters, ArrayType(st, _)) =>
         // For `ArrayType`, it needs to return the type of element. The values are merged later.
+        parser.next
         convertTo(c.getData, st)
       case (c: Characters, st: StructType) =>
-        // If a value tag is present, this can be an attribute-only element whose values is in that
-        // value tag field. Or, it can be a mixed-type element with both some character elements
-        // and other complex structure. Character elements are ignored.
-        val attributesOnly = st.fields.forall { f =>
-          f.name == options.valueTag || f.name.startsWith(options.attributePrefix)
-        }
-        if (attributesOnly) {
-          // If everything else is an attribute column, there's no complex structure.
-          // Just return the value of the character element, or null if we don't have a value tag
-          st.find(_.name == options.valueTag).map(
-            valueTag => convertTo(c.getData, valueTag.dataType)).orNull
-        } else {
-          // Otherwise, ignore this character element, and continue parsing the following complex
-          // structure
-          parser.next
-          parser.peek match {
-            case _: EndElement => null // no struct here at all; done
-            case _ => convertObject(parser, st)
-          }
+        parser.next
+        parser.peek match {
+          case _: EndElement =>
+            // It couldn't be an array of value tags
+            // as the opening tag is immediately followed by a closing tag.
+            if (c.isWhiteSpace) {
+              return null
+            }
+            val indexOpt = getFieldNameToIndex(st).get(options.valueTag)
+            indexOpt match {
+              case Some(index) =>
+                convertTo(c.getData, st.fields(index).dataType)
+              case None => null
+            }
+          case _ =>
+            val row = convertObject(parser, st)
+            if (!c.isWhiteSpace) {
+              addOrUpdate(row.toSeq(st).toArray, st, options.valueTag, c.getData, addToTail = false)

Review Comment:
   This is because, in this case, we encounter the interspersed value first and the nested objects afterwards. We want to make sure the value tag appears before the nested objects in the result.
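
   A minimal sketch of the ordering concern, with hypothetical values:

   ```scala
   // Hypothetical values: for <a>value1<b>1</b>value2</a>, convertObject has
   // already collected value2 by the time value1 is merged in, so prepending
   // (addToTail = false) preserves document order in the _VALUE array.
   val collectedByConvertObject = Seq("value2")
   val merged = "value1" +: collectedByConvertObject
   assert(merged == Seq("value1", "value2"))
   ```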



Re: [PR] [SPARK-46382][SQL] XML: Capture values interspersed between elements [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #44318:
URL: https://github.com/apache/spark/pull/44318#issuecomment-1871652553

   Merged to master.


Re: [PR] [SPARK-46382][SQL] XML: Capture values interspersed between elements [spark]

Posted by "shujingyang-db (via GitHub)" <gi...@apache.org>.
shujingyang-db commented on code in PR #44318:
URL: https://github.com/apache/spark/pull/44318#discussion_r1430950463


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala:
##########
@@ -194,27 +210,30 @@ class StaxXmlParser(
       case (_: EndElement, _: DataType) => null
       case (c: Characters, ArrayType(st, _)) =>
         // For `ArrayType`, it needs to return the type of element. The values are merged later.
+        parser.next
         convertTo(c.getData, st)
       case (c: Characters, st: StructType) =>
-        // If a value tag is present, this can be an attribute-only element whose values is in that
-        // value tag field. Or, it can be a mixed-type element with both some character elements
-        // and other complex structure. Character elements are ignored.
-        val attributesOnly = st.fields.forall { f =>
-          f.name == options.valueTag || f.name.startsWith(options.attributePrefix)
-        }
-        if (attributesOnly) {
-          // If everything else is an attribute column, there's no complex structure.
-          // Just return the value of the character element, or null if we don't have a value tag
-          st.find(_.name == options.valueTag).map(
-            valueTag => convertTo(c.getData, valueTag.dataType)).orNull
-        } else {
-          // Otherwise, ignore this character element, and continue parsing the following complex
-          // structure
-          parser.next
-          parser.peek match {
-            case _: EndElement => null // no struct here at all; done
-            case _ => convertObject(parser, st)
-          }
+        parser.next
+        parser.peek match {
+          case _: EndElement =>
+            // It couldn't be an array of value tags
+            // as the opening tag is immediately followed by a closing tag.
+            if (isEmptyString(c)) {
+              return null
+            }
+            val indexOpt = getFieldNameToIndex(st).get(options.valueTag)
+            indexOpt match {
+              case Some(index) =>
+                convertTo(c.getData, st.fields(index).dataType)

Review Comment:
   `convertTo` converts primitive values based on their data type; it will not return a `Row`.



Re: [PR] [SPARK-46382][SQL] XML: Capture values interspersed between elements [spark]

Posted by "sandip-db (via GitHub)" <gi...@apache.org>.
sandip-db commented on code in PR #44318:
URL: https://github.com/apache/spark/pull/44318#discussion_r1427373383


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala:
##########
@@ -567,6 +589,61 @@ class StaxXmlParser(
       castTo(data, FloatType).asInstanceOf[Float]
     }
   }
+  private[xml] def isEmptyString(c: Characters): Boolean = {
+    if (options.ignoreSurroundingSpaces) {
+      c.getData.trim.isEmpty
+    } else {
+      c.isWhiteSpace
+    }
+  }
+
+  @tailrec
+  private def parseAndCheckEndElement(
+      row: Array[Any],
+      schema: StructType,
+      parser: XMLEventReader): Boolean = {
+    parser.peek match {
+      case _: EndElement | _: EndDocument => true
+      case _: StartElement => false
+      case c: Characters if !isEmptyString(c) =>
+        parser.nextEvent()
+        addOrUpdate(row, schema, options.valueTag, c.getData)
+        parseAndCheckEndElement(row, schema, parser)
+      case _ =>
+        parser.nextEvent()
+        parseAndCheckEndElement(row, schema, parser)
+    }
+  }
+
+  private def addOrUpdate(
+      row: Array[Any],
+      schema: StructType,
+      name: String,
+      string: String,

Review Comment:
   ```suggestion
         data: String,
   ```



##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/xml/XmlSuite.scala:
##########
@@ -2178,4 +2166,49 @@ class XmlSuite extends QueryTest with SharedSparkSession {
     )
     testWriteReadRoundTrip(df, Map("nullValue" -> "null", "prefersDecimal" -> "true"))
   }
+
+  test("capture values interspersed between elements - simple") {
+    val df = spark.read.format("xml")
+      .option("rowTag", "ROW")
+      .option("multiLine", "true")
+      .load(getTestResourcePath(resDir + "values-simple.xml"))

Review Comment:
   Simple XML data can be embedded in the test suite itself, e.g. [using spark.createDataset](https://github.com/apache/spark/blob/c045a425bf0c472f164e3ef75a8a2c68d72d61d3/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/xml/XmlSuite.scala#L1537) or [writing to a temp file](https://github.com/apache/spark/blob/c045a425bf0c472f164e3ef75a8a2c68d72d61d3/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/xml/XmlSuite.scala#L2294)



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala:
##########
@@ -255,7 +275,12 @@ class StaxXmlParser(
         case e: StartElement =>
           kvPairs +=
             (UTF8String.fromString(StaxXmlParserUtils.getName(e.asStartElement.getName, options)) ->
-             convertField(parser, valueType))
+            convertField(parser, valueType))
+        case c: Characters if !isEmptyString(c) =>

Review Comment:
   ```suggestion
           case c: Characters if !c.isWhiteSpace =>
   ```



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala:
##########
@@ -194,27 +210,30 @@ class StaxXmlParser(
       case (_: EndElement, _: DataType) => null
       case (c: Characters, ArrayType(st, _)) =>
         // For `ArrayType`, it needs to return the type of element. The values are merged later.
+        parser.next
         convertTo(c.getData, st)
       case (c: Characters, st: StructType) =>
-        // If a value tag is present, this can be an attribute-only element whose values is in that
-        // value tag field. Or, it can be a mixed-type element with both some character elements
-        // and other complex structure. Character elements are ignored.
-        val attributesOnly = st.fields.forall { f =>
-          f.name == options.valueTag || f.name.startsWith(options.attributePrefix)
-        }
-        if (attributesOnly) {
-          // If everything else is an attribute column, there's no complex structure.
-          // Just return the value of the character element, or null if we don't have a value tag
-          st.find(_.name == options.valueTag).map(
-            valueTag => convertTo(c.getData, valueTag.dataType)).orNull
-        } else {
-          // Otherwise, ignore this character element, and continue parsing the following complex
-          // structure
-          parser.next
-          parser.peek match {
-            case _: EndElement => null // no struct here at all; done
-            case _ => convertObject(parser, st)
-          }
+        parser.next
+        parser.peek match {
+          case _: EndElement =>
+            // It couldn't be an array of value tags
+            // as the opening tag is immediately followed by a closing tag.
+            if (isEmptyString(c)) {

Review Comment:
   Let's not allow any whitespace values for `valueTag`.
   
   ```suggestion
               if (!c.isWhiteSpace) {
   ```



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParserUtils.scala:
##########
@@ -70,7 +69,6 @@ object StaxXmlParserUtils {
   /**
    * Checks if current event points the EndElement.
    */
-  @tailrec

Review Comment:
   Why was this removed?



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala:
##########
@@ -171,16 +171,18 @@ class XmlInferSchema(options: XmlOptions, caseSensitive: Boolean)
           case _: EndElement => StringType
           case _ => inferField(parser)
         }
-      case c: Characters if !c.isWhiteSpace =>
+      // what about new line character
+      case c: Characters if !isEmptyString(c) =>

Review Comment:
   For this case, can't we return `inferObject(parser)`?
   In `inferObject(parser)`, the case for `StructType` can be updated to "unnest" `StructType` with just `valueTag`.
   Without this, there is a lot of duplicated logic for valueTag.
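
   A sketch of the "unnest" idea (a suggestion under stated assumptions, not code from this PR; the helper name and shape are hypothetical):

   ```scala
   import org.apache.spark.sql.types._

   // A struct whose only field is the valueTag would collapse to that field's
   // type, letting inferObject cover the primitive case too.
   def unnestValueTagOnly(dt: DataType, valueTag: String): DataType = dt match {
     case st: StructType if st.fields.length == 1 && st.head.name == valueTag =>
       st.head.dataType
     case other => other
   }
   ```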



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala:
##########
@@ -194,27 +210,30 @@ class StaxXmlParser(
       case (_: EndElement, _: DataType) => null
       case (c: Characters, ArrayType(st, _)) =>
         // For `ArrayType`, it needs to return the type of element. The values are merged later.
+        parser.next
         convertTo(c.getData, st)
       case (c: Characters, st: StructType) =>
-        // If a value tag is present, this can be an attribute-only element whose values is in that
-        // value tag field. Or, it can be a mixed-type element with both some character elements
-        // and other complex structure. Character elements are ignored.
-        val attributesOnly = st.fields.forall { f =>
-          f.name == options.valueTag || f.name.startsWith(options.attributePrefix)
-        }
-        if (attributesOnly) {
-          // If everything else is an attribute column, there's no complex structure.
-          // Just return the value of the character element, or null if we don't have a value tag
-          st.find(_.name == options.valueTag).map(
-            valueTag => convertTo(c.getData, valueTag.dataType)).orNull
-        } else {
-          // Otherwise, ignore this character element, and continue parsing the following complex
-          // structure
-          parser.next
-          parser.peek match {
-            case _: EndElement => null // no struct here at all; done
-            case _ => convertObject(parser, st)
-          }
+        parser.next
+        parser.peek match {
+          case _: EndElement =>
+            // It couldn't be an array of value tags
+            // as the opening tag is immediately followed by a closing tag.
+            if (isEmptyString(c)) {
+              return null
+            }
+            val indexOpt = getFieldNameToIndex(st).get(options.valueTag)
+            indexOpt match {
+              case Some(index) =>
+                convertTo(c.getData, st.fields(index).dataType)
+              case None => null
+            }
+          case _ =>
+            val row = convertObject(parser, st)
+            if (!isEmptyString(c)) {

Review Comment:
   ```suggestion
               if (!c.isWhiteSpace) {
   ```



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala:
##########
@@ -194,27 +210,30 @@ class StaxXmlParser(
       case (_: EndElement, _: DataType) => null
       case (c: Characters, ArrayType(st, _)) =>
         // For `ArrayType`, it needs to return the type of element. The values are merged later.
+        parser.next
         convertTo(c.getData, st)
       case (c: Characters, st: StructType) =>
-        // If a value tag is present, this can be an attribute-only element whose values is in that
-        // value tag field. Or, it can be a mixed-type element with both some character elements
-        // and other complex structure. Character elements are ignored.
-        val attributesOnly = st.fields.forall { f =>
-          f.name == options.valueTag || f.name.startsWith(options.attributePrefix)
-        }
-        if (attributesOnly) {
-          // If everything else is an attribute column, there's no complex structure.
-          // Just return the value of the character element, or null if we don't have a value tag
-          st.find(_.name == options.valueTag).map(
-            valueTag => convertTo(c.getData, valueTag.dataType)).orNull
-        } else {
-          // Otherwise, ignore this character element, and continue parsing the following complex
-          // structure
-          parser.next
-          parser.peek match {
-            case _: EndElement => null // no struct here at all; done
-            case _ => convertObject(parser, st)
-          }
+        parser.next
+        parser.peek match {
+          case _: EndElement =>
+            // It couldn't be an array of value tags
+            // as the opening tag is immediately followed by a closing tag.
+            if (isEmptyString(c)) {
+              return null
+            }
+            val indexOpt = getFieldNameToIndex(st).get(options.valueTag)
+            indexOpt match {
+              case Some(index) =>
+                convertTo(c.getData, st.fields(index).dataType)
+              case None => null
+            }
+          case _ =>
+            val row = convertObject(parser, st)
+            if (!isEmptyString(c)) {
+              addOrUpdate(row.toSeq(st).toArray, st, options.valueTag, c.getData, addToTail = false)
+            } else {
+              row
+            }
         }
       case (_: Characters, _: StringType) =>

Review Comment:
   Is `parser.next` not required here?



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala:
##########
@@ -567,6 +589,61 @@ class StaxXmlParser(
       castTo(data, FloatType).asInstanceOf[Float]
     }
   }
+  private[xml] def isEmptyString(c: Characters): Boolean = {
+    if (options.ignoreSurroundingSpaces) {
+      c.getData.trim.isEmpty

Review Comment:
   This is the same as `c.isWhiteSpace`.



##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/xml/XmlSuite.scala:
##########
@@ -1729,9 +1729,15 @@ class XmlSuite extends QueryTest with SharedSparkSession {
     val TAG_NAME = "tag"
     val VALUETAG_NAME = "_VALUE"
     val schema = buildSchema(
+      field(VALUETAG_NAME),

Review Comment:
   Why were the fields rearranged?



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala:
##########
@@ -194,27 +210,30 @@ class StaxXmlParser(
       case (_: EndElement, _: DataType) => null
       case (c: Characters, ArrayType(st, _)) =>
         // For `ArrayType`, it needs to return the type of element. The values are merged later.
+        parser.next
         convertTo(c.getData, st)
       case (c: Characters, st: StructType) =>
-        // If a value tag is present, this can be an attribute-only element whose values is in that
-        // value tag field. Or, it can be a mixed-type element with both some character elements
-        // and other complex structure. Character elements are ignored.
-        val attributesOnly = st.fields.forall { f =>
-          f.name == options.valueTag || f.name.startsWith(options.attributePrefix)
-        }
-        if (attributesOnly) {
-          // If everything else is an attribute column, there's no complex structure.
-          // Just return the value of the character element, or null if we don't have a value tag
-          st.find(_.name == options.valueTag).map(
-            valueTag => convertTo(c.getData, valueTag.dataType)).orNull
-        } else {
-          // Otherwise, ignore this character element, and continue parsing the following complex
-          // structure
-          parser.next
-          parser.peek match {
-            case _: EndElement => null // no struct here at all; done
-            case _ => convertObject(parser, st)
-          }
+        parser.next
+        parser.peek match {
+          case _: EndElement =>
+            // It couldn't be an array of value tags
+            // as the opening tag is immediately followed by a closing tag.
+            if (isEmptyString(c)) {
+              return null
+            }
+            val indexOpt = getFieldNameToIndex(st).get(options.valueTag)
+            indexOpt match {
+              case Some(index) =>
+                convertTo(c.getData, st.fields(index).dataType)

Review Comment:
   Shouldn't this be returning a `Row`? If yes, please make sure that there is a test scenario covering this.



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala:
##########
@@ -398,15 +424,11 @@ class StaxXmlParser(
             badRecordException = badRecordException.orElse(Some(e))
         }
 
-        case c: Characters if !c.isWhiteSpace && isRootAttributesOnly =>
-          nameToIndex.get(options.valueTag) match {
-            case Some(index) =>
-              row(index) = convertTo(c.getData, schema(index).dataType)
-            case None => // do nothing
-          }
+        case c: Characters if !isEmptyString(c) =>

Review Comment:
   ```suggestion
           case c: Characters if !c.isWhiteSpace =>
   ```



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala:
##########
@@ -159,7 +159,7 @@ class XmlInferSchema(options: XmlOptions, caseSensitive: Boolean)
     parser.peek match {
       case _: EndElement => NullType
       case _: StartElement => inferObject(parser)
-      case c: Characters if c.isWhiteSpace =>
+      case c: Characters if isEmptyString(c) =>

Review Comment:
   ```suggestion
         case c: Characters if c.isWhiteSpace =>
   ```



##########
sql/core/src/test/resources/test-data/xml-resources/values-simple.xml:
##########
@@ -0,0 +1,11 @@
+<?xml version="1.0"?>
+<ROWSET>
+    <ROW>
+        value1
+        <a>
+            value2
+            <b>1</b>
+            value3
+        </a>
+    </ROW>

Review Comment:
   ```suggestion
           </a>
           value4
       </ROW>
   ```



##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/xml/XmlSuite.scala:
##########
@@ -1145,18 +1145,18 @@ class XmlSuite extends QueryTest with SharedSparkSession {
       .option("inferSchema", true)
       .xml(getTestResourcePath(resDir + "mixed_children.xml"))
     val mixedRow = mixedDF.head()
-    assert(mixedRow.getAs[Row](0).toSeq === Seq(" lorem "))
-    assert(mixedRow.getString(1) === " ipsum ")
+    assert(mixedRow.getAs[Row](0) === Row(List("issue", "text ignored"), "lorem"))

Review Comment:
   Update `text ignored` with something else.


