Posted to issues@drill.apache.org by "Vova Vysotskyi (Jira)" <ji...@apache.org> on 2020/12/21 18:31:00 UTC
[jira] [Updated] (DRILL-7823) Add XML Format Plugin
[ https://issues.apache.org/jira/browse/DRILL-7823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vova Vysotskyi updated DRILL-7823:
----------------------------------
Labels: ready-to-commit (was: )
> Add XML Format Plugin
> ---------------------
>
> Key: DRILL-7823
> URL: https://issues.apache.org/jira/browse/DRILL-7823
> Project: Apache Drill
> Issue Type: Improvement
> Affects Versions: 1.17.0
> Reporter: Charles Givre
> Assignee: Charles Givre
> Priority: Major
> Labels: ready-to-commit
> Fix For: 1.19.0
>
>
> # XML Format Reader
> This plugin enables Drill to read XML files without defining any kind of schema.
> ## Configuration
> Aside from the file extension, there is one configuration option:
> * `dataLevel`: XML data often contains a considerable amount of nesting that is not necessarily useful for data analysis. This parameter lets you set the nesting level where the data actually starts. The levels start at `1`.
> The default configuration is shown below:
> ```json
> "xml": {
> "type": "xml",
> "extensions": [
> "xml"
> ],
> "dataLevel": 2
> }
> ```
> ## Data Types
> All fields are read as strings. Nested fields are read as maps. Future functionality could include support for lists.
> ## Limitations: Schema Ambiguity
> XML is a challenging format to process as the structure does not give any hints about the schema. For example, a JSON file might have the following record:
> ```json
> "record" : {
> "intField" : 1,
> "listField" : [1, 2],
> "otherField" : {
> "nestedField1" : "foo",
> "nestedField2" : "bar"
> }
> }
> ```
> From this data, it is clear that `listField` is a `list` and `otherField` is a map. This same data could be represented in XML as follows:
> ```xml
> <record>
> <intField>1</intField>
> <listField>
> <value>1</value>
> <value>2</value>
> </listField>
> <otherField>
> <nestedField1>foo</nestedField1>
> <nestedField2>bar</nestedField2>
> </otherField>
> </record>
> ```
> Parsing this data poses no problem. But consider what would happen if we encountered the following first:
> ```xml
> <record>
> <intField>1</intField>
> <listField>
> <value>2</value>
> </listField>
> <otherField>
> <nestedField1>foo</nestedField1>
> <nestedField2>bar</nestedField2>
> </otherField>
> </record>
> ```
> In this example, there is no way for Drill to know whether `listField` is a `list` or a `map` because it only has one entry.
> ## Future Functionality
> * **Build schema from XSD file or link**: One of the major challenges of this reader is having to infer the schema of the data. XML files can provide a schema, although this is not required. In the future, if there is interest, we can extend this reader to use an XSD file to build the schema, which will then be used to parse the actual XML file.
>
> * **Infer Date Fields**: It may be possible to add the ability to infer date fields.
> * **List Support**: Future functionality may include the ability to infer lists from data structures.
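
The schema ambiguity described under "Limitations" in the quoted description can be reproduced with a small standalone sketch. This is not the plugin's code and does not use any Drill API: `infer_kind` and `infer_record` are hypothetical helpers written only for illustration, using Python's standard `xml.etree.ElementTree`. The sketch assumes a naive rule in which an element counts as a list only when it has a repeated child tag. Under that assumption, the same `listField` is classified differently depending on which record is seen first.

```python
import xml.etree.ElementTree as ET

# The two records from the description: listField with two values, then with one.
TWO_VALUES = """<record>
  <intField>1</intField>
  <listField>
    <value>1</value>
    <value>2</value>
  </listField>
  <otherField>
    <nestedField1>foo</nestedField1>
    <nestedField2>bar</nestedField2>
  </otherField>
</record>"""

ONE_VALUE = """<record>
  <intField>1</intField>
  <listField>
    <value>2</value>
  </listField>
  <otherField>
    <nestedField1>foo</nestedField1>
    <nestedField2>bar</nestedField2>
  </otherField>
</record>"""


def infer_kind(element):
    """Hypothetical helper: guess an element's kind from a single record."""
    children = list(element)
    if not children:
        # Leaf elements carry text only; treat them as strings.
        return "string"
    tags = [child.tag for child in children]
    # Assumed rule: a repeated child tag is the only structural hint of a list.
    if len(tags) != len(set(tags)):
        return "list"
    return "map"


def infer_record(xml_text):
    """Hypothetical helper: map each top-level field of a record to its guessed kind."""
    record = ET.fromstring(xml_text)
    return {child.tag: infer_kind(child) for child in record}


print(infer_record(TWO_VALUES))
print(infer_record(ONE_VALUE))
```

With the first record, `listField` is reported as a list; with the second, the same field is reported as a map, because a single `<value>` child is indistinguishable from a one-entry map. An XSD-driven schema, as proposed under "Future Functionality", would remove the need for this kind of guessing.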
--
This message was sent by Atlassian Jira
(v8.3.4#803005)