Posted to issues@drill.apache.org by "Vova Vysotskyi (Jira)" <ji...@apache.org> on 2020/12/21 18:31:00 UTC
[jira] [Updated] (DRILL-7823) Add XML Format Plugin

     [ https://issues.apache.org/jira/browse/DRILL-7823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vova Vysotskyi updated DRILL-7823:
----------------------------------
    Labels: ready-to-commit  (was: )

> Add XML Format Plugin
> ---------------------
>
>                 Key: DRILL-7823
>                 URL: https://issues.apache.org/jira/browse/DRILL-7823
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.17.0
>            Reporter: Charles Givre
>            Assignee: Charles Givre
>            Priority: Major
>              Labels: ready-to-commit
>             Fix For: 1.19.0
>
>
> # XML Format Reader
> This plugin enables Drill to read XML files without defining any kind of schema.
> ## Configuration
> Aside from the file extension, there is one configuration option:
> * `dataLevel`: XML data often contains a considerable amount of nesting that is not necessarily useful for data analysis. This parameter sets the nesting level 
>   where the data actually starts.  Levels start at `1`.
> The default configuration is shown below:
> ```json
> "xml": {
>   "type": "xml",
>   "extensions": [
>     "xml"
>   ],
>   "dataLevel": 2
> }
> ```
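> As a rough illustration of what a nesting level means, the Python sketch below selects the elements found at a given depth of a small document. It assumes the root element counts as level `1`; the helper `records_at_level` and the sample XML are hypothetical, and this is not the plugin's actual reader code.
> ```python
> import xml.etree.ElementTree as ET
>
> def records_at_level(xml_text, data_level):
>     """Return the elements at the given nesting depth (assumption: root is level 1)."""
>     current = [ET.fromstring(xml_text)]
>     # Descend one level at a time until the requested depth is reached.
>     for _ in range(data_level - 1):
>         current = [child for element in current for child in element]
>     return current
>
> # Hypothetical sample document, not taken from the Drill test data.
> sample = (
>     "<books>"
>     "<book><title>A</title></book>"
>     "<book><title>B</title></book>"
>     "</books>"
> )
>
> level_1 = [e.tag for e in records_at_level(sample, 1)]
> level_2 = [e.tag for e in records_at_level(sample, 2)]
> print(level_1, level_2)
> ```
> Under that assumption, level `1` yields only the wrapper element, while level `2` yields one element per `book`, which is the kind of nesting the option is meant to skip past.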
> ## Data Types
> All fields are read as strings.  Nested fields are read as maps.  Future functionality could include support for lists.
> ## Limitations: Schema Ambiguity
> XML is a challenging format to process as the structure does not give any hints about the schema.  For example, a JSON file might have the following record:
> ```json
> "record" : {
>   "intField" : 1,
>   "listField" : [1, 2],
>   "otherField" : {
>     "nestedField1" : "foo",
>     "nestedField2" : "bar"
>   }
> }
> ```
> From this data, it is clear that `listField` is a `list` and `otherField` is a `map`.  This same data could be represented in XML as follows:
> ```xml
> <record>
>   <intField>1</intField>
>   <listField>
>     <value>1</value>
>     <value>2</value>
>   </listField>
>   <otherField>
>     <nestedField1>foo</nestedField1>
>     <nestedField2>bar</nestedField2>
>   </otherField>
> </record>
> ```
> There is no problem parsing this data. But consider what would happen if we encountered the following first:
> ```xml
> <record>
>   <intField>1</intField>
>   <listField>
>     <value>2</value>
>   </listField>
>   <otherField>
>     <nestedField1>foo</nestedField1>
>     <nestedField2>bar</nestedField2>
>   </otherField>
> </record>
> ```
> In this example, there is no way for Drill to know whether `listField` is a `list` or a `map` because it only has one entry. 
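> The ambiguity can be reproduced with a naive structural check. The Python sketch below is only an illustration: `infer_kind` is a hypothetical helper, not part of Drill, and it treats a repeated child tag as the sole hint that an element is a list.
> ```python
> import xml.etree.ElementTree as ET
> from collections import Counter
>
> def infer_kind(element):
>     """Naively classify an element as 'scalar', 'list', or 'map'."""
>     children = list(element)
>     if not children:
>         return "scalar"
>     tag_counts = Counter(child.tag for child in children)
>     # A repeated child tag is the only structural hint of a list.
>     if any(count > 1 for count in tag_counts.values()):
>         return "list"
>     return "map"
>
> two_values = ET.fromstring(
>     "<record><listField><value>1</value><value>2</value></listField></record>"
> )
> one_value = ET.fromstring(
>     "<record><listField><value>2</value></listField></record>"
> )
>
> kind_two = infer_kind(two_values.find("listField"))
> kind_one = infer_kind(one_value.find("listField"))
> print(kind_two, kind_one)  # prints "list map"
> ```
> The same field is classified differently depending on which record is seen first, so a reader that infers types from structure alone cannot settle on a stable schema.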
> ## Future Functionality
> * **Build schema from XSD file or link**:  One of the major challenges of this reader is having to infer the schema of the data. XML files can provide a schema, although this is not
>  required.  In the future, if there is interest, we can extend this reader to use an XSD file to build the schema, which would then be used to parse the actual XML file. 
>   
> * **Infer Date Fields**: It may be possible to add the ability to infer date fields.
> * **List Support**:  Future functionality may include the ability to infer lists from data structures.  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)