Posted to issues@drill.apache.org by "Paul Rogers (Jira)" <ji...@apache.org> on 2021/07/02 00:12:00 UTC
[jira] [Commented] (DRILL-7954) XML ability to not concatenate fields and attribute - change presentation of data

    [ https://issues.apache.org/jira/browse/DRILL-7954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17373110#comment-17373110 ] 

Paul Rogers commented on DRILL-7954:
------------------------------------

[~cgivre] provides a good overview of the issue. Unlike JSON, XML syntax provides no hints of the expected structure of an element; Drill has to guess, and has to make that guess looking ahead only one token. This was quite difficult in JSON (where we at least have the `[...]` syntax) and is intractable for "plain" XML.
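
To make the ambiguity concrete, here is a minimal Python sketch (not Drill code; `stream_children` is a hypothetical helper). It walks a row's children in document order and commits to a type the first time a tag is seen, roughly what a reader with one token of lookahead is forced to do. The second document shows the guess going wrong only after the column has already been typed.

```python
import xml.etree.ElementTree as ET

def stream_children(xml_text):
    """Walk the row's children in document order, committing to a type on first sight."""
    committed = {}   # tag -> "scalar", decided when the tag is first seen
    conflicts = []   # tags that later turned out to repeat
    for child in ET.fromstring(xml_text):
        if child.tag not in committed:
            committed[child.tag] = "scalar"  # nothing in the syntax says otherwise
        elif child.tag not in conflicts:
            conflicts.append(child.tag)      # too late: the column was already typed
    return committed, conflicts

one = "<row><field1>a</field1></row>"
two = "<row><field1>a</field1><field1>b</field1></row>"

print(stream_children(one))  # ({'field1': 'scalar'}, [])
print(stream_children(two))  # ({'field1': 'scalar'}, ['field1'])
```

Both documents start identically, so no amount of cleverness at the first `<field1>` can tell them apart.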


In addition to the solutions that Charles mentioned, one could create a custom parser, one that knows that the `<field1>` element is a list. Of course, rather than hand-coding a parser for each schema, it would be better to provide parameters to a single parser, which is where the XML schema comes in.

One can also go the other way: as Charles noted, Drill has an (obscure) provided schema feature which says the expected type of each column. This is a bass-ackward way to specify a schema: if Drill knows that `field1` is a `REPEATED VARCHAR`, then the parser can interpret `<field1>` as containing a list of strings. There are obvious limits, but this is a place to start. ([~cgivre], does the XML parser support a provided schema?)
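
A rough Python sketch of the idea, under the assumption that the schema is available to the reader as a simple name-to-type map. The `SCHEMA` dict and `read_row` are hypothetical; the type names borrow Drill's vocabulary, but this is not Drill's provided-schema format or its reader API.

```python
import xml.etree.ElementTree as ET

# Hypothetical provided schema: column name -> type.
SCHEMA = {"field1": "REPEATED VARCHAR", "field2": "VARCHAR"}

def read_row(xml_text, schema):
    """Build one row, letting the schema (not the data) decide scalar vs. list."""
    # Repeated columns start as empty lists, so a single value is still a list.
    row = {name: [] for name, typ in schema.items() if typ.startswith("REPEATED")}
    for child in ET.fromstring(xml_text):
        typ = schema.get(child.tag)
        if typ is None:
            continue  # column not in the schema: ignored in this sketch
        if typ.startswith("REPEATED"):
            row[child.tag].append(child.text)
        else:
            row[child.tag] = child.text
    return row

print(read_row("<row><field1>a</field1><field2>x</field2></row>", SCHEMA))
# {'field1': ['a'], 'field2': 'x'}
```

Here a single `<field1>` still comes back as a one-element list, because the schema said so before the first token was read.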

Finally, one other choice is to use XML attributes to encode structure. I'm pretty rusty on XML, but I believe there was some standard 20 years ago that let you give the element type: `<field1 type="list:string">` or some such. We used it heavily in a SOAP API back when dinosaurs roamed... The Drill XML parser would have to understand the attributes, and your input would have to include them.
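
As a sketch of what "understanding the attributes" might look like, the Python below reads the hypothetical `type="list:string"` attribute from the example above. The attribute name, its value syntax, and `read_typed_row` are all assumptions for illustration, not an existing standard or Drill feature.

```python
import xml.etree.ElementTree as ET

def read_typed_row(xml_text):
    """Use a hypothetical type="list:string" attribute to decide scalar vs. list."""
    row = {}
    for child in ET.fromstring(xml_text):
        declared = child.get("type", "string")  # untyped elements default to a scalar string
        if declared.startswith("list:"):
            row.setdefault(child.tag, []).append(child.text)
        else:
            row[child.tag] = child.text
    return row

doc = '<row><field1 type="list:string">a</field1><name>n</name></row>'
print(read_typed_row(doc))  # {'field1': ['a'], 'name': 'n'}
```

The cost is the one already noted: every input file has to carry the annotations.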

If Drill were to support the XML schema description, it would be best to do so at plan time, and compile the resulting parser outline into the execution plan. This way, the (perhaps hundreds of) readers would not all have to do the same schema downloading, parsing, translation and error reporting. The reader could even generate Java code to implement the parser to avoid the slow and tedious interpreter-based code otherwise required.
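
The split between plan time and execution time can be sketched as follows. This is a toy in Python, not Drill's planner: `plan_parser_outline` and `open_reader` are hypothetical names, the "schema" is a plain dict standing in for a parsed XML schema, and JSON stands in for whatever form the execution plan would actually take.

```python
import json

def plan_parser_outline(schema):
    """Plan-time step (runs once): turn a schema into a compact parser outline."""
    outline = {}
    for name, typ in schema.items():
        if typ not in ("VARCHAR", "REPEATED VARCHAR"):
            # Errors are reported once, at plan time, not by every reader.
            raise ValueError(f"unsupported type for {name}: {typ}")
        outline[name] = {"repeated": typ.startswith("REPEATED")}
    return json.dumps(outline)  # serialized into the (hypothetical) execution plan

def open_reader(serialized_outline):
    """Execution-time step (runs per reader): no download, no schema parsing, no validation."""
    return json.loads(serialized_outline)

plan_fragment = plan_parser_outline({"field1": "REPEATED VARCHAR"})
readers = [open_reader(plan_fragment) for _ in range(3)]
print(readers[0])  # {'field1': {'repeated': True}}
```

Each reader only deserializes a small, already-validated outline; the expensive and error-prone work happens once.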

The bottom line is that, while Drill is "schema-free", that does not mean that schemas are not needed (they are); it just means that Drill is not well suited to data that needs a schema, such as XML.

> XML ability to not concatenate fields and attribute - change presentation of data
> ---------------------------------------------------------------------------------
>
>                 Key: DRILL-7954
>                 URL: https://issues.apache.org/jira/browse/DRILL-7954
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.19.0
>            Reporter: benj
>            Priority: Major
>
> With an XML file containing this data:
> {noformat}
> <a>
>   <attr>
>     <set num="0" val="1">x</set>
>     <set num="1" val="2">y</set>
>   </attr>
>   <attr>
>     <set num="2" val="a">z</set>
>     <set num="3" val="b">a</set>
>   </attr>
> </a>
> {noformat}
> {noformat}
> apache drill> SELECT * FROM TABLE(dfs.test.`attributetest.xml`(type=>'xml', dataLevel=>1)) as x;
> +-----------------------------------------------+----------------+
> |                  attributes                   |      attr      |
> +-----------------------------------------------+----------------+
> | {"attr_set_num":"0123","attr_set_val":"12ab"} | {"set":"xyza"} |
> +-----------------------------------------------+----------------+
> apache drill> SELECT * FROM TABLE(dfs.test.`attributetest.xml`(type=>'xml', dataLevel=>2)) as x;
> +---------------------------------+-----+
> |           attributes            | set |
> +---------------------------------+-----+
> | {"set_num":"01","set_val":"12"} | xy  |
> | {"set_num":"23","set_val":"ab"} | za  |
> +---------------------------------+-----+
> apache drill> SELECT * FROM TABLE(dfs.test.`attributetest.xml`(type=>'xml', dataLevel=>3)) as x;
> +------------+
> | attributes |
> +------------+
> | {}         |
> | {}         |
> | {}         |
> | {}         |
> +------------+
> {noformat}
> Attributes and fields with the same name are concatenated and remain unusable _(maybe the possibility of adding a separator would help, but that's not the point here)_
> In fact, what we really need is the ability to obtain something like _(depending on the defined level)_:
> {noformat}
> +---------------------------------------------------------------------------------------------------+
> |                                               attr                                                |
> +---------------------------------------------------------------------------------------------------+
> | [{"set":"x","_attributes":{"num":"0","val":"1"}},{"set":"y","_attributes":{"num":"1","val":"2"}}] |
> | [{"set":"z","_attributes":{"num":"2","val":"a"}},{"set":"a","_attributes":{"num":"3","val":"b"}}] |
> +---------------------------------------------------------------------------------------------------+
> +-------------------------------------------------+
> |                       set                       |
> +-------------------------------------------------+
> | {"set":"x","_attributes":{"num":"0","val":"1"}} |
> | {"set":"y","_attributes":{"num":"1","val":"2"}} |
> | {"set":"z","_attributes":{"num":"2","val":"a"}} |
> | {"set":"a","_attributes":{"num":"3","val":"b"}} |
> +-------------------------------------------------+
> {noformat}
> _attributes fields could be generated at each level instead of being generated with the path from the top level => that would allow working with data from each level without losing information
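
For reference, the shape requested in the description above can be produced from the sample file with a few lines of Python. This is only a sketch of the desired output, not of how Drill's XML reader works; `set_record` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

XML = """<a>
  <attr>
    <set num="0" val="1">x</set>
    <set num="1" val="2">y</set>
  </attr>
  <attr>
    <set num="2" val="a">z</set>
    <set num="3" val="b">a</set>
  </attr>
</a>"""

def set_record(elem):
    """One record per element: its text plus its own attributes, nothing concatenated."""
    return {elem.tag: elem.text, "_attributes": dict(elem.attrib)}

root = ET.fromstring(XML)
# One row per <attr>, each holding a list of <set> records.
attr_rows = [[set_record(s) for s in attr] for attr in root]
# One row per <set>.
set_rows = [set_record(s) for attr in root for s in attr]

print(attr_rows[0])
# [{'set': 'x', '_attributes': {'num': '0', 'val': '1'}}, {'set': 'y', '_attributes': {'num': '1', 'val': '2'}}]
print(set_rows[3])
# {'set': 'a', '_attributes': {'num': '3', 'val': 'b'}}
```

Note that this sketch has the whole document in memory and already knows that `<set>` repeats, which is exactly the knowledge a streaming reader lacks.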


