You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@drill.apache.org by Mike Beckerle <mb...@apache.org> on 2023/10/07 17:46:40 UTC

Re: Question on Representing DFDL/XSD choice data for Drill (Unions required?)

Ok, after weeks of delay....

That helps a great deal. You flatten the array of maps into a table of maps.

I am confused still about when I must do square brackets versus dot
notation: data['a'] vs. data.a
The JSON documentation for Drill uses dot notation to reach into fields of
a map.

Ex: from the JSON doc:

{
  "type": "FeatureCollection",
  "features": [
  {
    "type": "Feature",
    "properties":
    {
      "MAPBLKLOT": "0001001",
      "BLKLOT": "0001001",
      "BLOCK_NUM": "0001",
      "LOT_NUM": "001",
       ....

The query uses SELECT features[0].properties.MAPBLKLOT, FROM ...
Which is using dot notation where in your queries on my JSON you did not
use dot notation.

I tried revising the queries you wrote using the dot notation, and it was
rejected. "no table named 'data'", but I'm not sure why.

Ex:

This works: (your original working query)

SELECT data['a'], data['b'] FROM (select flatten(record) AS data from
dfs.`/tmp/record.json`) WHERE data['b']['b1'] > 60.0;

But this fails:

SELECT data.a AS a, data.b AS b FROM (select flatten(record) AS data from
dfs.`/tmp/record.json`) WHERE data.b.b1 > 60.0;
Error: VALIDATION ERROR: From line 1, column 105 to line 1, column 108:
Table 'data' not found

But your sub-select defines 'data' as, I would assume, a table.

Can you help me clarify this?

[Error Id: 90c03b40-4f00-43b5-9de9-598102797b2f ] (state=,code=0)
apache drill>


On Mon, Sep 18, 2023 at 11:17 PM Charles Givre <cg...@gmail.com> wrote:

> Hi Mike,
> Let me answer your question with some queries:
>
>  >>> select * from dfs.test.`record.json`;
>
> +----------------------------------------------------------------------------------+
> |                                      record
>          |
>
> +----------------------------------------------------------------------------------+
> |
> [{"a":{"a1":5.0,"a2":6.0},"b":{}},{"a":{},"b":{"b1":55.0,"b2":66.0,"b3":77.0}},{"a":{"a1":7.0,"a2":8.0},"b":{}},{"a":{},"b":{"b1":77.0,"b2":88.0,"b3":99.0}}]
> |
>
> +----------------------------------------------------------------------------------+
>
> Now... I can flatten that like this:
>
> >>> select flatten(record) AS data from dfs.test.`record.json`;
> +----------------------------------------------+
> |                     data                     |
> +----------------------------------------------+
> | {"a":{"a1":5.0,"a2":6.0},"b":{}}             |
> | {"a":{},"b":{"b1":55.0,"b2":66.0,"b3":77.0}} |
> | {"a":{"a1":7.0,"a2":8.0},"b":{}}             |
> | {"a":{},"b":{"b1":77.0,"b2":88.0,"b3":99.0}} |
> +----------------------------------------------+
> 4 rows selected (0.298 seconds)
>
> You asked about filtering.   For this, I broke it up into a subquery, but
> here's how I did that:
>
> >>> SELECT data['a'], data['b']
> 2..semicolon> FROM (select flatten(record) AS data from
> dfs.test.`record.json`)
> 3..semicolon> WHERE data['b']['b1'] > 60.0;
> +--------+---------------------------------+
> | EXPR$0 |             EXPR$1              |
> +--------+---------------------------------+
> | {}     | {"b1":77.0,"b2":88.0,"b3":99.0} |
> +--------+---------------------------------+
> 1 row selected (0.379 seconds)
>
> I did all this without the union data type.
>
> Does this make sense?
> Best,
> -- C
>
>
> On Sep 13, 2023, at 11:08 AM, Mike Beckerle <mb...@apache.org> wrote:
>
> I'm thinking whether a first prototype of DFDL integration to Drill should
> just use JSON.
>
> But please consider this JSON:
>
> { "record": [
>    { "a": { "a1":5, "a2":6 } },
>    { "b": { "b1":55, "b2":66, "b3":77 } }
>    { "a": { "a1":7, "a2":8 } },
>    { "b": { "b1":77, "b2":88, "b3":99 } }
>  ] }
>
> It corresponds to this text data file, parsed using Daffodil:
>
>    105062556677107082778899
>
> The file is a stream of records. The first byte is a tag value 1 for type
> 'a' records, and 2 for type 'b' records.
> The 'a' records are 2 fixed length fields, each 2 bytes long, named a1 and
> a2. They are integers.
> The 'b' records are 3 fixed length fields, each 2 bytes long, named b1, b2,
> and b3. They are integers.
> This kind of format is very common, even textualized like this (from COBOL
> programs for example)
>
> Can Drill query the JSON above to get (b1, b2) where b1 > 10 ?
> (and ... does this require the experimental Union feature?)
>
> b1, b2
> ---------
> (55, 66)
> (77, 88)
>
> I ask because in an XML Schema or DFDL schema choices with dozens of
> 'branches' are very common.
> Ex: schema for the above data:
>
> <element name="record" maxOccurs="unbounded">
>   <complexType>
>      <choice><!-- there are sub-record types, a, b,... there could be many
> dozens of these -->
>          <element name="a">
>               <complexType>
>                    <sequence>
>                        ... many child elements let's say named a1, a2, ...
>                     </sequence>
>               </complexType>
>          </element>
>          <element name="b">
>               <complexType>
>                    <sequence>
>                        ... many child elements let's say named b1, b2, b3
> ...
>                     </sequence>
>               </complexType>
>          </element>
>    </choice>
>  </complexType>
> </element>
>
> To me XSD choice naturally requires a Union feature of some sort.
> If that's expermental still in Drill ... what to do?
>
> On Sun, Aug 6, 2023 at 10:19 AM Charles S. Givre <notifications@github.com
> >
> wrote:
>
> @mbeckerle <https://github.com/mbeckerle>
> You've encountered another challenge that exists in Drill reading data
> without a schema.
> Let me explain a bit about this and I'm going to use the JSON reader as an
> example. First Drill requires data to be homogeneous. Drill does have a
> Union vector type which allows heterogeneous data however this is a bit
> experimental and I wouldn't recommend using it. Also, it really just shifts
> schema inconsistencies to the user.
>
> For instance, let's say you have a column consisting of strings and
> floats. What happens if you try to do something like this:
>
> SELECT sum(mixed_col)-- orSELECT.... ORDER BY mixed_col
>
> Remembering that Drill is distributed and if you have a column with the
> same name and you try to do these operations, they will fail.
>
> Let's say we have data like this:
>
> [
>  {
>     'col1': 'Hi there',
>     'col2': 5.0
>  },
>  {
>     'col1':True,
>     'col2': 4,
>     'col3': 'foo'
>  }
> ]
>
> In older versions of Drill, this kind of data, this would throw all kinds
> of SchemaChangeExceptions. However, in recent versions of Drill, @jnturton
> <https://github.com/jnturton> submitted apache#2638
> <https://github.com/apache/drill/pull/2638> which overhauled implicit
> casting. What this meant for users is that col2 in the above would be
> automatically cast to a FLOAT and col1 would be automatically cast to a
> VARCHAR.
>
> However, when reading data the story is a little different. What we did
> for the JSON reader was have several read modes. The least tolerant
> attempts to infer all data types. This seems like a great idea in practice,
> however when you start actually using Drill with real data, you start
> seeing the issues with this approach. The JSON reader has a few
> configuration options that increase its tolerance for bad data. The next
> level is readAllNumbersAsDouble which... as the name implies, reads all
> numeric data as Doubles and does not attempt to infer ints vs floats. The
> next options is allTextMode which reads all fields as VARCHAR. This
> should be used when the data is so inconsistent that it cannot be read with
> either mode. These modes can be set globally, at the plugin level or at
> query time.
>
> For the XML reader, I didn't add type inference because I figured the data
> would be quite messy, however it wouldn't be that hard to add basically the
> same levels as the JSON reader.
>
> This fundamental issue exists in all the readers that read data without a
> schema. My rationale for working on the XSD reader is that this will enable
> us to accurately read XML data with all the correct data types.
>
> —
> Reply to this email directly, view it on GitHub
> <https://github.com/cgivre/drill/pull/6#issuecomment-1666875922>, or
> unsubscribe
> <
> https://github.com/notifications/unsubscribe-auth/AALUDAZZ6T6Z44AW44IKD2LXT6RVNANCNFSM6AAAAAA26ZZVQ4
> >
> .
> You are receiving this because you were mentioned.Message ID:
> <cg...@github.com>
>
>
>

Re: Question on Representing DFDL/XSD choice data for Drill (Unions required?)

Posted by Mike Beckerle <mb...@apache.org>.
Nevermind. I figured this out. Was due to 'properties' being a reserved
keyword. I created a PR to fix the JSON doc on the drill site.

On Sat, Oct 7, 2023 at 1:46 PM Mike Beckerle <mb...@apache.org> wrote:

> Ok, after weeks of delay....
>
> That helps a great deal. You flatten the array of maps into a table of
> maps.
>
> I am confused still about when I must do square brackets versus dot
> notation: data['a'] vs. data.a
> The JSON documentation for Drill uses dot notation to reach into fields of
> a map.
>
> Ex: from the JSON doc:
>
> {
>   "type": "FeatureCollection",
>   "features": [
>   {
>     "type": "Feature",
>     "properties":
>     {
>       "MAPBLKLOT": "0001001",
>       "BLKLOT": "0001001",
>       "BLOCK_NUM": "0001",
>       "LOT_NUM": "001",
>        ....
>
> The query uses SELECT features[0].properties.MAPBLKLOT, FROM ...
> Which is using dot notation where in your queries on my JSON you did not
> use dot notation.
>
> I tried revising the queries you wrote using the dot notation, and it was
> rejected. "no table named 'data'", but I'm not sure why.
>
> Ex:
>
> This works: (your original working query)
>
> SELECT data['a'], data['b'] FROM (select flatten(record) AS data from
> dfs.`/tmp/record.json`) WHERE data['b']['b1'] > 60.0;
>
> But this fails:
>
> SELECT data.a AS a, data.b AS b FROM (select flatten(record) AS data from
> dfs.`/tmp/record.json`) WHERE data.b.b1 > 60.0;
> Error: VALIDATION ERROR: From line 1, column 105 to line 1, column 108:
> Table 'data' not found
>
> But your sub-select defines 'data' as, I would assume, a table.
>
> Can you help me clarify this?
>
> [Error Id: 90c03b40-4f00-43b5-9de9-598102797b2f ] (state=,code=0)
> apache drill>
>
>
> On Mon, Sep 18, 2023 at 11:17 PM Charles Givre <cg...@gmail.com> wrote:
>
>> Hi Mike,
>> Let me answer your question with some queries:
>>
>>  >>> select * from dfs.test.`record.json`;
>>
>> +----------------------------------------------------------------------------------+
>> |                                      record
>>          |
>>
>> +----------------------------------------------------------------------------------+
>> |
>> [{"a":{"a1":5.0,"a2":6.0},"b":{}},{"a":{},"b":{"b1":55.0,"b2":66.0,"b3":77.0}},{"a":{"a1":7.0,"a2":8.0},"b":{}},{"a":{},"b":{"b1":77.0,"b2":88.0,"b3":99.0}}]
>> |
>>
>> +----------------------------------------------------------------------------------+
>>
>> Now... I can flatten that like this:
>>
>> >>> select flatten(record) AS data from dfs.test.`record.json`;
>> +----------------------------------------------+
>> |                     data                     |
>> +----------------------------------------------+
>> | {"a":{"a1":5.0,"a2":6.0},"b":{}}             |
>> | {"a":{},"b":{"b1":55.0,"b2":66.0,"b3":77.0}} |
>> | {"a":{"a1":7.0,"a2":8.0},"b":{}}             |
>> | {"a":{},"b":{"b1":77.0,"b2":88.0,"b3":99.0}} |
>> +----------------------------------------------+
>> 4 rows selected (0.298 seconds)
>>
>> You asked about filtering.   For this, I broke it up into a subquery, but
>> here's how I did that:
>>
>> >>> SELECT data['a'], data['b']
>> 2..semicolon> FROM (select flatten(record) AS data from
>> dfs.test.`record.json`)
>> 3..semicolon> WHERE data['b']['b1'] > 60.0;
>> +--------+---------------------------------+
>> | EXPR$0 |             EXPR$1              |
>> +--------+---------------------------------+
>> | {}     | {"b1":77.0,"b2":88.0,"b3":99.0} |
>> +--------+---------------------------------+
>> 1 row selected (0.379 seconds)
>>
>> I did all this without the union data type.
>>
>> Does this make sense?
>> Best,
>> -- C
>>
>>
>> On Sep 13, 2023, at 11:08 AM, Mike Beckerle <mb...@apache.org> wrote:
>>
>> I'm thinking whether a first prototype of DFDL integration to Drill should
>> just use JSON.
>>
>> But please consider this JSON:
>>
>> { "record": [
>>    { "a": { "a1":5, "a2":6 } },
>>    { "b": { "b1":55, "b2":66, "b3":77 } }
>>    { "a": { "a1":7, "a2":8 } },
>>    { "b": { "b1":77, "b2":88, "b3":99 } }
>>  ] }
>>
>> It corresponds to this text data file, parsed using Daffodil:
>>
>>    105062556677107082778899
>>
>> The file is a stream of records. The first byte is a tag value 1 for type
>> 'a' records, and 2 for type 'b' records.
>> The 'a' records are 2 fixed length fields, each 2 bytes long, named a1 and
>> a2. They are integers.
>> The 'b' records are 3 fixed length fields, each 2 bytes long, named b1,
>> b2,
>> and b3. They are integers.
>> This kind of format is very common, even textualized like this (from COBOL
>> programs for example)
>>
>> Can Drill query the JSON above to get (b1, b2) where b1 > 10 ?
>> (and ... does this require the experimental Union feature?)
>>
>> b1, b2
>> ---------
>> (55, 66)
>> (77, 88)
>>
>> I ask because in an XML Schema or DFDL schema choices with dozens of
>> 'branches' are very common.
>> Ex: schema for the above data:
>>
>> <element name="record" maxOccurs="unbounded">
>>   <complexType>
>>      <choice><!-- there are sub-record types, a, b,... there could be many
>> dozens of these -->
>>          <element name="a">
>>               <complexType>
>>                    <sequence>
>>                        ... many child elements let's say named a1, a2, ...
>>                     </sequence>
>>               </complexType>
>>          </element>
>>          <element name="b">
>>               <complexType>
>>                    <sequence>
>>                        ... many child elements let's say named b1, b2, b3
>> ...
>>                     </sequence>
>>               </complexType>
>>          </element>
>>    </choice>
>>  </complexType>
>> </element>
>>
>> To me XSD choice naturally requires a Union feature of some sort.
>> If that's expermental still in Drill ... what to do?
>>
>> On Sun, Aug 6, 2023 at 10:19 AM Charles S. Givre <
>> notifications@github.com>
>> wrote:
>>
>> @mbeckerle <https://github.com/mbeckerle>
>> You've encountered another challenge that exists in Drill reading data
>> without a schema.
>> Let me explain a bit about this and I'm going to use the JSON reader as an
>> example. First Drill requires data to be homogeneous. Drill does have a
>> Union vector type which allows heterogeneous data however this is a bit
>> experimental and I wouldn't recommend using it. Also, it really just
>> shifts
>> schema inconsistencies to the user.
>>
>> For instance, let's say you have a column consisting of strings and
>> floats. What happens if you try to do something like this:
>>
>> SELECT sum(mixed_col)-- orSELECT.... ORDER BY mixed_col
>>
>> Remembering that Drill is distributed and if you have a column with the
>> same name and you try to do these operations, they will fail.
>>
>> Let's say we have data like this:
>>
>> [
>>  {
>>     'col1': 'Hi there',
>>     'col2': 5.0
>>  },
>>  {
>>     'col1':True,
>>     'col2': 4,
>>     'col3': 'foo'
>>  }
>> ]
>>
>> In older versions of Drill, this kind of data, this would throw all kinds
>> of SchemaChangeExceptions. However, in recent versions of Drill, @jnturton
>> <https://github.com/jnturton> submitted apache#2638
>> <https://github.com/apache/drill/pull/2638> which overhauled implicit
>> casting. What this meant for users is that col2 in the above would be
>> automatically cast to a FLOAT and col1 would be automatically cast to a
>> VARCHAR.
>>
>> However, when reading data the story is a little different. What we did
>> for the JSON reader was have several read modes. The least tolerant
>> attempts to infer all data types. This seems like a great idea in
>> practice,
>> however when you start actually using Drill with real data, you start
>> seeing the issues with this approach. The JSON reader has a few
>> configuration options that increase its tolerance for bad data. The next
>> level is readAllNumbersAsDouble which... as the name implies, reads all
>> numeric data as Doubles and does not attempt to infer ints vs floats. The
>> next options is allTextMode which reads all fields as VARCHAR. This
>> should be used when the data is so inconsistent that it cannot be read
>> with
>> either mode. These modes can be set globally, at the plugin level or at
>> query time.
>>
>> For the XML reader, I didn't add type inference because I figured the data
>> would be quite messy, however it wouldn't be that hard to add basically
>> the
>> same levels as the JSON reader.
>>
>> This fundamental issue exists in all the readers that read data without a
>> schema. My rationale for working on the XSD reader is that this will
>> enable
>> us to accurately read XML data with all the correct data types.
>>
>> —
>> Reply to this email directly, view it on GitHub
>> <https://github.com/cgivre/drill/pull/6#issuecomment-1666875922>, or
>> unsubscribe
>> <
>> https://github.com/notifications/unsubscribe-auth/AALUDAZZ6T6Z44AW44IKD2LXT6RVNANCNFSM6AAAAAA26ZZVQ4
>> >
>> .
>> You are receiving this because you were mentioned.Message ID:
>> <cg...@github.com>
>>
>>
>>