You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@drill.apache.org by Joe Harris <jo...@gmail.com> on 2016/05/18 00:54:47 UTC

WHERE clause comparison to maps or arrays not possible?

I'm querying parquet data which contains large numbers of empty maps/arrays that I'd like to eliminate from the query as early as possible. The events I'm querying are very sparse in the data.

> SELECT par.check.detections.list AS detections FROM `/my.parquet/year=2016/month=5` par LIMIT 10;
10 rows selected (3.157 seconds)
+-----------------------------+
| {"detections":{"list":[]}}  |
+-----------------------------+


I've tried to use the map directly but the query fails.

> SELECT par.check AS check 
> FROM `/my.parquet/year=2016/month=5` par
> WHERE par.check <> '{"detections":{"list":[]}}' ;
Error in expression at index -1.  Error: Missing function implementation: [equal(MAP-REQUIRED, VARCHAR-REQUIRED)].  Full expression: --UNKNOWN EXPRESSION--..


I get a similar result if I reference down to the empty array.

> SELECT par.check.detections.list AS detections 
> FROM `/my.parquet/year=2016/month=5` par
> WHERE par.check.detections.list <> '[]' ;
Error in expression at index -1.  Error: Missing function implementation: [equal(MAP-REPEATED, VARCHAR-REQUIRED)].  Full expression: --UNKNOWN EXPRESSION--..


I'm also unable to refer to an array element without using FLATTEN().

> SELECT par.check.detections.list[0]
> FROM `/my.parquet/year=2016/month=5` par;
Error: Unexpected RuntimeException: java.lang.NullPointerException (state=,code=0)


When I use FLATTEN() it automatically eliminates empty arrays and query time goes way up. It seems to be doing quite a bit of work on each empty array before eliminating it.

> SELECT  FLATTEN(par.check.detections.list) AS detections
> FROM   `/my.parquet/year=2016/month=5` par
> LIMIT 10;
No rows selected (290.475 seconds)


I can see in the docs that map and array are internal only data types that aren't exposed to the user (https://drill.apache.org/docs/supported-data-types/#composite-types). Is this on the roadmap somewhere to be added?




Thanks,
Joe



Re: WHERE clause comparison to maps or arrays not possible?

Posted by Andries Engelbrecht <ae...@maprtech.com>.
Joe,

Not sure I fully get the picture on your data, but you may want to see if pointing at the first element in an array and looking for nulls works in your case.
I have used that in the past to eliminate records or elements from the result set in a subquery.

Perhaps try something like this

SELECT par.check 
FROM <workspace.dir>
WHERE par.check.detections.list[0] IS NOT NULL


FLATTEN may be very expensive for sparsely populated data. Might be good to capture statistics and file a JIRA to see if the function can be optimized for this use case.

--Andries





> On May 17, 2016, at 5:54 PM, Joe Harris <jo...@gmail.com> wrote:
> 
> I'm querying parquet data which contains large numbers of empty maps/arrays that I'd like to eliminate from the query as early as possible. The events I'm querying are very sparse in the data.
> 
>> SELECT par.check.detections.list AS detections FROM `/my.parquet/year=2016/month=5` par LIMIT 10;
> 10 rows selected (3.157 seconds)
> +-----------------------------+
> | {"detections":{"list":[]}}  |
> +-----------------------------+
> 
> 
> I've tried to use the map directly but the query fails.
> 
>> SELECT par.check AS check 
>> FROM `/my.parquet/year=2016/month=5` par
>> WHERE par.check <> '{"detections":{"list":[]}}' ;
> Error in expression at index -1.  Error: Missing function implementation: [equal(MAP-REQUIRED, VARCHAR-REQUIRED)].  Full expression: --UNKNOWN EXPRESSION--..
> 
> 
> I get a similar result if I reference down to the empty array.
> 
>> SELECT par.check.detections.list AS detections 
>> FROM `/my.parquet/year=2016/month=5` par
>> WHERE par.check.detections.list <> '[]' ;
> Error in expression at index -1.  Error: Missing function implementation: [equal(MAP-REPEATED, VARCHAR-REQUIRED)].  Full expression: --UNKNOWN EXPRESSION--..
> 
> 
> I'm also unable to refer to an array element without using FLATTEN().
> 
>> SELECT par.check.detections.list[0]
>> FROM `/my.parquet/year=2016/month=5` par;
> Error: Unexpected RuntimeException: java.lang.NullPointerException (state=,code=0)
> 
> 
> When I use FLATTEN() it automatically eliminates empty arrays and query time goes way up. It seems to be doing quite a bit of work on each empty array before eliminating it.
> 
>> SELECT  FLATTEN(par.check.detections.list) AS detections
>> FROM   `/my.parquet/year=2016/month=5` par
>> LIMIT 10;
> No rows selected (290.475 seconds)
> 
> 
> I can see in the docs that map and array are internal only data types that aren't exposed to the user (https://drill.apache.org/docs/supported-data-types/#composite-types). Is this on the roadmap somewhere to be added?
> 
> 
> 
> 
> Thanks,
> Joe
> 
>