You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@drill.apache.org by "Parth Chandra (JIRA)" <ji...@apache.org> on 2015/03/24 23:36:53 UTC

[jira] [Updated] (DRILL-1716) Nested Data : There should be an easy way to apply aggregate functions on repeated types

     [ https://issues.apache.org/jira/browse/DRILL-1716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Parth Chandra updated DRILL-1716:
---------------------------------
    Fix Version/s:     (was: 0.9.0)
                   Future

> Nested Data : There should be an easy way to apply aggregate functions on repeated types
> ----------------------------------------------------------------------------------------
>
>                 Key: DRILL-1716
>                 URL: https://issues.apache.org/jira/browse/DRILL-1716
>             Project: Apache Drill
>          Issue Type: New Feature
>          Components: Functions - Drill, Storage - JSON
>            Reporter: Rahul Challapalli
>             Fix For: Future
>
>
> {code}
> {
>     "company_id": 1,
>     "evnts": [
>         {
>             "evnt_id": 999,
>             "evnt_duration": 60
>         },
>         {
>             "evnt_id": 998,
>             "evnt_duration": 30
>         },
>         {
>             "evnt_id": 997,
>             "evnt_duration": 45
>         }
>     ]
> }
> {code}
> For the above dataset, if I want to find the longest duration for each company id, below is how I would do it now
> {code}
> select sub.company_id , max(sub.evnt.evnt_duration) max_duration 
> from ( 
>     select company_id, flatten(evnts) evnt from `nested.json`
> ) sub 
> group by sub.company_id;
> +------------+--------------+
> | company_id | max_duration |
> +------------+--------------+
> | 1          | 60           |
> +------------+--------------+
> {code}
> Now if I want the evnt_id associated with the longest duration then we need one more join
> {code}
> select a.company_id, b.evnt.evnt_id 
> from (
>     select sub.company_id company_id, max(sub.evnt.evnt_duration) max_duration 
>     from ( 
>         select company_id, flatten(evnts) evnt from `nested.json`
>     ) sub 
>     group by sub.company_id
> ) a 
> join 
> (
>     select flatten(evnts) evnt from `nested.json`
> ) b 
> on a.max_duration = b.evnt.evnt_duration;
> {code}
> The above query currently fails (DRILL-1649). But that is how we have to do it currently.
> It would be much simpler if I can do something like the below
> {code}
> select company_id, nested_agg('evnts', 'max', 'evnt_duration','evnt_id') as evnt_id from `nested.json`;
> {code}
> Apart from making the query much simpler to write this might enhance drill's performance as well.
> Thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)