You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@drill.apache.org by "Parth Chandra (JIRA)" <ji...@apache.org> on 2015/03/24 23:36:53 UTC
[jira] [Updated] (DRILL-1716) Nested Data : There should be an easy
way to apply aggregate functions on repeated types
[ https://issues.apache.org/jira/browse/DRILL-1716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Parth Chandra updated DRILL-1716:
---------------------------------
Fix Version/s: (was: 0.9.0)
Future
> Nested Data : There should be an easy way to apply aggregate functions on repeated types
> ----------------------------------------------------------------------------------------
>
> Key: DRILL-1716
> URL: https://issues.apache.org/jira/browse/DRILL-1716
> Project: Apache Drill
> Issue Type: New Feature
> Components: Functions - Drill, Storage - JSON
> Reporter: Rahul Challapalli
> Fix For: Future
>
>
> {code}
> {
> "company_id": 1,
> "evnts": [
> {
> "evnt_id": 999,
> "evnt_duration": 60
> },
> {
> "evnt_id": 998,
> "evnt_duration": 30
> },
> {
> "evnt_id": 997,
> "evnt_duration": 45
> }
> ]
> }
> {code}
> For the above dataset, if I want to find the longest duration for each company id, below is how I would do it now
> {code}
> select sub.company_id , max(sub.evnt.evnt_duration) max_duration
> from (
> select company_id, flatten(evnts) evnt from `nested.json`
> ) sub
> group by sub.company_id;
> +------------+--------------+
> | company_id | max_duration |
> +------------+--------------+
> | 1 | 60 |
> +------------+--------------+
> {code}
> Now if I want the evnt_id associated with the longest duration then we need one more join
> {code}
> select a.company_id, b.evnt.evnt_id
> from (
> select sub.company_id company_id, max(sub.evnt.evnt_duration) max_duration
> from (
> select company_id, flatten(evnts) evnt from `nested.json`
> ) sub
> group by sub.company_id
> ) a
> join
> (
> select flatten(evnts) evnt from `nested.json`
> ) b
> on a.max_duration = b.evnt.evnt_duration;
> {code}
> The above query currently fails (DRILL-1649). But that is how we have to do it currently.
> It would be much simpler if I can do something like the below
> {code}
> select company_id, nested_agg('evnts', 'max', 'evnt_duration','evnt_id') as evnt_id from `nested.json`;
> {code}
> Apart from making the query much simpler to write this might enhance drill's performance as well.
> Thoughts?
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)