Posted to issues@hive.apache.org by "Ruslan Dautkhanov (JIRA)" <ji...@apache.org> on 2016/05/12 17:53:13 UTC
[jira] [Commented] (HIVE-13019) Optimizer COLLECT_LIST/COLLECT_SET
[ https://issues.apache.org/jira/browse/HIVE-13019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15281836#comment-15281836 ]
Ruslan Dautkhanov commented on HIVE-13019:
------------------------------------------
[~gopalv], HIVE-13076 was resolved yesterday, so Hive now has FK/PK constraints. Would it be possible to go ahead with this optimization?
Thanks.
> Optimizer COLLECT_LIST/COLLECT_SET
> -----------------------------------
>
> Key: HIVE-13019
> URL: https://issues.apache.org/jira/browse/HIVE-13019
> Project: Hive
> Issue Type: Improvement
> Components: CBO, Logical Optimizer
> Reporter: Dustin Cote
> Priority: Minor
>
> Currently, when a COLLECT_SET/COLLECT_LIST involves data from only a single table, the aggregation is still performed after any JOIN operation present in the query. For example:
> {code}
> insert into table nested_customers_orders
> select c.*, collect_list(named_struct("oid", o.oid, "order_date", o.date...))
> from customers c inner join orders o on (c.cid = o.oid)
> group by o.oid, o.date,...
> {code}
> If the optimizer could perform the COLLECT_LIST before the JOIN (where possible), queries following this pattern should see some performance gains.
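> As a rough sketch of what "COLLECT_LIST first" could look like when written by hand, the aggregation can be moved into a subquery over orders and joined to customers afterwards. This is only an illustration under assumptions not stated above: orders is assumed to carry a customer key column (named cid here, hypothetical), the date column is called order_date here, and customers.cid is assumed to be unique, so the join cannot duplicate order rows. The target insert and the elided columns are left out.
> {code}
> -- Assumed columns: orders(oid, cid, order_date), customers(cid, ...), with customers.cid unique.
> select c.*, o_agg.orders
> from customers c
> inner join (
>   -- Aggregate on the single table first, producing one row per customer key.
>   select o.cid,
>          collect_list(named_struct("oid", o.oid, "order_date", o.order_date)) as orders
>   from orders o
>   group by o.cid
> ) o_agg on (c.cid = o_agg.cid)
> {code}
> With the aggregation below the join, the join sees one row per customer key from the orders side rather than one row per order, which is where the gain would presumably come from. The rewrite is only equivalent when the customers side has at most one row per key, which is the kind of guarantee a primary key declaration would give the optimizer.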
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)