You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@doris.apache.org by 杨文波 <ya...@163.com> on 2020/09/27 02:36:38 UTC

[Proposal] Support approximate TopN function for Doris

Is your feature request related to a problem? Please describe.
It is a common scenario for a database to calculate top-n from a dataset, sort a column and return the Top elements. TopN could uses an approximation algorithms to quickly return results with little overhead.

There are many algorithms to quickly calculate frequent items from the data set, among which the SpaceSaving algorithm has a good performance for reference [1]. Kylin and PostgresSQL have implemented TopN approximation algorithms based on this algorithm and have performed well. We can implement this algorithm in Doris to compute TopN quickly.

Describe the solution you'd like
Kylin's current implementation computes the topN of a measure column based on dimensional columns, such like group by dimension_cols order by measure_column limit 10 (refer https://kylin.apache.org/blog/2016/03/19/approximate-topn-measure/) while PostgresSQL uses the topN function to computes the frequent entries of a column (refer https://github.com/citusdata/postgresql-topn). 

For the current Doris, the lack of UDTF support makes it difficult to calculate topN on dimension columns and measure column like Kylin. So it is more appropriate to calculate topN using an aggregate function like PostgresSQL, and present the results in JSON format.

eg:

CREATE TABLE popular_ad
(
  event_date date NOT NULL ,
  ad_id BIGINT NOY NULL
);


For example, we can calculate the ad_id that is displayed or clicked most per day.

select event_date, topn(ad_id, 1)  as top_1 from popular_ad group by event_date;

               event_date         |      top_1
----------------------------------+--------------------
2020-01-01                        |  {"xxxx":"10000"}
2020-01-02                        |  {"xxx":"11111"}


The result represents the top1 item and its corresponding frequency.

In the future, with Doris supporting UDTF, we can display item and frequency based on this function, or topN for the SUM aggregate type column group by key columns

Additional context
[1] Efficient Computation of Frequent and Top-k Elements in Data Streams by Metwally, Agrawal, and Abbadi.

Re:[Proposal] Support approximate TopN function for Doris

Posted by 陈明雨 <mo...@163.com>.

Hi Wenbo:


Nice proposal! Waiting for your PR!.
I have a question that the TopN result is shown as Json?




--

此致！Best Regards
陈明雨 Mingyu Chen

Email:
chenmingyu@apache.org





At 2020-09-27 10:36:38, "杨文波" <ya...@163.com> wrote:
>Is your feature request related to a problem? Please describe.
>It is a common scenario for a database to calculate top-n from a dataset, sort a column and return the Top elements. TopN could uses an approximation algorithms to quickly return results with little overhead.
>
>There are many algorithms to quickly calculate frequent items from the data set, among which the SpaceSaving algorithm has a good performance for reference [1]. Kylin and PostgresSQL have implemented TopN approximation algorithms based on this algorithm and have performed well. We can implement this algorithm in Doris to compute TopN quickly.
>
>Describe the solution you'd like
>Kylin's current implementation computes the topN of a measure column based on dimensional columns, such like group by dimension_cols order by measure_column limit 10 (refer https://kylin.apache.org/blog/2016/03/19/approximate-topn-measure/) while PostgresSQL uses the topN function to computes the frequent entries of a column (refer https://github.com/citusdata/postgresql-topn). 
>
>For the current Doris, the lack of UDTF support makes it difficult to calculate topN on dimension columns and measure column like Kylin. So it is more appropriate to calculate topN using an aggregate function like PostgresSQL, and present the results in JSON format.
>
>eg:
>
>CREATE TABLE popular_ad
>(
>  event_date date NOT NULL ,
>  ad_id BIGINT NOY NULL
>);
>
>
>For example, we can calculate the ad_id that is displayed or clicked most per day.
>
>select event_date, topn(ad_id, 1)  as top_1 from popular_ad group by event_date;
>
>               event_date         |      top_1
>----------------------------------+--------------------
>2020-01-01                        |  {"xxxx":"10000"}
>2020-01-02                        |  {"xxx":"11111"}
>
>
>The result represents the top1 item and its corresponding frequency.
>
>In the future, with Doris supporting UDTF, we can display item and frequency based on this function, or topN for the SUM aggregate type column group by key columns
>
>Additional context
>[1] Efficient Computation of Frequent and Top-k Elements in Data Streams by Metwally, Agrawal, and Abbadi.