Posted to user@kylin.apache.org by 寒香 <10...@qq.com> on 2020/05/15 05:17:41 UTC

Question

Hello everyone,
We have a business requirement: from a large volume of data, select sub-datasets that satisfy multiple rules at the same time. Different scenarios come with different and fairly complex sets of rules. For example, the share of any single city in the data source cannot exceed 15% (this 15% can be adjusted on demand by users), the share of various calculated business values cannot exceed a specific value, and so on. Can this requirement be solved with Apache Kylin? If so, what approach should we take, and is there any reference material or demo? Does it need to be done with other tools? Thanks a lot.
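
To make the example rule concrete, here is a minimal Python sketch of the single-city cap, assuming the data can be reduced to one city label per row. The function names `city_shares` and `cities_over_cap` are hypothetical and only for illustration; this is not a Kylin API.

```python
from collections import Counter


def city_shares(cities):
    """Return each city's share of the total row count."""
    counts = Counter(cities)
    total = sum(counts.values())
    return {city: n / total for city, n in counts.items()}


def cities_over_cap(cities, cap=0.15):
    """List the cities whose share exceeds the cap (15% by default, adjustable)."""
    shares = city_shares(cities)
    return sorted(city for city, share in shares.items() if share > cap)
```

A dataset passes the rule when `cities_over_cap(...)` returns an empty list; the `cap` argument stands in for the adjustable threshold.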

Re: Question

Posted by 寒香 <10...@qq.com>.
I do not think this requirement has much to do with threshold queries. What I really want to know is how to match multiple rules at the same time.



Re: Question

Posted by big data <bi...@outlook.com>.
This may be a kind of threshold query. Searching for that term should turn up more information.


Question

Posted by 寒香 <10...@qq.com>.
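Shaofeng Shi 史少锋:
    Hello, and thank you for your earlier reply.
    We have not designed a table schema for this requirement yet, but I can illustrate it with an example. Suppose a table father has these fields: primary key key, city city, amount amt, and business fields a, b, and c, and suppose it holds 100 million rows. The goal is to find a sub-dataset child that satisfies all of the following rules at the same time:
    1) The sub-dataset contains at least 10% of all rows (select count(1) from child / select count(1) from father ≥ 10%);
    2) Within the sub-dataset, the province that each city belongs to must account for no less than 5% (select sum(province(city)) from child group by province(city) / select sum(province(city)) from father group by province(city) ≥ 5%; the ratio is taken between the same province, every province in the result set must satisfy it, and province(city) is a UDF that returns the province for a given city), and each city must account for no less than 1% (select sum(city) from child group by city / select sum(city) from father group by city; the ratio is taken between the same city, and every city in the result set must satisfy it);
    3) The total amount over the sub-dataset is within 10 billion ±10%, that is, between 9 billion and 11 billion (9 billion ≤ select sum(amt) from child ≤ 11 billion);
    4) Each business parameter of every record is at least 20% (select a / (a+b+c) from child ≥ 20%, select b / (a+b+c) from child ≥ 20%, select c / (a+b+c) from child ≥ 20%). This rule only shows that we need values computed within a single record; the exact meaning of a, b, and c is not important.

    The percentage thresholds above can be changed on demand, and each scenario must satisfy one or more specific rules. Can this be solved with Apache Kylin? If so, what approach should we take, and is there any reference material or example? Does it need to be done with other tools? Thanks.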
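
The four rules can be written down as a checker for one candidate child. The following is only a minimal Python sketch under stated assumptions, not a Kylin feature and not a way to find such a subset: rule 2 is read as "row count of a group in child divided by the row count of the same group in father", the province(city) UDF is replaced by a plain dictionary `city_to_province`, and the names `Row` and `check_rules` are hypothetical.

```python
from collections import Counter
from typing import NamedTuple


class Row(NamedTuple):
    key: int
    city: str
    amt: float
    a: float
    b: float
    c: float


def check_rules(father, child, city_to_province,
                min_row_share=0.10, min_province_share=0.05,
                min_city_share=0.01, amt_range=(9e9, 11e9),
                min_field_share=0.20):
    """Return {rule name: bool} for a candidate subset `child` of `father`."""
    # Rule 1: child holds at least min_row_share of all rows.
    rows_ok = bool(father) and len(child) / len(father) >= min_row_share

    # Rule 2 (assumed reading): for every group present in child, its row
    # count in child divided by its row count in father meets the threshold.
    def group_ok(key_of, threshold):
        father_counts = Counter(key_of(r) for r in father)
        child_counts = Counter(key_of(r) for r in child)
        return all(father_counts[g] > 0 and n / father_counts[g] >= threshold
                   for g, n in child_counts.items())

    province_ok = group_ok(lambda r: city_to_province[r.city], min_province_share)
    city_ok = group_ok(lambda r: r.city, min_city_share)

    # Rule 3: total amount of child lies inside the allowed range.
    low, high = amt_range
    amount_ok = low <= sum(r.amt for r in child) <= high

    # Rule 4: within every record, a, b and c each reach the minimum share.
    def fields_ok(r):
        total = r.a + r.b + r.c
        return total > 0 and all(x / total >= min_field_share
                                 for x in (r.a, r.b, r.c))

    return {
        "row_share": rows_ok,
        "province_share": province_ok,
        "city_share": city_ok,
        "amount_sum": amount_ok,
        "field_shares": all(fields_ok(r) for r in child),
    }
```

All thresholds are keyword arguments because the rules are meant to be adjustable per scenario. Choosing which rows go into child so that every check passes is a separate search problem that this sketch does not address.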





Re: Question

Posted by ShaoFeng Shi <sh...@apache.org>.
Hi Xiang,

I'm not sure whether Kylin can help. Can Hive/Spark SQL fulfill the
requirement? If you can provide a couple of SQL queries, that would help us
see whether Kylin can help.

Best regards,

Shaofeng Shi 史少锋
Apache Kylin PMC
Email: shaofengshi@apache.org

Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
Join Kylin user mail group: user-subscribe@kylin.apache.org
Join Kylin dev mail group: dev-subscribe@kylin.apache.org



