Posted to dev@kylin.apache.org by Xiaoxiang Yu <xx...@apache.org> on 2020/12/15 14:24:00 UTC

[Discuss] Support Cube Planner Phase One for Kylin 4

Hello, Kylin users,
Here is my proposal for implementing Cube Planner phase one in Kylin 4; the wiki page is at https://cwiki.apache.org/confluence/display/KYLIN/KIP-3+Support+Cube+Planner+Phase+One+for+Kylin+4 . If you have any suggestions, please let me know. Thank you.

KIP-3 Support Cube Planner Phase One for Kylin 4

Q1. What are you trying to do? Articulate your objectives using absolutely no jargon.
Q2. What problem is this proposal NOT designed to solve?
Q3. How is it done today, and what are the limits of current practice?
Q4. What is new in your approach and why do you think it will be successful?
Q5. Who cares? If you are successful, what difference will it make?
Q6. What are the risks?
Q7. How long will it take?
Q8. How it works?
Reference

Q1. What are you trying to do? Articulate your objectives using absolutely no jargon.

In Apache Kylin 4, the Kylin team has implemented a new build engine and a new query engine to provide better performance; please refer to KIP-1: Parquet storage if you are interested. However, the current cuboid pruning tool (Cube Planner) is not compatible with the new build engine, so I want to make the new build engine support Cube Planner.

Q2. What problem is this proposal NOT designed to solve?

I am not going to support Cube Planner phase 2 at the moment, because phase 2 depends on some metrics in CubeVisitService.java (aggRowCount & totalRowCount) to infer the row count of unbuilt/new cuboids. HBase storage is removed in Kylin 4, so we have to find another way to infer the row count of unbuilt/new cuboids. Besides, System Cube (or the metrics system) needs to be refactored, and the metrics in METRICS_QUERY_RPC are deprecated because the storage has changed (we don't have HBase's region servers any more).

Q3. How is it done today, and what are the limits of current practice?
It is almost done in my patch; please review it at https://github.com/apache/kylin/pull/1485 .
Adding a new step to calculate each cuboid's HyperLogLog does degrade build performance slightly, but it looks acceptable to me.
Q4. What is new in your approach and why do you think it will be successful?
It is not a new approach; the main logic of the newly added code looks like the original one in FactDistinctColumnsMapper.java .
We know that Cube Planner phase 1 depends on the row count of each cuboid to calculate BPUS (benefit per unit space). By introducing a new step that calculates a HyperLogLog for each candidate cuboid, we can enable Cube Planner phase 1 now.
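To illustrate why the per-cuboid row counts matter, here is a minimal sketch of a greedy selection driven by BPUS. It is a simplified illustration and not Kylin's actual code: the function name greedy_select, the bitmask encoding of cuboids, and the space_budget parameter are all assumptions made for this example, and the cost model (query cost equals the row count of the cheapest selected ancestor) leaves out any weighting a real planner may apply.

```python
def greedy_select(row_counts, base, space_budget):
    """Pick cuboids greedily by benefit per unit space (BPUS).

    row_counts:   {cuboid bitmask: estimated row count} (hypothetical input)
    base:         bitmask of the base cuboid, which is always kept
    space_budget: total number of rows we are willing to store (assumption)
    """
    selected = {base}
    used = row_counts[base]
    # Cost of answering a query on cuboid q = rows of its cheapest selected ancestor.
    cost = {q: row_counts[base] for q in row_counts}

    while True:
        best, best_bpus = None, 0.0
        for c, rows in row_counts.items():
            if c in selected or used + rows > space_budget:
                continue
            # c can answer every cuboid q whose dimensions are a subset of c's.
            benefit = sum(max(0, cost[q] - rows)
                          for q in row_counts if q & c == q)
            bpus = benefit / max(rows, 1)
            if bpus > best_bpus:
                best, best_bpus = c, bpus
        if best is None:  # nothing fits the budget or nothing helps any more
            return selected
        selected.add(best)
        used += row_counts[best]
        for q in row_counts:
            if q & best == q:
                cost[q] = min(cost[q], row_counts[best])


# Three dimensions A, B, C encoded as bits 0b100, 0b010, 0b001 (made-up numbers).
example_rows = {
    0b111: 1000, 0b110: 900, 0b101: 200, 0b011: 500,
    0b100: 50, 0b010: 400, 0b001: 20, 0b000: 1,
}
chosen = greedy_select(example_rows, base=0b111, space_budget=1300)
```

The row counts fed into such a selection are exactly what the new HyperLogLog step is meant to provide.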
Q5. Who cares? If you are successful, what difference will it make?

After this task is done, Kylin 4 will support Cube Planner phase 1, which makes cuboid pruning much easier than in the current state (where it is not supported).

Q6. What are the risks?

So far so good.

Q7. How long will it take?

I have spent about three weeks reading the original source code, writing my code, and testing it. It is almost done.

Q8. How it works?
Use Spark to calculate each cuboid's HLLCounter for the first segment and persist it to HDFS.
Re-enable Cube Planner by default, but without support for Cube Planner phase two.
Do not merge cuboid statistics (HLLCounter) when merging segments.
By default, only calculate cuboid statistics for the FIRST segment. (Statistics for later segments are not necessary because phase two is not supported.)
The HLLCounter for cuboid statistics uses precision 14.
Calculate cuboid statistics from 100% of the input flat table data. (Maybe sample the input RDD in the future.)
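
As a rough illustration of the counting step, the sketch below estimates the row count of each cuboid by feeding the cuboid's dimension values from every flat table row into a HyperLogLog counter. It is plain Python, not the Spark job or Kylin's HLLCounter class; the names HLL and cuboid_row_counts, the SHA-1 based hash, and the tuple-of-column-indexes encoding of a cuboid are assumptions made for this example. With precision 14 a counter has 2^14 = 16384 registers, and the usual HyperLogLog relative standard error of about 1.04 / sqrt(16384) works out to roughly 0.8%.

```python
import hashlib
import math


class HLL:
    """Minimal HyperLogLog with 2**p registers (illustrative only)."""

    def __init__(self, p=14):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, value: bytes) -> None:
        # Take 64 bits of a SHA-1 digest as the hash (an arbitrary choice here).
        h = int.from_bytes(hashlib.sha1(value).digest()[:8], "big")
        idx = h >> (64 - self.p)                  # first p bits pick the register
        rest = h & ((1 << (64 - self.p)) - 1)     # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1  # position of leftmost 1-bit
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def estimate(self) -> float:
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)          # bias constant, valid for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            return m * math.log(m / zeros)        # linear counting for small counts
        return raw


def cuboid_row_counts(flat_rows, cuboids, p=14):
    """Estimate distinct dimension combinations (row count) per cuboid.

    flat_rows: iterable of tuples, one per flat table row
    cuboids:   list of tuples of column indexes (hypothetical encoding)
    """
    counters = {c: HLL(p) for c in cuboids}
    for row in flat_rows:
        for c, hll in counters.items():
            key = "\x00".join(str(row[i]) for i in c)
            hll.add(key.encode("utf-8"))
    return {c: round(h.estimate()) for c, h in counters.items()}


# Made-up flat table with three dimension columns.
flat = [(i % 7, i % 300, i % 2000) for i in range(6000)]
stats = cuboid_row_counts(flat, [(1,), (2,), (1, 2)])
```

Because every row is hashed once per candidate cuboid, the cost of this step grows with the number of cuboids, which is consistent with the slight build slowdown mentioned in Q3.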
Reference
https://github.com/apache/kylin/pull/1485

Best wishes to you!
From: Xiaoxiang Yu

Re: [Discuss] Support Cube Planner Phase One for Kylin 4

Posted by ShaoFeng Shi <sh...@apache.org>.

Hi Xiaoxiang,

I like Cube Planner, so I also like this proposal. I want to confirm the
following:

1) The Cube Planner phase two optimization will be re-designed/implemented
based on the new query/storage engine later, right? If so, what's the
plan?
2) As the phase two optimization needs the cube statistics for all
segments, "only calculate cuboid statistics for the FIRST segment" may not
fulfill that. If we want users to be able to upgrade smoothly to the new
version without rebuilding their cubes, you'd better calculate the
statistics for all segments from the very beginning of 4.0.
3) For the cube statistics files, in the past Kylin persisted them in the
metastore (HBase or MySQL). As time goes on, they take up a lot of space in
the metadata, which causes many stability issues (e.g., metadata
backup/restore). In the new version, moving them out of the metadata to DFS
will make Kylin more stable.

Just my two cents, thanks for your effort!

Best regards,

Shaofeng Shi 史少锋
Apache Kylin PMC
Email: shaofengshi@apache.org

Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
Join Kylin user mail group: user-subscribe@kylin.apache.org
Join Kylin dev mail group: dev-subscribe@kylin.apache.org



