Posted to user@kylin.apache.org by 毛洪玥 <ma...@yidian-inc.com> on 2020/01/21 07:21:58 UTC

Questions About Force Hit Cube or Hybrid Feature

Hi all,

   Recently we went live with the "Force Hit Cube or Hybrid" feature, both back end and front end, based on issue KYLIN-4312, which was solved by @Xiaoxiang Yu. According to the plan, it will be available in the next release. We have some questions, listed below:

[Background]
   After the patch is applied, the Kylin web UI looks like pic1 (with a drop-down box on the "Insight" page that lets users choose the cube for their query):

   There are two main use cases for this feature in our company:
       1. Force the choice of the cheapest cube. In our team, we choose to build several smaller cubes rather than a SINGLE larger cube, to reduce build duration and cube storage. For example, we build three small cubes: the first with three dimensions "ABC", the second with three dimensions "ADE", and the third with five dimensions "ADHGF", rather than one bigger cube with eight dimensions "ABCDEHGF". Because the cuboid "ABCDEHGF" is removed, our design should in theory reduce total storage a lot (although this depends on the specific usage scenario). However, this design raises a new problem. UserA creates and builds Cube1 (with three dimensions A,B,C) from 2020.01.07 to now, and UserB creates and builds Cube2 (with five dimensions A,D,H,G,F) from 2020.01.05 to now. When UserB runs "select A,count(*) from db.table group by A;", the query hits Cube1 because it has fewer dimensions/measures, so the results from 01.05 to 01.07 disappear. To fix this problem, we have to force Cube2 to answer this query.
       2. Testing and debugging. We usually clone a new cube from an existing one, make some changes (perhaps adding new configuration), and then build some new segments to test the newly added feature. But this causes a cube conflict when both cubes become READY, which leads to wrong online results (and may mislead the QA team).
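
The routing problem in use case 1 can be illustrated with a small toy model. This is only a sketch under assumptions: the `Cube` class and `route_cheapest` function are hypothetical names, and "cost" is reduced to the number of dimensions, which is much simpler than Kylin's real routing logic.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Cube:
    name: str
    dimensions: frozenset  # dimension names contained in the cube
    start: date            # first day covered by built segments


def route_cheapest(cubes, query_dims):
    """Toy router: among cubes containing all query dimensions,
    pick the one with the fewest dimensions (assumed to be cheapest)."""
    candidates = [c for c in cubes if query_dims <= c.dimensions]
    return min(candidates, key=lambda c: len(c.dimensions))


cube1 = Cube("Cube1", frozenset("ABC"), date(2020, 1, 7))
cube2 = Cube("Cube2", frozenset("ADHGF"), date(2020, 1, 5))

# "select A, count(*) ... group by A" only needs dimension A.
chosen = route_cheapest([cube1, cube2], frozenset("A"))
print(chosen.name, chosen.start)  # Cube1 2020-01-07
```

In this model the router never looks at the date range, so it picks Cube1 and the days before 2020.01.07 are missing from the result. That is why we force Cube2 for this query.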

[Questions]
        1. Will the design we chose in use case 1 cause other problems we have not thought of? For example, will building several smaller cubes take longer and cost more YARN resources than building a single larger cube?
        2. For online testing, is there a better solution?
        3. When a cube is chosen forcibly in this way, we can no longer use Kylin's automatic cube routing strategy, which finds the most suitable cube for a query automatically. For use case 1, if we have Cube1 (with three dimensions A,B,C) and Cube2 (with five dimensions A,D,H,G,F) with the same segments, both Cube1 and Cube2 can answer a specific query such as "select A,count(*) from db.table where date='2020.01.08' group by A". Cube2 will be chosen because we force-hit it, but Cube1 has fewer dimensions/measures and may also have the exact-match cuboid for this query, so we would prefer Cube1 for a faster result rather than the cube we forced. Is there a better way for us to find the cheaper cube that still gives the right query result?
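
To make question 3 more concrete, here is a rough sketch of the behaviour we would like: pick the cheapest cube, but only among cubes whose built range covers the dates the query asks for. Everything here is an assumption for illustration: `Cube`, `pick_cube`, the end date, and the use of dimension count as cost are hypothetical and are not Kylin APIs or Kylin's actual strategy.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Cube:
    name: str
    dimensions: frozenset  # dimension names contained in the cube
    start: date            # first day covered by built segments
    end: date              # last day covered by built segments


def pick_cube(cubes, query_dims, q_start, q_end):
    """Return the cube with the fewest dimensions among those that contain
    the query dimensions AND cover [q_start, q_end]; None if none qualifies."""
    candidates = [
        c for c in cubes
        if query_dims <= c.dimensions and c.start <= q_start and q_end <= c.end
    ]
    return min(candidates, key=lambda c: len(c.dimensions), default=None)


# The end date is assumed; the post only says "to now".
cube1 = Cube("Cube1", frozenset("ABC"), date(2020, 1, 7), date(2020, 1, 21))
cube2 = Cube("Cube2", frozenset("ADHGF"), date(2020, 1, 5), date(2020, 1, 21))
cubes = [cube1, cube2]

# date = '2020.01.08': both cubes cover it, so the smaller Cube1 wins.
one_day = pick_cube(cubes, frozenset("A"), date(2020, 1, 8), date(2020, 1, 8))

# 2020.01.05 to 2020.01.08: only Cube2 covers the whole range.
wide = pick_cube(cubes, frozenset("A"), date(2020, 1, 5), date(2020, 1, 8))

print(one_day.name, wide.name)  # Cube1 Cube2
```

This only works when the query has an explicit date filter. A query without one, like the example in use case 1, would still need some other rule.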

Re: Questions About Force Hit Cube or Hybrid Feature

Posted by Xiaoxiang Yu <xx...@apache.org>.

Hi hongyue,
Thank you for sharing the experience from your use case. I am glad to hear that our effort and collaboration solved the problem to some extent.
Question 3 is really interesting, but finding a truly smart solution may be difficult (and error-prone). I hope someone comes up with a better idea in the future.

Best wishes to you!
From: Xiaoxiang Yu
