You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@calcite.apache.org by Julian Hyde <jh...@apache.org> on 2017/07/06 20:44:49 UTC

Re: Questions about MonteCarloAlgorithm to compute tiles of Lattice

> On Jun 28, 2017, at 10:58 PM, weijie tong <to...@gmail.com> wrote:
> 
> HI all:
>   anyone can explain  the detail of the MonteCarlo algorithm to compute
> the tiles of a Lattice?
>    It seems that  MonteCarlo algorithm will simulate every possible query
> of all kind of  AggregateImpls ,and will choose the lowest cost's ( cost
> model determined by the estimateCost() method of LatticeImpl )
> AggregateImpl to be the titles.

ExhaustiveLatticeAlgorithm will try every possible query (2^n if there are n attributes), whereas MonteCarloAlgorithm tries a set of random queries.

Both algorithms are greedy algorithms. Each iteration, they assume that a set of aggregates have been chosen, and choose the best aggregate to add to it by calling getBenefit (which, despite its name, is a cost-benefit ratio). Repeat until there are enough aggregates.

>    I also find that the cost benefits of the choose AggregateImpls don't
> play any role to the final output AggregateImpl.

If you’re referring to the list of CostBenefit objects created at the end of the algorithm; yes, they are just info to put on the screen and prove that the algorithm has done a great job.

But you’ll see that getBenefit is called in the inner loop.

>    please correct my opinion and show me the mathematical theory of the
> MonteCarlo algorithm to  choose the best  aggregates .

MonteCarloAlgorithm could be improved by taking into account historic queries, but I think it does a good job for the case where no previous queries are not known.

The biggest problem with the algorithm is the amount of time spent gathering statistics. My work on data profiling [1] [2] will speed up getBenefit hugely because it will be able to answer aggregate.estimateRowCount() without executing a query.

Julian

[1] https://issues.apache.org/jira/browse/CALCITE-1616 <https://issues.apache.org/jira/browse/CALCITE-1616>

[2] https://www.slideshare.net/julianhyde/data-profiling-with-apache-calcite <https://www.slideshare.net/julianhyde/data-profiling-with-apache-calcite>

Re: Questions about MonteCarloAlgorithm to compute tiles of Lattice

Posted by weijie tong <to...@gmail.com>.

additional remarks,
I mean if we do not follow the Star schema ( maybe two fact tables to join,
both with dimension columns) ,we only materialize the provided defined
tiles of a cube ( let algorithm option to be false), then a user's join
query will  correctly be transferred to the materialized cube's query ?

On Mon, Jul 10, 2017 at 7:22 PM, weijie tong <to...@gmail.com>
wrote:

> @Julian thanks for your reply.
> Another question is about `Star schema` requirement. Does this
> precondition only affect `Lattice.computeTiles()` method to choose the
> right dimension group to be the candidate Tile ?
>
> On Fri, Jul 7, 2017 at 4:44 AM, Julian Hyde <jh...@apache.org> wrote:
>
>>
>> > On Jun 28, 2017, at 10:58 PM, weijie tong <to...@gmail.com>
>> wrote:
>> >
>> > HI all:
>> >   anyone can explain  the detail of the MonteCarlo algorithm to compute
>> > the tiles of a Lattice?
>> >    It seems that  MonteCarlo algorithm will simulate every possible
>> query
>> > of all kind of  AggregateImpls ,and will choose the lowest cost's ( cost
>> > model determined by the estimateCost() method of LatticeImpl )
>> > AggregateImpl to be the titles.
>>
>> ExhaustiveLatticeAlgorithm will try every possible query (2^n if there
>> are n attributes), whereas MonteCarloAlgorithm tries a set of random
>> queries.
>>
>> Both algorithms are greedy algorithms. Each iteration, they assume that a
>> set of aggregates have been chosen, and choose the best aggregate to add to
>> it by calling getBenefit (which, despite its name, is a cost-benefit
>> ratio). Repeat until there are enough aggregates.
>>
>>
>> >    I also find that the cost benefits of the choose AggregateImpls don't
>> > play any role to the final output AggregateImpl.
>>
>> If you’re referring to the list of CostBenefit objects created at the end
>> of the algorithm; yes, they are just info to put on the screen and prove
>> that the algorithm has done a great job.
>>
>> But you’ll see that getBenefit is called in the inner loop.
>>
>>
>> >    please correct my opinion and show me the mathematical theory of the
>> > MonteCarlo algorithm to  choose the best  aggregates .
>>
>> MonteCarloAlgorithm could be improved by taking into account historic
>> queries, but I think it does a good job for the case where no previous
>> queries are not known.
>>
>> The biggest problem with the algorithm is the amount of time spent
>> gathering statistics. My work on data profiling [1] [2] will speed up
>> getBenefit hugely because it will be able to answer
>> aggregate.estimateRowCount() without executing a query.
>>
>> Julian
>>
>> [1] https://issues.apache.org/jira/browse/CALCITE-1616 <
>> https://issues.apache.org/jira/browse/CALCITE-1616>
>>
>> [2] https://www.slideshare.net/julianhyde/data-profiling-with-
>> apache-calcite <https://www.slideshare.net/ju
>> lianhyde/data-profiling-with-apache-calcite>
>>
>>
>

Re: Questions about MonteCarloAlgorithm to compute tiles of Lattice

Posted by weijie tong <to...@gmail.com>.

@Julian thanks for your reply.
Another question is about `Star schema` requirement. Does this precondition
only affect `Lattice.computeTiles()` method to choose the right dimension
group to be the candidate Tile ?

On Fri, Jul 7, 2017 at 4:44 AM, Julian Hyde <jh...@apache.org> wrote:

>
> > On Jun 28, 2017, at 10:58 PM, weijie tong <to...@gmail.com>
> wrote:
> >
> > HI all:
> >   anyone can explain  the detail of the MonteCarlo algorithm to compute
> > the tiles of a Lattice?
> >    It seems that  MonteCarlo algorithm will simulate every possible query
> > of all kind of  AggregateImpls ,and will choose the lowest cost's ( cost
> > model determined by the estimateCost() method of LatticeImpl )
> > AggregateImpl to be the titles.
>
> ExhaustiveLatticeAlgorithm will try every possible query (2^n if there are
> n attributes), whereas MonteCarloAlgorithm tries a set of random queries.
>
> Both algorithms are greedy algorithms. Each iteration, they assume that a
> set of aggregates have been chosen, and choose the best aggregate to add to
> it by calling getBenefit (which, despite its name, is a cost-benefit
> ratio). Repeat until there are enough aggregates.
>
>
> >    I also find that the cost benefits of the choose AggregateImpls don't
> > play any role to the final output AggregateImpl.
>
> If you’re referring to the list of CostBenefit objects created at the end
> of the algorithm; yes, they are just info to put on the screen and prove
> that the algorithm has done a great job.
>
> But you’ll see that getBenefit is called in the inner loop.
>
>
> >    please correct my opinion and show me the mathematical theory of the
> > MonteCarlo algorithm to  choose the best  aggregates .
>
> MonteCarloAlgorithm could be improved by taking into account historic
> queries, but I think it does a good job for the case where no previous
> queries are not known.
>
> The biggest problem with the algorithm is the amount of time spent
> gathering statistics. My work on data profiling [1] [2] will speed up
> getBenefit hugely because it will be able to answer
> aggregate.estimateRowCount() without executing a query.
>
> Julian
>
> [1] https://issues.apache.org/jira/browse/CALCITE-1616 <
> https://issues.apache.org/jira/browse/CALCITE-1616>
>
> [2] https://www.slideshare.net/julianhyde/data-profiling-
> with-apache-calcite <https://www.slideshare.net/julianhyde/data-profiling-
> with-apache-calcite>
>
>