You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@calcite.apache.org by Vladimir Ozerov <pp...@gmail.com> on 2021/02/26 12:29:16 UTC
Non-additive costs in heterogeneous engines

Hi,

Several products that utilize Apache Calcite for query optimization might
use multiple execution units to execute physical operators concurrently.
Think of a heterogeneous engine that might split execution between a CPU
and a co-processor (GPU, FGPA, etc), or just a multi-core machine in the
simplest case. In such systems, the cumulative cost of an operator is not
additive. That is, cost(A,B) != cost(A) + cost(B).

Consider a theoretical system that might execute operations on either CPU
or GPU. There are Scan and Join operators. For every operator, there are
two physical alternatives - execute on CPU or execute on GPU. We also have
a valid cost model that provides comparable costs for both CPU and GPU
backends. Both CPU and GPU could execute one operator at a time.

Now consider that we have the following logical plan:
LogicalJoin
  LogicalScan[a]
  LogicalScan[b]

We then expand the MEMO with physical alternatives (omitting some
alternatives for clarity):
Set#1 {
  Subset#1[CPU]: CpuJoin[Subset#3, Subset#5]
}
Set#2 {
  Subset#3[CPU]: {
    CpuScan[a], cost=[100] (best)
    GpuToCpu[Subset#4], cost=[170]
  }
  Subset#4[GPU]: {
    GpuScan[a], cost=[150]
  }
}
Set#3 {
  Subset#5[CPU]: {
    CpuScan[b], cost=[100] (best)
    GpuToCpu[Subset#6], cost=[170]
  }
  Subset#6[GPU]: {
    GpuScan[b], cost=[150]
  }
}

With the current model, Apache Calcite will only consider "best" rels from
each subset when constructing the final tree, which means that the
following tree would be the winner:
CpuJoin, cost=[200]
  CpuScan[a], cost=[100] (best from Subset#3)
  CpuScan[b], cost=[100] (best from Subset#5)

However, the better plan might be those, which utilizes the cross-device
parallelism:
CpuJoin, cost=MAX[100, 170]
  CpuScan[a], cost=[100] (best from Subset#3)
  GpuToCpu[b], cost=[170] (*not best* from Subset#5)
    GpuScan[b], cost=[150]

It seems that to support this kind of optimization with non-additive costs,
we need to switch from cost calculation on the operator level (local) to a
cost calculation that considers alternative paths for all subtrees
(global). Obviously, this could lead to a combinatorial explosion easily.
On the other hand, the new top-down optimizer with a more predictable rule
invocation order might significantly amortize the additional complexity.

I want to ask whether similar ideas were discussed in the community before?
Are you aware of any practical systems that do such kind of optimization?
Or any fundamental results that prove this idea to be infeasible?
I would appreciate any hints on the matter.

Regards,
Vladimir.