You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@calcite.apache.org by Vladimir Sitnikov <si...@gmail.com> on 2014/06/20 21:49:39 UTC

SUM0 vs SUM

Hi,

As you might know, SUM aggregate returns null when aggregating the empty
set.
Optiq replaces SUM(x) with "case when COUNT(x)>0 then SUM0(x) else NULL
end" where SUM0 is guaranteed to have non-nullable type (returns 0 in case
of empty set).

I wonder why is that made and if we can drop that SUM0.

-- 
Regards,
Vladimir Sitnikov

Re: SUM0 vs SUM

Posted by Vladimir Sitnikov <si...@gmail.com>.
Here's one more example how SUM0 complicates things for enumerable
convention: https://github.com/julianhyde/optiq/pull/303 (see generated
code for testWinAgg)

> (b) pushing the process down to another engine, perhaps SQL
The standard SQL has no such thing as SUM0. Try pushing it down to Oracle
DB and you are done.

>(c) computing rolling sum, by adding to a total as rows enter a window and
subtracting from a total as rows leave.
Optiq is doing that already (except subtracting from a total as rows leave
part), and SUM0 just complicates things here.
I understand Optiq is not the only runtime, however I still believe just
having a single SUM is easier.

> perhaps another language such as MongoDB
Do you mean like this one
https://twitter.com/julianhyde/status/461967033115373569?
Does MongoDB support SUM0 better than SUM?

Timothy wrote as follows (
https://groups.google.com/d/msg/optiq-dev/_UMt5UjhgAs/3LlU9wWexRMJ ):
> I'm not sure why Optiq wants to replace the agg call with a $SUM0? This
is causing Drill to not able to find the SUM function call.

>until we know whether we prefer the SUM form?
We might try use the same approach as for AVG expansion (expand SUM to SUM0
if the implementation raises "unable to implement SUM, please use SUM0"),
however SUM0 support is fragile and it is easy to mess with data type
nullability (see RexBuidler.makeOver)

That said, I suggest dodging SUM0. If we do so during repackage, we can
just delete all the SUM0 references.
Otherwise we can just avoid generation of SUM0 and kill the references in
the next release.

​Vladimir

Re: SUM0 vs SUM

Posted by Vladimir Sitnikov <si...@gmail.com>.
>For instance, if we’re pushing down to another SQL engine, we’d rather
push SUM than SUM0.
Exactly!

Vladimir

Re: SUM0 vs SUM

Posted by Julian Hyde <ju...@hydromatic.net>.
Enumerable is not the only way people might implement aggregate functions. Some others are (a) rolling up partial aggregates, (b) pushing the process down to another engine, perhaps SQL, perhaps another language such as MongoDB, (c) computing rolling sum, by adding to a total as rows enter a window and subtracting from a total as rows leave.

SUM0 simplifies the specification of SUM and makes it easier to implement in a variety of engines. It takes NULL values out of the equation.

I concede that the expansion sometimes makes things awkward.  For instance, if we’re pushing down to another SQL engine, we’d rather push SUM than SUM0.

Could we defer the expansion of “SUM" to "CASE … SUM0 …” until we know whether we prefer the SUM form? I don’t know. It’s worth considering.

Julian


Re: SUM0 vs SUM

Posted by Vladimir Sitnikov <si...@gmail.com>.
How difficult it is?

Optiq's Enumerable convention is efficient enough so that SUM0 provides no
benefit: multiple pieces of state are handled seamlessly.

Sharing the effort might be a good thing, however I am not sure if it is
worth the effort of this particular special case.
I would rather implement sharing of the knowlegde "aggregate set had
non-null rows"​, than had all those SUM0 hardcodes here and there.

That is why I want to hear from Optiq users if SUM0 really helps end-users.

Vladimir

Re: SUM0 vs SUM

Posted by Julian Hyde <ju...@hydromatic.net>.
It’s difficult to implement SUM efficiently. It needs two pieces of state: an accumulator (containing the total of values seen so far) and a count (recording whether any records were seen).

SUM0 is simpler to implement. It contains only the accumulator, and outsources the counting to another agg function. Often there are several agg functions which each need a count, so the effort can be shared.

Julian