Posted to user@drill.apache.org by James Turton <ja...@somecomputer.xyz> on 2020/07/26 14:39:11 UTC

Aggregate UDF and HashAgg

Hi all

I'm writing an aggregate UDF with help from the notes here:

https://github.com/paul-rogers/drill/wiki/Aggregate-UDFs

I'm printing a line to stderr from each of the UDF methods so I can
keep an eye on the call sequence.  When my UDF is invoked by a
StreamingAgg operator the lifecycle of method calls - setup(), reset(),
add(), output() - is as described in the wiki.  When my UDF is invoked
by a HashAgg operator things change dramatically.  The setup() method is
called some hundreds of times and reset() is never called even though I
have three groups in the query's "group by"!  Anyone know what could be
happening here?

Thanks
James

-- 
PGP public key <http://somecomputer.xyz/james.asc>

Re: Aggregate UDF and HashAgg

Posted by Paul Rogers <pa...@gmail.com>.
Hi James,

The behavior you see can mostly be explained by noting the way the two
aggregates work. The streaming agg is a sequential operator: it works with
sorted data, starts one aggregate, gathers all data, then resets for the
next. The hash agg is a parallel aggregate: it runs all aggregates side
by side. It starts all of them at the same time, adds data to each
depending on the hash key as it arrives, and completes all of them at the
same time at the end. No reset is needed in a parallel agg, because each
group has its own aggregate and none is reused for a second group.

The real question is whether the parallel (hash) agg correctly calls the
add() method multiple times and then output() once for each of the
parallel aggregates.

You are seeing the key trade-off between the two implementations: the
sequential (streaming) agg is very memory frugal, but requires a sort to
organize data. The parallel (hash) agg requires no sort, at the cost of
more memory to hold all active groups in memory. Classic DB stuff.
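
To make the two call patterns concrete, here is a minimal sketch in plain
Java. It is not Drill code: SumAgg, streamingAgg and hashAgg are
hypothetical names made up for illustration. It assumes one setup() per
aggregate instance and creates each hash group lazily when its key first
appears, both simplifications, so it does not account for the hundreds of
setup() calls James reports.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class AggLifecycleSketch {

    /** Hypothetical stand-in for an aggregate UDF; logs every lifecycle call. */
    static final class SumAgg {
        private final List<String> log;
        private long sum;

        SumAgg(List<String> log) { this.log = log; }

        void setup() { log.add("setup"); }
        void reset() { sum = 0; log.add("reset"); }
        void add(long v) { sum += v; log.add("add"); }
        long output() { log.add("output"); return sum; }
    }

    /** Sequential style: input is sorted by key, so one aggregate is reused. */
    static List<String> streamingAgg(String[] sortedKeys, long[] values) {
        List<String> log = new ArrayList<>();
        SumAgg agg = new SumAgg(log);
        agg.setup();
        agg.reset();
        for (int i = 0; i < sortedKeys.length; i++) {
            if (i > 0 && !sortedKeys[i].equals(sortedKeys[i - 1])) {
                agg.output(); // key changed: emit the finished group
                agg.reset();  // then clear the state for the next group
            }
            agg.add(values[i]);
        }
        if (sortedKeys.length > 0) {
            agg.output();     // emit the last group
        }
        return log;
    }

    /** Parallel style: one aggregate per key, all held until the input ends. */
    static List<String> hashAgg(String[] keys, long[] values) {
        List<String> log = new ArrayList<>();
        Map<String, SumAgg> groups = new LinkedHashMap<>();
        for (int i = 0; i < keys.length; i++) {
            SumAgg agg = groups.get(keys[i]);
            if (agg == null) {        // first row for this key
                agg = new SumAgg(log);
                agg.setup();
                groups.put(keys[i], agg);
            }
            agg.add(values[i]);       // rows may arrive in any key order
        }
        for (SumAgg agg : groups.values()) {
            agg.output();             // every group completes at the end
        }
        return log;
    }

    public static void main(String[] args) {
        long[] values = {1, 2, 3, 4, 5, 6};
        String[] sorted = {"a", "a", "b", "b", "c", "c"};
        String[] unsorted = {"a", "b", "c", "a", "b", "c"};
        System.out.println("streaming: " + streamingAgg(sorted, values));
        System.out.println("hash:      " + hashAgg(unsorted, values));
    }
}
```

In this sketch the streaming version holds exactly one SumAgg and calls
reset() between groups, while the hash version never calls reset() and
keeps one SumAgg per distinct key in the map until the input is exhausted.
For a real UDF, the thing to check in the stderr trace is that each group
gets several add() calls followed by a single output().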

Thanks,

- Paul

