You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@drill.apache.org by Mike Beddo <Mi...@dataventures.com> on 2015/10/11 06:07:08 UTC

approximate count distinct?

We are evaluating Drill for making interactive SQL queries against customer sales transaction data. Many of our queries involve computing "penetration" numbers: count of unique customers, count of unique baskets, count of unique stores, etc. So far, using Drill to do aggregations involving COUNT, SUM, ... give acceptable query execution times. When including COUNT(DISTINCT <column>) in our queries, the execution times go from about 1 second to many minutes!

Has someone written a user-defined aggregate function to do approximate counting? We think a Bloom filter will serve our needs best.


-          Mike

Re: approximate count distinct?

Posted by Jacques Nadeau <ja...@dremio.com>.

Hey Ted, the first item you listed actually has straightforward work
around. It isn't elegant but you can use an ObjectHolder as a workspace
variable and then store whatever you want in it.

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Mon, Oct 12, 2015 at 9:16 PM, Ted Dunning <te...@gmail.com> wrote:

> I started work on this an ran straight into the brick wall that UDAF's
> can't have arbitrary structures as workspaces.
>
> A secondary road-block was that UDAF's can't be made into multi-level
> aggregators.
>
> I can't fix these problems because, there isn't enough documentation.
>
> I moved on to the rest of the gazillion things I need to do but I would
> love to come back to this.
>
>
>
> On Mon, Oct 12, 2015 at 2:59 PM, Jacques Nadeau <ja...@dremio.com>
> wrote:
>
> > This is something that has been talked about multiple times but no one
> has
> > started work on it yet (as far as I know).
> >
> > Do you want to open a JIRA and maybe we can collaborate on getting
> > something put together. There are probably a couple of dependent jiras
> that
> > will need to be resolved but having a concrete and useful UDAF driving
> the
> > requirements may be just the motivation to get help on those dependent
> > JIRAs.
> >
> > --
> > Jacques Nadeau
> > CTO and Co-Founder, Dremio
> >
> > On Sat, Oct 10, 2015 at 9:07 PM, Mike Beddo <Mike.Beddo@dataventures.com
> >
> > wrote:
> >
> > > We are evaluating Drill for making interactive SQL queries against
> > > customer sales transaction data. Many of our queries involve computing
> > > "penetration" numbers: count of unique customers, count of unique
> > baskets,
> > > count of unique stores, etc. So far, using Drill to do aggregations
> > > involving COUNT, SUM, ... give acceptable query execution times. When
> > > including COUNT(DISTINCT <column>) in our queries, the execution times
> go
> > > from about 1 second to many minutes!
> > >
> > > Has someone written a user-defined aggregate function to do approximate
> > > counting? We think a Bloom filter will serve our needs best.
> > >
> > >
> > > -          Mike
> > >
> >
>

Re: approximate count distinct?

Posted by Ted Dunning <te...@gmail.com>.

I started work on this an ran straight into the brick wall that UDAF's
can't have arbitrary structures as workspaces.

A secondary road-block was that UDAF's can't be made into multi-level
aggregators.

I can't fix these problems because, there isn't enough documentation.

I moved on to the rest of the gazillion things I need to do but I would
love to come back to this.



On Mon, Oct 12, 2015 at 2:59 PM, Jacques Nadeau <ja...@dremio.com> wrote:

> This is something that has been talked about multiple times but no one has
> started work on it yet (as far as I know).
>
> Do you want to open a JIRA and maybe we can collaborate on getting
> something put together. There are probably a couple of dependent jiras that
> will need to be resolved but having a concrete and useful UDAF driving the
> requirements may be just the motivation to get help on those dependent
> JIRAs.
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
> On Sat, Oct 10, 2015 at 9:07 PM, Mike Beddo <Mi...@dataventures.com>
> wrote:
>
> > We are evaluating Drill for making interactive SQL queries against
> > customer sales transaction data. Many of our queries involve computing
> > "penetration" numbers: count of unique customers, count of unique
> baskets,
> > count of unique stores, etc. So far, using Drill to do aggregations
> > involving COUNT, SUM, ... give acceptable query execution times. When
> > including COUNT(DISTINCT <column>) in our queries, the execution times go
> > from about 1 second to many minutes!
> >
> > Has someone written a user-defined aggregate function to do approximate
> > counting? We think a Bloom filter will serve our needs best.
> >
> >
> > -          Mike
> >
>

Re: approximate count distinct?

Posted by Jacques Nadeau <ja...@dremio.com>.

This is something that has been talked about multiple times but no one has
started work on it yet (as far as I know).

Do you want to open a JIRA and maybe we can collaborate on getting
something put together. There are probably a couple of dependent jiras that
will need to be resolved but having a concrete and useful UDAF driving the
requirements may be just the motivation to get help on those dependent
JIRAs.

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Sat, Oct 10, 2015 at 9:07 PM, Mike Beddo <Mi...@dataventures.com>
wrote:

> We are evaluating Drill for making interactive SQL queries against
> customer sales transaction data. Many of our queries involve computing
> "penetration" numbers: count of unique customers, count of unique baskets,
> count of unique stores, etc. So far, using Drill to do aggregations
> involving COUNT, SUM, ... give acceptable query execution times. When
> including COUNT(DISTINCT <column>) in our queries, the execution times go
> from about 1 second to many minutes!
>
> Has someone written a user-defined aggregate function to do approximate
> counting? We think a Bloom filter will serve our needs best.
>
>
> -          Mike
>