Posted to users@datasketches.apache.org by Kevin Peng <ke...@audigent.com> on 2022/05/23 17:20:04 UTC

Questions on using Theta Sketches

Hi All,

I am pretty new to the community and I am trying to get my head wrapped
around the usage of the theta sketch python library to compute approx
distinct counts.

Here is my use case:

   - I have the following table structure: visit_id, dimension (array),
   date (a single GMT day, e.g. 1/1/2022)
   - I want to run a distinct count of visit_ids over a dynamic date range
   and group them by dimension sets, e.g. select count(distinct visit_id)
   where date >= a and date <= b and dimension contains x or dimension
   contains y and dimension contains z

What I am planning is:

   - Create theta sketch cubes for each date and store them in a key-value
   store, e.g. DynamoDB, using a workflow orchestration tool like Airflow
   - Retrieve the theta sketch cubes for each day in the date range and do
   union and intersection on request

Here is my question:

   - I was trying to look at this example:
   https://github.com/apache/datasketches-cpp/blob/763f9249de576dca8c080fb4f3f438625a332b0b/python/tests/theta_test.py#L20
      - For creating the sketches should I be calculating the distinct
      count group by date and dimension first and use that value with the key
      being some combination of the dimension and date?
      - Would the blob I store into the hashtable be the key that I
      construct with the result returned back by the example
      generate_theta_sketch method in the example test?
         - If this is the case, in order to query a date range I would have
         to construct a union of similar dimensions with different dates
         within the date range first before I can do any unions/intersections
         of different dimension values in that date range?  Is there an
         easier way?


-- 
Kevin Peng
Chief Engineer, DMP
305.775.2463

Re: Questions on using Theta Sketches

Posted by Karl Matthias <ka...@community.com>.
Hi Kevin,

Inserting all the `visit_id`s into a theta sketch per day will give you a
distinct "set" for each day. You can then union those across the range on
demand and get a distinct count over an arbitrary date range. The one caution
I would make here is that unioning a very large number of sketches, or a set
of very large sketches, is a fairly slow operation. If your ranges are very
large and you need it to be quick (e.g. someone is waiting live on the
request), you might consider also pre-calculating sketches at a coarser
granularity than 1 day, to reduce the sketch count. The alternative is
not to pre-calculate anything at all, and to build a sketch for approximate
distinct counts in memory, on the fly. Again, it depends on data size, etc.
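
As a rough illustration of the per-day approach, here is a minimal sketch of
the query plan in Python. It is not the datasketches API: plain Python sets
stand in for theta sketches so the set algebra is easy to follow, and the
`store` layout, the dimension values and the function names are all
hypothetical. It assumes one entry per (day, dimension value), reads the
example query as (x OR y) AND z, and treats a visit_id as matching if it was
seen with the given dimension values on any day in the range. With the real
library, the `|` steps would presumably be done with `theta_union` and the
`&` step with `theta_intersection`, the classes exercised in the linked
theta_test.py, and the result would be an estimate rather than an exact count.

```python
from datetime import date, timedelta

# Hypothetical store: (day, dimension value) -> set of visit_ids seen that day.
# A plain set is an exact stand-in; in practice each value would be a
# serialized theta sketch that every visit_id for that key was fed into.
store = {
    (date(2022, 1, 1), "x"): {1, 2, 3},
    (date(2022, 1, 1), "z"): {2, 3, 4},
    (date(2022, 1, 2), "x"): {3, 5},
    (date(2022, 1, 2), "y"): {6, 7},
    (date(2022, 1, 2), "z"): {5, 6, 8},
}


def days_in_range(start, end):
    """Yield every day from start to end, inclusive."""
    day = start
    while day <= end:
        yield day
        day += timedelta(days=1)


def union_over_range(dimension, start, end):
    """Union the per-day sets for one dimension value across the date range."""
    result = set()
    for day in days_in_range(start, end):
        # Days with no data for this dimension contribute nothing.
        result |= store.get((day, dimension), set())
    return result


def count_x_or_y_and_z(start, end):
    """Distinct visit_ids matching (x OR y) AND z within the date range."""
    x = union_over_range("x", start, end)
    y = union_over_range("y", start, end)
    z = union_over_range("z", start, end)
    return len((x | y) & z)


result = count_x_or_y_and_z(date(2022, 1, 1), date(2022, 1, 2))
print(result)  # 4: visit_ids 2, 3, 5 and 6
```

The same shape applies to coarser pre-aggregates: a weekly or monthly entry
per dimension value is just the union of its daily entries computed ahead of
time, so a long range needs fewer unions at request time.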

Hope that helps.

Karl
