Posted to users@datasketches.apache.org by Kevin Peng <ke...@audigent.com> on 2022/05/23 17:20:04 UTC
Questions on using Theta Sketches
Hi All,
I am pretty new to the community and I am trying to wrap my head around
using the theta sketch Python library to compute approximate distinct
counts.
Here is my use case:
- I have the following table structure: visit_id, dimension (array),
date (a single GMT day, e.g. 1/1/2022)
- I want to run a distinct count of visit_ids over a dynamic date range
and group them by dimension sets, e.g. select count(distinct visit_id)
where date >= a and date <= b and dimension contains x or dimension
contains y and dimension contains z
What I am planning is:
- Create a theta sketch cube for each date and store it in a hashtable,
e.g. DynamoDB, using a workflow orchestration tool like Airflow
- Retrieve the theta sketch cubes for each day in the date range and do
unions and intersections on request
Here is my question:
- I was trying to look at this example:
https://github.com/apache/datasketches-cpp/blob/763f9249de576dca8c080fb4f3f438625a332b0b/python/tests/theta_test.py#L20
- For creating the sketches, should I be calculating the distinct
count grouped by date and dimension first and use that value, with the
key being some combination of the dimension and date?
- Would the blob I store in the hashtable be the key that I
construct, together with the result returned by the
generate_theta_sketch method in the example test?
- If this is the case, in order to query a date range, would I have
to construct a union of the same dimension across the different dates
within the date range first, before I can do any unions/intersections
of different dimension values in that date range? Is there an easier way?
--
Kevin Peng
Chief Engineer, DMP
305.775.2463
Re: Questions on using Theta Sketches
Posted by Karl Matthias <ka...@community.com>.
Hi Kevin,
Inserting all the `visit_id`s into a ThetaSketch by day will give you a
distinct "set" for the day. You can then union those across the range on
demand to get a distinct count over an arbitrary date range. The one caution I
would add is that unioning a very large number of sketches, or a set of
very large sketches, is a fairly slow operation. If your ranges are very
large and you need it to be quick (e.g. someone is waiting live on the
request), you might consider also pre-calculating sketches at a coarser
granularity than 1 day, to reduce the sketch count. The alternative is
not to pre-calculate anything at all, and to build a sketch for approximate
distinct counts in memory, on the fly. Again, it depends on data size, etc.
Hope that helps.
Karl