Posted to users@datasketches.apache.org by Will Lauer <wl...@yahooinc.com> on 2022/05/02 16:39:32 UTC

Re: [E] Re: Using Quantile sketches for additive metrics

OK, this is interesting. I've got some concerns and questions that I've put
inline below.

Will



Will Lauer


Senior Principal Architect, Audience & Advertising Reporting

Data Platforms & Systems Engineering


M 508 561 6427

Champaign Office

1908 S. First St

Champaign, IL 61822



On Mon, May 2, 2022 at 9:19 AM vijay rajan <vi...@gmail.com>
wrote:

> Thanks Lee. Please find my answers inline below in blue. I think you will
> find my use case very interesting. My next endeavor would be to make a
> decision tree with entropy / gini impurity measures with sketches. I am
> amazed at some of the results I have gotten. You may find this quite
> interesting.
>
> On Sun, May 1, 2022 at 12:55 AM leerho <le...@gmail.com> wrote:
>
>> Vijay,
>> Sorry about the delay in getting back to you.
>>
>> There is some critical information missing from your description and that
>> is the domain of what you are sketching.
>> I presume that it is User-IDs, otherwise it doesn't make sense.
>> If this is the case I think the solution can be achieved in a couple of
>> ways.  Going forward with this assumption:
>>
>
>> Your raw data would look something like this:
>>
>>> {UserID, AgeGroup, Gender, Country, Impressions, Clicks, AdSpend}
>>
>>
>>
> Actually it would be an event_id instead of user_Id. The domain for the
> data is a hypothetical online advertising summary.
> {EventId, AgeGroup, Gender, Country, Impressions, Clicks, AdSpend}
>
> The problem that we faced in the past was that it was impossible to "join"
> a click to an impression. While the raw impression data would have a unique
> event_id the raw click data would also have its own event_id. The
> dimensions were the same for clicks and impressions. What we did back in
> the day was to aggregate the data for impressions separately and then
> aggregate for clicks separately and then join.
>
> You could assume the same here. Each dimension aggregate tuple {AgeGroup,
> Gender and Country in our case} would have a theta sketch for impressions
> and another for clicks.
>

I don't see what a "theta sketch for impressions and another for clicks"
would mean here. Theta sketches are for doing distinct counts, so I could
understand a theta sketch for EventId, counting the number of distinct
event ids associated with the AgeGroup/Gender/Country combination. But that
doesn't seem to make sense for clicks or impressions. If your original data
was {EventId, AgeGroup, Gender, Country, Impressions, Clicks, AdSpend}, it
implies to me that clicks and impressions would always be 1 (that is, an
event is either a single click or single impression). When you drop the
EventId dimension, the natural aggregation for these (as well as for
AdSpend) is summation, giving you the total clicks, total impressions (and
total AdSpend) for the Age/Gender/Country combination. If you were to
generate a Theta sketch over those values, it would be nonsensical, as all
the impression or click values inserted would be the same and the theta
sketch would always return "1".
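
To make that concrete, here is rough code for the aggregation I would expect
(a minimal sketch; field names are hypothetical, assuming the
datasketches-java theta API):

    import org.apache.datasketches.theta.UpdateSketch;

    // One aggregate per AgeGroup/Gender/Country combination.
    final class ComboAggregate {
        final UpdateSketch distinctEvents = UpdateSketch.builder().build();
        long totalImpressions = 0;
        long totalClicks = 0;
        double totalAdSpend = 0.0;

        void add(String eventId, long impressions, long clicks, double adSpend) {
            distinctEvents.update(eventId);  // distinct EventIds: a sketch helps here
            totalImpressions += impressions; // plain sums: no sketch needed
            totalClicks += clicks;
            totalAdSpend += adSpend;
        }
    }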

>
>
>
>> A proposed solution would be to configure a TupleSketch with 3 fields:  (Impressions,
>> Clicks, AdSpend).
>> (This requires a little programming.  You would need to create classes
>> that implement the interfaces Summary,
>> SummaryFactory and SummarySetOperations.  There are examples of how to do
>> this on the website.)
>>
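>> For instance, while a custom Summary would bundle all 3 fields, the
>> library's built-in DoubleSummary in Sum mode shows the shape for a single
>> field (a minimal sketch, assuming the org.apache.datasketches.tuple package):
>>
>>     import org.apache.datasketches.tuple.UpdatableSketch;
>>     import org.apache.datasketches.tuple.UpdatableSketchBuilder;
>>     import org.apache.datasketches.tuple.adouble.DoubleSummary;
>>     import org.apache.datasketches.tuple.adouble.DoubleSummaryFactory;
>>
>>     public class AdSpendSketchExample {
>>         public static void main(String[] args) {
>>             // Tuple sketch keyed on EventId; each retained key carries
>>             // a summed AdSpend value.
>>             UpdatableSketch<Double, DoubleSummary> sketch =
>>                 new UpdatableSketchBuilder<Double, DoubleSummary>(
>>                     new DoubleSummaryFactory(DoubleSummary.Mode.Sum)).build();
>>
>>             sketch.update("event-123", 0.25);  // (EventId, AdSpend)
>>             sketch.update("event-456", 0.40);
>>         }
>>     }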
>>
> I will look into the Tuple sketch and the SummaryFactory and
> SummarySetOperations interfaces. I did try reading about them, but I cannot
> say I understood this completely. What type of sketch would we use for
> AdSpend? Would it be Quantile?
>

>
>
>> Then you would process your raw data as follows:
>>
>>    - Partition your raw data into 213 streams defined by the three
>>    dimensions:
>>       - 10 X AgeGroup
>>       - 3 X Gender
>>       - 200 X Country
>>    - Each stream would feed a configured TupleSketch as above. (so you
>>    need only 213 sketches).
>>    - These sketches are in the form of a 3D hypercube.
>>    - Each input tuple would feed 3 sketches, one in each dimension.
>>
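>> In code, the per-coordinate feeding might look like this (a minimal
>> sketch; shown with plain theta sketches for brevity, where the proposal
>> above would use the configured TupleSketch in the same positions):
>>
>>     import java.util.HashMap;
>>     import java.util.Map;
>>     import org.apache.datasketches.theta.UpdateSketch;
>>
>>     public class HypercubeExample {
>>         // One sketch per dimension coordinate, e.g. "AgeGroup=19-25",
>>         // "Gender=F", "Country=IN" (213 sketches in total).
>>         static final Map<String, UpdateSketch> cube = new HashMap<>();
>>
>>         static UpdateSketch coord(String key) {
>>             return cube.computeIfAbsent(key, k -> UpdateSketch.builder().build());
>>         }
>>
>>         // Each input tuple feeds three sketches, one per dimension.
>>         static void add(String eventId, String ageGroup, String gender,
>>                 String country) {
>>             coord("AgeGroup=" + ageGroup).update(eventId);
>>             coord("Gender=" + gender).update(eventId);
>>             coord("Country=" + country).update(eventId);
>>         }
>>     }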
>>
> I believe I did something similar. Let me try explaining this with
> concrete examples
>
>    1. This is the raw data
>    <https://drive.google.com/file/d/1zcwZB9KC7cASmM_8mvbPHPQaIeFJNrVw/view?usp=sharing>(please
>    click on the link https://tinyurl.com/3d2r7xjh
>    ) for a company in India that provides car drivers for hire. The data has
>    most columns as categorical variables except the first, which is tripId. The
>    schema looks something like this
>       1. { *tripId*,source,estimated_usage_bins,city,zone,zone_demand_popularity,dayType,pickUpHourOfDay,sla,booking_type,
>       *tripOutcome* }
>       2. tripId, the first column, is distinct per row; I put it into
>       theta sketches as shown later
>       3. tripOutcome, the last column, is a classification variable which
>       has the values "bad" or "good"
>       4. The problem statement is to do data analysis to find combinations of
>       dimensions that produce an unfavourable result, i.e. for which
>       "tripOutcome=bad/(tripOutcome=bad + tripOutcome=good)" is high
>    2. I was able to populate theta sketches for each dimension and value
>    pair (example "city=Mumbai")  and store them as my effective data
>    representation.
>       1. with base 10 and hence 2^10 nominal entries, the data is
>       https://tinyurl.com/bkvz4xrh
>
>       2. with base 12 and hence 2^12 nominal entries, the data is
>       https://tinyurl.com/f9mzm8ef
>
>       3. with base 14 and hence 2^14 nominal entries, the data is
>       https://tinyurl.com/2p8kwj3m
>       4. The data looks like this
>          1. dimension=value, Base64 encoded Theta sketch formed from
>          unique tripId
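>
> Producing those rows looks roughly like this (a minimal sketch; the trip
> ids and the output format are illustrative):
>
>     import java.util.Base64;
>     import org.apache.datasketches.theta.UpdateSketch;
>
>     public class SerializeExample {
>         public static void main(String[] args) {
>             // e.g. the sketch for "city=Mumbai", fed with unique tripIds
>             UpdateSketch sketch = UpdateSketch.builder()
>                 .setNominalEntries(1 << 14)  // 2^14 nominal entries
>                 .build();
>             sketch.update("trip-0001");
>             sketch.update("trip-0002");
>
>             // Compact, serialize, and Base64-encode for storage as text
>             String encoded = Base64.getEncoder()
>                 .encodeToString(sketch.compact().toByteArray());
>             System.out.println("city=Mumbai," + encoded);
>         }
>     }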
>
I'm not sure what you are sketching here. What is the entry being inserted
into the sketch? Basically, what are you counting? Normally, I would think
of sketching as taking your original data (
tripId,source,estimated_usage_bins,city,zone,zone_demand_popularity,dayType,pickUpHourOfDay,sla,booking_type,
tripOutcome ) and picking a column (or columns) that you want to count the
distinct values of and generating a new table that contains the dimensions
you care about plus that new sketch. In this case, if you are counting
trips, the resulting table would look like {source, estimate_usage_bins,
city, zone, zone_demand_popularity, dayType, pickupHourOfDay, sla,
booking_type, tripOutcome, sketch(tripId)}. You could then report on a
variety of combinations by dropping dimensions and aggregating (merging)
the sketches.
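
For example, "dropping dimensions and aggregating (merging)" is just a theta
union over the sketches of the rows that agree on the remaining dimension
values (a minimal sketch, assuming the datasketches-java theta API):

    import org.apache.datasketches.theta.CompactSketch;
    import org.apache.datasketches.theta.SetOperation;
    import org.apache.datasketches.theta.Sketch;
    import org.apache.datasketches.theta.Union;

    public class RollupExample {
        // Merge the trip sketches of rows that differ only in the dropped
        // dimension(s); the result estimates distinct trips for the rollup.
        public static CompactSketch rollup(Iterable<Sketch> groupMembers) {
            Union union = SetOperation.builder().buildUnion();
            for (Sketch s : groupMembers) {
                union.union(s);
            }
            return union.getResult();
        }
    }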

>
>    1. Next I run an Apriori-like frequent itemset algorithm [ECLAT,
>    https://towardsdatascience.com/the-eclat-algorithm-8ae3276d2d17
>    ] to generate combinations of "dimension=value" with a minimum support
>    threshold; the conjunction counting is sketched in code after the sample
>    results below. The input to this implementation is the files above from
>    point 2, e.g. https://tinyurl.com/2p8kwj3m
>    2. I produced frequent itemsets for both sides, i.e. tripOutcome=bad and
>    tripOutcome=good, and then joined the results. The answers I get are listed
>    below
>       1. with input having dimension=value and theta sketch with 2^10
>       nominal entries answers can be found here
>       https://tinyurl.com/yc3a3rde
>
>       2. with input having dimension=value and theta sketch with 2^12
>       nominal entries answers can be found here
>       https://tinyurl.com/5n7jjr32
>       3. with input having dimension=value and theta sketch with 2^14
>       nominal entries answers can be found here
>       https://tinyurl.com/3wrxp4a7
>    3. The results show that sketches with 2^14 (16K) nominal entries give
>    essentially exact results; the smaller sketches show errors (measured as
>    RMSE).
>    4. Sample of results
>       1.
>
>       *level of combination, rule for combination, bad count, good count,
>       total count, percentage of bad count*
>       2.
>
>       4,booking_type=one_way_trip & city=Pune & sla=Scheduled &
>       source=android,26.0,9.0,35.0,74.0
>
>       3,booking_type=one_way_trip & city=Pune &
>       source=android,26.0,10.0,36.0,72.0
>
>       5,city=Delhi NCR & pickUpHourOfDay=VERYLATE & sla=Scheduled &
>       source=ios & zone_demand_popularity=POPULARITY_INDEX_4,17.0,9.0,26.0,65.0
>
>       4,city=Delhi NCR & pickUpHourOfDay=VERYLATE & source=android &
>       zone_demand_popularity=POPULARITY_INDEX_5,15.0,9.0,24.0,63.0
>
>       5,booking_type=one_way_trip & city=Delhi NCR &
>       pickUpHourOfDay=VERYLATE & source=ios &
>       zone_demand_popularity=POPULARITY_INDEX_4,16.0,10.0,26.0,62.0
>
>       3,booking_type=one_way_trip & sla=Scheduled &
>       zone=205,14.0,9.0,23.0,61.0
>
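> The conjunction counts above come from intersecting the theta sketches,
> roughly like this (a minimal sketch; the method name is mine):
>
>     import org.apache.datasketches.theta.Intersection;
>     import org.apache.datasketches.theta.SetOperation;
>     import org.apache.datasketches.theta.Sketch;
>
>     public class ConjunctionExample {
>         // Estimate the support of a conjunction such as
>         // "booking_type=one_way_trip & city=Pune" by intersecting the
>         // per-"dimension=value" trip sketches; the bad percentage is then
>         // badSupport / (badSupport + goodSupport).
>         public static double support(Sketch... predicates) {
>             Intersection intersection = SetOperation.builder().buildIntersection();
>             for (Sketch s : predicates) {
>                 intersection.intersect(s);
>             }
>             return intersection.getResult().getEstimate();
>         }
>     }
>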
> I am aware from one of Lee's recent YouTube videos that theta sketches
> behave like exact sets (bit sets) when the sketch is large enough relative
> to the data. My use case is fine with a count of 100 being reported as 80
> or 120 (20% error), but not much more than that. I could use the Jaccard
> formula explained by Lee to get the margin of error.
>
>
> Sorry for the detail, but this should help you judge whether what I am
> doing is what tuple sketches already do on their own. I am using update
> sketches and merging them by conjunction (intersection). Code is here
> https://tinyurl.com/4pzfsj6v
>
>
>
>
>> A query would choose the desired coordinates of the 3 dimensions.
>> If the query selects more than one coordinate from any one dimension,
>> then you would need to first merge the corresponding coordinate sketches
>> of that dimension together.
>> A missing dimension would mean first merging all coordinate sketches of
>> that dimension together.
>> The final result sketch is an Intersection of the resulting 3 dimensional
>> sketches.
>>
>
> Yes, the code here should serve as an example: https://tinyurl.com/4pzfsj6v
>
>
>
>>
>> With the resulting Tuple sketch you iterate over all the retained items
>> (Summaries) and sum each of the 3 fields.
>> Then you divide each sum by Theta.  The result is the estimate of
>> Impressions, Clicks, and AdSpend
>> for the population of users selected by the query.  The size of that
>> population would be obtained by getEstimate().
>>
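>> For example, with the DoubleSummary-based sketch the estimate could be
>> computed like this (a minimal sketch; the method name is illustrative):
>>
>>     import org.apache.datasketches.tuple.Sketch;
>>     import org.apache.datasketches.tuple.SketchIterator;
>>     import org.apache.datasketches.tuple.adouble.DoubleSummary;
>>
>>     public class TotalEstimate {
>>         // Sum the summaries over the retained items, then divide by theta
>>         // to estimate the total for the selected population.
>>         public static double estimateTotal(Sketch<DoubleSummary> sketch) {
>>             double sum = 0;
>>             SketchIterator<DoubleSummary> it = sketch.iterator();
>>             while (it.next()) {
>>                 sum += it.getSummary().getValue();
>>             }
>>             return sum / sketch.getTheta();
>>         }
>>     }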
>>
> My question remains. Which base sketch type should I use for an additive
> metric like Ad-spend? Would that be a *quantile sketch*?
>

I still think this question doesn't make sense. What are you trying to
understand with the sketch? Are you trying to find out the distribution of
adspend for a particular set of dimension values? Then that sounds like a
quantile sketch. Are you trying to find out how many different values of
adspend there were? That's a distinct counting sketch. Are you trying to
find out the most common ad spend values? That would be a frequent items
sketch.

But really, none of those actually seem to match what you described above.
What you described above sounded like you were simply trying to calculate
the total AdSpend, which just requires sum, not a sketch.

>
>
>
>> Be aware that Intersections, by their nature, can increase error
>> significantly. (see the Website).
>> Try to avoid doing intersections with sketch coordinates that have very
>> small populations, if possible.
>>
>
> Yes. However, if I use a very high nominal entries value like 2^20,
> shouldn't this get mitigated?
>

The point of using a sketch is estimation. If you can't tolerate any error,
you shouldn't be using sketches in the first place. Don't just assume that
you SHOULD use a high nominal entries value. That said, when using theta
sketches and doing intersections, we do sometimes use values like 2^20 for
nominal entries, but often it is overkill. Remember, in the case of
intersections, you get high error when the intersection size is small
compared with the overall set sizes. The cases where you see high relative
error are the ones where you are looking at small intersections. In most
use cases, a relative error of 1,000% or even 10,000% on an intersection
size of 10 when comparing sets of a billion is still a reasonable error
given how small the _actual_ value is.
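
One way to see that spread directly is to ask the intersection result for
its bounds (a minimal sketch, assuming the datasketches-java theta API):

    import org.apache.datasketches.theta.CompactSketch;
    import org.apache.datasketches.theta.Intersection;
    import org.apache.datasketches.theta.SetOperation;
    import org.apache.datasketches.theta.Sketch;

    public class IntersectionErrorExample {
        public static void report(Sketch a, Sketch b) {
            Intersection intersection = SetOperation.builder().buildIntersection();
            intersection.intersect(a);
            intersection.intersect(b);
            CompactSketch result = intersection.getResult();

            // For a small intersection of large sets, expect a wide spread.
            System.out.printf("estimate=%.0f lb=%.0f ub=%.0f%n",
                result.getEstimate(),
                result.getLowerBound(2),  // 2 standard deviations
                result.getUpperBound(2));
        }
    }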

>
>
>
>> You can get a sense of how good your results are by utilizing the
>> getUpperBound() and getLowerBound()
>> (with respect to the user population).  The bigger that spread is, the
>> worse your estimates will be.
>>
> Thanks. This is perfect. I do not need to use the Jaccard to get my
> estimate.
>
>
>
>> The other option would be to do the cross-product upfront and create 6000
>> Tuple sketches.
>> In this case the input tuple would feed only one sketch and the query
>> would be to only one sketch.
>>
>
> Since I am not using SQL queries and have my own unique use case, my
> questions are quite different.
>
>
>> This can quickly get unwieldy with more dimensions or high cardinality
>> dimensions.
>> It is more accurate because it avoids the intersections.  It is also
>> possible to do a mix of
>> these two approaches, but you would need to think it through carefully.
>>
>> I hope this helps.
>>
>> Lee.
>>
>>
>>
>>
>>
>>
>> On Thu, Apr 28, 2022 at 1:35 AM vijay rajan <vi...@gmail.com>
>> wrote:
>>
>>> Hi,
>>> Just like theta sketches are used for distinct count metrics like
>>> impressions and clicks, is there a sketch (perhaps quantile?) that can be
>>> used for metrics like ad_spend? If so, what are the error bounds?
>>>
>>> I see a big opportunity in storing very little data in sketches (which I
>>> use as sets), which makes retrieval of aggregate/analytic data very fast
>>> (although it is approximate).
>>>
>>> This question is best explained with an example. Say I have a fact table
>>> schema as follows
>>>
>>>
>>>    1. *Age_Groups* is a dimension with 10 bucketed distinct values
>>>    {less than 18, 19-25, 26-35....}
>>>    2. *Gender* is a dimension with 3 distinct values F, M & Unknown
>>>    3. *Country* is a  dimension with 200 possible values
>>>    4. *Impressions* is a metric which is a long count (perhaps with
>>>    Theta sketch or HLL)
>>>    5. *Clicks* is a metric which is a long count (perhaps with Theta
>>>    sketch or HLL)
>>>    6. *Ad-Spend* is a metric which is a double which is a sum (*perhaps
>>>    with quantile sketch??*)
>>>
>>>
>>> The maximum possible number of entries in this table would be the cross
>>> product of the cardinalities of the dimensions, which is 10 (Age_Group) x
>>> 3 (Gender) x 200 (Country) = 6000. Now instead of storing 6000 records, I
>>> could store only (10 + 3 + 200) x *3 sketches (one each for
>>> impressions, clicks and Ad-Spend)* = 639 sketches and accomplish the
>>> group-by queries using the set operations that sketches provide.
>>>
>>> For the impression metric: one sketch for Gender=F, another sketch for
>>> Gender=M, yet another for Gender=Unknown, and so on for the other
>>> dimensions as well.
>>> For the click metric: one sketch for Gender=F, another sketch for
>>> Gender=M, yet another for Gender=Unknown, and so on for the other
>>> dimensions as well.
>>> The question is: what sketch would I need for ad-spend?
>>>
>>> On a side note, I have used Theta sketches in this manner and even
>>> implemented ECLAT for frequent itemset computations and association rule
>>> mining. An example from another project is below, with theta sketches used
>>> for counts; I do not know how to do the same for a metric like Ad-Spend.
>>>
>>> Level Conjunction Count
>>> 2 H10 & MemberGender=F 74
>>> 2 M15 & MemberGender=F 83
>>> 2 G31 & MemberGender=F 59
>>> 2 MemberGender=F & R13 66
>>> *In the example above, H10, M15 etc. are International Classification of
>>> Diseases (ICD) codes for specific diseases.* https://www.aapc.com/codes/code-search/
>>>
>>> Hope this is a clear representation of the problem.
>>>
>>> Regards
>>> Vijay
>>>
>>>
>>>
>>>
>>>

Re: [E] Re: Using Quantile sketches for additive metrics

Posted by vijay rajan <vi...@gmail.com>.
Hi Alex,

You are right. I assumed that quantile sketches would support an
intersection operation like theta sketches, but they don't.

If they did, keyed on transaction ID, it would have been cool. So many
traditional data mining & predictive analytics algorithms could be
reimplemented with sketches.

Thanks for your insight.

Vijay

On Mon, May 23, 2022, 22:18 Alexander Saydakov <sa...@yahooinc.com>
wrote:

> This seems to be a convoluted way of computing the sum of all values. This
> is an additive metric, easy to compute exactly, no sketches needed.
>
> On Sun, May 22, 2022 at 5:56 AM vijay rajan <vi...@gmail.com>
> wrote:
>
>> Hi folks (And Lee),
>>
>> I think I have found what I was looking for in quantile sketches, though I
>> am not able to formulate error bounds for it.
>> I should have raised a PR, but I am going to write the code here.
>> The code below estimates the volume of a quantiles sketch, based on the
>> example at
>> https://datasketches.apache.org/docs/Quantiles/QuantilesJavaExample.html.
>>
>> The assumption is that the values put in the sketch are all greater than
>> 0. It works like this:
>> 1. Get 100 quantiles using getQuantiles().
>> 2. Get the histogram mass fractions using getPMF().
>> 3. Use the average of each bucket's boundaries (getMinValue() and
>> getMaxValue() for the extremities), weighted by the bucket's mass and
>> getN(), to estimate the volume.
>>
>> The code is straightforward. I think this is what I was looking for, and
>> honestly I do not care if the max and min values are really extreme
>> compared to the rest of the numbers, as my use case is supposed to remove
>> outliers.
>>
>> private static double estimateVolume(DoublesSketch sketch) {
>>     double retVal = 0.0;
>>     int n = 100;
>>
>>     // Evenly spaced quantiles at ranks 0, 1/(n-1), ..., 1; note that
>>     // quantiles[0] is the min value and quantiles[n-1] is the max value.
>>     double [] quantiles = sketch.getQuantiles(n);
>>
>>     // getPMF() with n split points returns n + 1 mass fractions: bucket 0
>>     // holds values below quantiles[0] (empty here, since quantiles[0] is
>>     // the min) and the last bucket holds values at or above quantiles[n-1].
>>     double [] histogramCounts = sketch.getPMF(quantiles);
>>
>>     // Approximate each bucket's contribution as
>>     // (mass fraction) * (stream length) * (bucket midpoint).
>>     for (int i = 1; i < n; i++) {
>>         retVal += histogramCounts[i] * sketch.getN() * ((quantiles[i-1] + quantiles[i]) / 2);
>>     }
>>     // Values equal to the max fall into the last bucket.
>>     retVal += histogramCounts[n] * sketch.getN() * sketch.getMaxValue();
>>
>>     return retVal;
>> }
>>
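>> For example, a minimal driver (assuming the org.apache.datasketches.quantiles
>> package, with estimateVolume in the same class; note that if only the total
>> is needed, a plain running sum is exact and needs no sketch):
>>
>>     import org.apache.datasketches.quantiles.DoublesSketch;
>>     import org.apache.datasketches.quantiles.UpdateDoublesSketch;
>>
>>     public class VolumeExample {
>>         public static void main(String[] args) {
>>             UpdateDoublesSketch sketch = DoublesSketch.builder().build();
>>             double exactSum = 0.0;
>>             for (int i = 1; i <= 10000; i++) {
>>                 double adSpend = (i % 100) + 0.5;  // positive values only
>>                 sketch.update(adSpend);
>>                 exactSum += adSpend;               // exact, for comparison
>>             }
>>             System.out.println("estimated volume = " + estimateVolume(sketch));
>>             System.out.println("exact sum        = " + exactSum);
>>         }
>>     }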
>>
>>
>> Regards
>> Vijay
>>
>>
>>
>>
>>
>> On Mon, May 2, 2022 at 10:10 PM Will Lauer <wl...@yahooinc.com> wrote:
>>
>>> OK, this is interesting. I've got some concerns and questions that I've
>>> put inline below.
>>>
>>> Will
>>>
>>>
>>>
>>> Will Lauer
>>>
>>>
>>> Senior Principal Architect, Audience & Advertising Reporting
>>>
>>> Data Platforms & Systems Engineering
>>>
>>>
>>> M 508 561 6427
>>>
>>> Champaign Office
>>>
>>> 1908 S. First St
>>>
>>> Champaign, IL 61822
>>>
>>>
>>>
>>> On Mon, May 2, 2022 at 9:19 AM vijay rajan <vi...@gmail.com>
>>> wrote:
>>>
>>>> Thanks Lee. Please find my answers inline below in blue. I think you
>>>> will find my use case very interesting. My next endeavor would be to make a
>>>> decision tree with entropy / gini impurity measures with sketches. I am
>>>> amazed at some of the results I have gotten. You may find this quite
>>>> interesting.
>>>>
>>>> On Sun, May 1, 2022 at 12:55 AM leerho <le...@gmail.com> wrote:
>>>>
>>>>> Vijay,
>>>>> Sorry about the delay in getting back to you.
>>>>>
>>>>> There is some critical information missing from your description and
>>>>> that is the domain of what you are sketching.
>>>>> I presume that it is User-IDs, otherwise it doesn't make sense.
>>>>> If this is the case I think the solution can be achieved in a couple
>>>>> of ways.  Going forward with this assumption:
>>>>>
>>>>
>>>>> Your raw data would look something like this:
>>>>>
>>>>>> {UserID, AgeGroup, Gender, Country, Impressions, Clicks, AdSpend}
>>>>>
>>>>>
>>>>>
>>>> Actually it would be an event_id instead of user_Id. The domain for the
>>>> data is a hypothetical online advertising summary.
>>>> {EventId, AgeGroup, Gender, Country, Impressions, Clicks, AdSpend}
>>>>
>>>> The problem that we faced in the past was that it was impossible to
>>>> "join" a click to an impression. While the raw impression data would have a
>>>> unique event_id the raw click data would also have its own event_id. The
>>>> dimensions were the same for clicks and impressions. What we did back in
>>>> thge day was aggregate the data for impressions separately and then
>>>> aggregate for clicks separately and then join.
>>>>
>>>> You could assume the same here. Each dimension aggregate tuple
>>>> {AgeGroup, Gender and Country in out case} would have a theta sketch for
>>>> impressions and another for clicks.
>>>>
>>>
>>> I don't see what a "theta sketch for impressions and another for clicks"
>>> would mean here. Theta sketches are for doing distinct counts, so I could
>>> understand a theta sketch for EventId, counting the number of distinct
>>> event ids associated with the AgeGroup/Gender/Country combination. But that
>>> doesn't seem to make sense for clicks or impressions. If your original data
>>> was {EventId, AgeGroup, Gender, Country, Impressions, Clicks, AdSpend}, it
>>> implies to me that clicks and impressions would always be 1 (that is, an
>>> event is either a single click or single impression). When you drop the
>>> EventId dimension, the natural aggregation for these (as well as for
>>> AdSpend) is summation, giving you the total clicks, total impressions (and
>>> total AdSpend) for the Age/Gender/Country combination. If you were to
>>> generate a Theta sketch, it seems like it would be non-sensical, as you all
>>> impressions or clicks inserted would be the same value and the theta sketch
>>> would always return "1".
>>>
>>>>
>>>>
>>>>
>>>>> A proposed solution would be to configure a TupleSketch with 3 fields:
>>>>>  (Impressions, Clicks, AdSpend).
>>>>> (This requires a little programming.  You would need to create classes
>>>>> that implement the interfaces Summary,
>>>>> SummaryFactory and SummarySetOperations.  There are examples of how to
>>>>> do this on the website.)
>>>>>
>>>>>
>>>> I will look into Tuple sketch and the SummaryFactory and Summary Set
>>>> operations. I did try reading it but I cannot say I understood this
>>>> completely. What type of sketch would we use for AdSpend? Would it be
>>>> Quantile?
>>>>
>>>
>>>>
>>>>
>>>>> Then you would process your raw data as follows:
>>>>>
>>>>>    - Partition your raw data into 213 streams defined by the three
>>>>>    dimensions:
>>>>>       - 10 X AgeGroup
>>>>>       - 3 X Gender
>>>>>       - 200 X Country
>>>>>    - Each stream would feed a configured TupleSketch as above. (so
>>>>>    you need only 213 sketches).
>>>>>    - These sketches are in the form of a 3D hypercube.
>>>>>    - Each input tuple would feed 3 sketches, one in each dimension.
>>>>>
>>>>>
>>>> I believe I did something similar. Let me try explaining this with
>>>> concrete examples
>>>>
>>>>    1. This is the raw data
>>>>    <https://urldefense.com/v3/__https://drive.google.com/file/d/1zcwZB9KC7cASmM_8mvbPHPQaIeFJNrVw/view?usp=sharing__;!!Op6eflyXZCqGR5I!HkafhezcasvOAcC_LolHeVdS4lQ6RK0hiMgY3wnIiCdOcLxN0a6Xl0vEMxfPXHDWKWlP4zU41o3xAojDzS2zhJlequTMOw$>(please
>>>>    click on the link https://tinyurl.com/3d2r7xjh
>>>>    <https://urldefense.com/v3/__https://tinyurl.com/3d2r7xjh__;!!Op6eflyXZCqGR5I!HkafhezcasvOAcC_LolHeVdS4lQ6RK0hiMgY3wnIiCdOcLxN0a6Xl0vEMxfPXHDWKWlP4zU41o3xAojDzS2zhJmz6G7UcA$>
>>>>    ) for a company in India that provides car drivers for hire. The data has
>>>>    most columns as categorical variables except the first which is tripId. The
>>>>    schema looks something like this
>>>>       1. { *tripId*,source,estimated_usage_bins,city,zone,zone_demand_popularity,dayType,pickUpHourOfDay,sla,booking_type,
>>>>       *tripOutcome* }
>>>>       2. tripId the first column, is distinct per row which I put into
>>>>       theta sketches as shown later
>>>>       3. tripOutcome the last column, is a classification variable
>>>>       which has values "bad or good"
>>>>       4. The problem statement is to dod data analysis of combinations
>>>>       of dimensions that produce an unfavourable result that is
>>>>       "tripOutcome=bad/(tripOutcome=bad + tripOutcome=good)" is high
>>>>    2. I was able to populate theta sketches for each dimension and
>>>>    value pair (example "city=Mumbai")  and store them as my effective data
>>>>    representation.
>>>>       1. with 10 as base and hence nominal entries as 2^10 the data is
>>>>       https://tinyurl.com/bkvz4xrh
>>>>       <https://urldefense.com/v3/__https://tinyurl.com/bkvz4xrh__;!!Op6eflyXZCqGR5I!HkafhezcasvOAcC_LolHeVdS4lQ6RK0hiMgY3wnIiCdOcLxN0a6Xl0vEMxfPXHDWKWlP4zU41o3xAojDzS2zhJmA8ikIbg$>
>>>>
>>>>       2. with 12 as base and hence nominal entries as 2^12 the data is
>>>>       https://tinyurl.com/f9mzm8ef
>>>>       <https://urldefense.com/v3/__https://tinyurl.com/f9mzm8ef__;!!Op6eflyXZCqGR5I!HkafhezcasvOAcC_LolHeVdS4lQ6RK0hiMgY3wnIiCdOcLxN0a6Xl0vEMxfPXHDWKWlP4zU41o3xAojDzS2zhJm3OdhKSA$>
>>>>
>>>>       3. with 14 as base and hence nominal entries as 2^14 the data is
>>>>       https://tinyurl.com/2p8kwj3m
>>>>       <https://urldefense.com/v3/__https://tinyurl.com/2p8kwj3m__;!!Op6eflyXZCqGR5I!HkafhezcasvOAcC_LolHeVdS4lQ6RK0hiMgY3wnIiCdOcLxN0a6Xl0vEMxfPXHDWKWlP4zU41o3xAojDzS2zhJkD1YuIMA$>
>>>>       4. The data looks like this
>>>>          1. dimension=value, Base64 encoded Theta sketch formed from
>>>>          unique tripId
>>>>
>>>> I'm not sure what you are sketching here. What is the entry being
>>> inserted into the sketch? Basically, what are you counting? Normally, I
>>> would think of sketching as taking your original data (  tripId,source,estimated_usage_bins,city,zone,zone_demand_popularity,dayType,pickUpHourOfDay,sla,booking_type,
>>> tripOutcome ) and picking a column (or columns) that you want to count
>>> the distinct values of and generating a new table that contains the
>>> dimensions you care about plus that new sketch. In this case, if you are
>>> counting trips, the resulting table would look like {source,
>>> estimate_usage_bins, city, zone, zone_demand_popularity, dayType,
>>> pickupHourOfDay, sla, booking_type, tripOutcome, sketch(tripId)}. You could
>>> then report on a variety of combinations by dropping dimensions and
>>> aggregating (merging) the sketches.
>>>
>>>>
>>>>    1. Next I run an Apriori like Frequent Itemset algorithm [ECLAT
>>>>    https://towardsdatascience.com/the-eclat-algorithm-8ae3276d2d17
>>>>    <https://urldefense.com/v3/__https://towardsdatascience.com/the-eclat-algorithm-8ae3276d2d17__;!!Op6eflyXZCqGR5I!HkafhezcasvOAcC_LolHeVdS4lQ6RK0hiMgY3wnIiCdOcLxN0a6Xl0vEMxfPXHDWKWlP4zU41o3xAojDzS2zhJnr167ghw$>
>>>>    ] to generate a combination of "dimension=value" with a minimum support
>>>>    threshold. The input to this implementation takes the files above at point
>>>>    2. https://tinyurl.com/2p8kwj3m
>>>>    <https://urldefense.com/v3/__https://tinyurl.com/2p8kwj3m__;!!Op6eflyXZCqGR5I!HkafhezcasvOAcC_LolHeVdS4lQ6RK0hiMgY3wnIiCdOcLxN0a6Xl0vEMxfPXHDWKWlP4zU41o3xAojDzS2zhJkD1YuIMA$>
>>>>    2. I produced frequent itemsets for both sides i.e tripOutcome=bad
>>>>    and tripOutcome=good and then join the results. The answers I get are
>>>>    listed below
>>>>       1. with input having dimension=value and theta sketch with 2^10
>>>>       nominal entries answers can be found here
>>>>       https://tinyurl.com/yc3a3rde
>>>>       <https://urldefense.com/v3/__https://tinyurl.com/yc3a3rde__;!!Op6eflyXZCqGR5I!HkafhezcasvOAcC_LolHeVdS4lQ6RK0hiMgY3wnIiCdOcLxN0a6Xl0vEMxfPXHDWKWlP4zU41o3xAojDzS2zhJnFg2WTnQ$>
>>>>
>>>>       2. with input having dimension=value and theta sketch with 2^12
>>>>       nominal entries answers can be found here
>>>>       https://tinyurl.com/5n7jjr32
>>>>       <https://urldefense.com/v3/__https://tinyurl.com/5n7jjr32__;!!Op6eflyXZCqGR5I!HkafhezcasvOAcC_LolHeVdS4lQ6RK0hiMgY3wnIiCdOcLxN0a6Xl0vEMxfPXHDWKWlP4zU41o3xAojDzS2zhJnpRcNAoQ$>
>>>>       3. with input having dimension=value and theta sketch with 2^14
>>>>       nominal entries answers can be found here
>>>>       https://tinyurl.com/3wrxp4a7
>>>>       <https://urldefense.com/v3/__https://tinyurl.com/3wrxp4a7__;!!Op6eflyXZCqGR5I!HkafhezcasvOAcC_LolHeVdS4lQ6RK0hiMgY3wnIiCdOcLxN0a6Xl0vEMxfPXHDWKWlP4zU41o3xAojDzS2zhJkCF5fBKg$>
>>>>    3.  The results show that 2^14 (16K) nominal sketches give
>>>>    immaculate results. There are errors in the results from the other sketches
>>>>    (RMSE).
>>>>    4. Sample of results
>>>>       1.
>>>>
>>>>       *level of combination, rule for combination, bad count, good
>>>>       count, percentage of bad count*
>>>>       2.
>>>>
>>>>       4,booking_type=one_way_trip & city=Pune & sla=Scheduled &
>>>>       source=android,26.0,9.0,35.0,74.0
>>>>
>>>>       3,booking_type=one_way_trip & city=Pune &
>>>>       source=android,26.0,10.0,36.0,72.0
>>>>
>>>>       5,city=Delhi NCR & pickUpHourOfDay=VERYLATE & sla=Scheduled &
>>>>       source=ios & zone_demand_popularity=POPULARITY_INDEX_4,17.0,9.0,26.0,65.0
>>>>
>>>>       4,city=Delhi NCR & pickUpHourOfDay=VERYLATE & source=android &
>>>>       zone_demand_popularity=POPULARITY_INDEX_5,15.0,9.0,24.0,63.0
>>>>
>>>>       5,booking_type=one_way_trip & city=Delhi NCR &
>>>>       pickUpHourOfDay=VERYLATE & source=ios &
>>>>       zone_demand_popularity=POPULARITY_INDEX_4,16.0,10.0,26.0,62.0
>>>>
>>>>       3,booking_type=one_way_trip & sla=Scheduled &
>>>>       zone=205,14.0,9.0,23.0,61.0
>>>>
>>>> I am aware from one of Lee's recent youtube videos that theta sketches
>>>> work like Sets(Bit sets) when large enough and the data is small. My use
>>>> case is fine with a count of 100 being reported as 80 or 120 (20%) but not
>>>> much higher. I could use the jaccard formula explained by Lee to get the
>>>> margin of error.
>>>>
>>>>
>>>> Sorry for the details but this should help you suggest if what I am
>>>> doing is indeed what tuple sketches does on its own. I am using update
>>>> sketches for merging using conjunction. Code is here *https://tinyurl.com/4pzfsj6v
>>>> <https://urldefense.com/v3/__https://tinyurl.com/4pzfsj6v__;!!Op6eflyXZCqGR5I!HkafhezcasvOAcC_LolHeVdS4lQ6RK0hiMgY3wnIiCdOcLxN0a6Xl0vEMxfPXHDWKWlP4zU41o3xAojDzS2zhJlcvki1Mg$>*
>>>>
>>>>
>>>>
>>>>
>>>>> A query would choose the desired coordinates of the 3 dimensions.
>>>>> If the query selects more than one coordinate from any one dimension,
>>>>> then you would need to first merge the corresponding coordinate
>>>>> sketches of that dimension together.
>>>>> A missing dimension would mean first merging all coordinate sketches
>>>>> of that dimension together.
>>>>> The final result sketch is an Intersection of the resulting 3
>>>>> dimensional sketches.
>>>>>
>>>>
>>>> Yes. the code here should serve as an example
>>>> https://tinyurl.com/4pzfsj6v
>>>> <https://urldefense.com/v3/__https://tinyurl.com/4pzfsj6v__;!!Op6eflyXZCqGR5I!HkafhezcasvOAcC_LolHeVdS4lQ6RK0hiMgY3wnIiCdOcLxN0a6Xl0vEMxfPXHDWKWlP4zU41o3xAojDzS2zhJlcvki1Mg$>
>>>>
>>>>
>>>>
>>>>>
>>>>> With the resulting Tuple sketch you iterate over all the retained
>>>>> items (Summaries) and sum each of the 3 fields.
>>>>> Then you divide each sum by Theta.  The result is the estimate of
>>>>> Impressions, Clicks, and AdSpend
>>>>> for the population of users selected by the query.  The size of that
>>>>> population would be obtained by getEstimate().
>>>>>
>>>>>
>>>> My question remains. Which base sketch type should I use for an
>>>> additive metric like Ad-spend? Would that be a *quantile sketch*?
>>>>
>>>
>>> I still think this question doesn't make sense. What are you trying to
>>> understand with the sketch? Are you trying to find out the distribution of
>>> adspend for a particular set of dimension values? Then that sounds like a
>>> quantile sketch, Are you trying to find out how many different values of
>>> adspend there were? That's a distinct counting sketch. Are you trying to
>>> find out the most common ad spend values? That would be a frequent items
>>> sketch.
>>>
>>> But really, none of those actually seem to match what you described
>>> above. What you described above sounded like you were simply trying to
>>> calculate the total AdSpend, which just requires sum, not a sketch.
>>>
>>>>
>>>>
>>>>
>>>>> Be aware that Intersections, by their nature, can increase error
>>>>> significantly. (see the Website).
>>>>> Try to avoid doing intersections with sketch coordinates that have
>>>>> very small populations, if possible.
>>>>>
>>>>
>>>> Yes. However if I use very high nominal entries like 2^20 shouldn't
>>>> this get mitigated?
>>>>
>>>
>>> The point of using a sketch is estimation. If you can't tolerate any
>>> error, you shouldn't be using the sketches in the first place. don't just
>>> assume that you SHOULD use a high nominal entries. That said, when using
>>> theta sketches and doing intersections, we do sometimes use values like
>>> 2^20 for nominal entries, but often it is overkill. Remember, in the case
>>> of intersections, you get high error when the intersection size is small
>>> compared with the overall set sizes, The case where you see high relative
>>> error are the cases where you are looking at small intersections. In most
>>> use cases, a relative error of 1,000% or even 10,000%  on an intersection
>>> size of 10 when comparing sets of a billion is still a reasonable error
>>> given how small the _actual_ value is.
>>>
>>>>
>>>>
>>>>
>>>>> You can get a sense of how good your results are by utilizing the
>>>>> getUpperBound() and getLowerBound()
>>>>> (with respect to the user population.).  The bigger that spread is,
>>>>> the worse your estimates will be.
>>>>>
>>>>> Thanks. This is perfect. I do not need to use the Jaccard to get my
>>>> estimate
>>>>
>>>>
>>>>
>>>>> The other option would be to do the cross-product upfront and create
>>>>> 6000 Tuple sketches.
>>>>> In this case the input tuple would feed only one sketch and the query
>>>>> would be to only one sketch.
>>>>>
>>>>
>>>> Since I am not using SQL queries and have my own unique use case, my
>>>> questions are quite different.
>>>>
>>>>
>>>>> This can quickly get unwieldy with more dimensions or high cardinality
>>>>> dimensions.
>>>>> It is more accurate because it avoids the intersections.  It is also
>>>>> possible to do a mix of
>>>>> these two approaches, but you would need to think it through carefully.
>>>>>
>>>>> I hope this helps.
>>>>>
>>>>> Lee.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Apr 28, 2022 at 1:35 AM vijay rajan <
>>>>> vijay.sankar.rajan@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>> Just like theta sketches are used for distinct count metrics like
>>>>>> impressions and clicks, is there a sketch (perhaps quantile?) that can be
>>>>>> used for metrics like ad_spend? If so, what are the error bounds?
>>>>>>
>>>>>> There is a big opportunity that I see in storing very little data in
>>>>>> sketches (which I use as a set) which makes retrieval of aggregate/analytic
>>>>>> data very fast (although it is approximate).
>>>>>>
>>>>>> This question is best explained with an example. Say I have a fact
>>>>>> table schema as follows
>>>>>>
>>>>>>
>>>>>>    1. *Age_Groups* is a dimension with 10 bucketed distinct values
>>>>>>    {less than 18, 19-25, 26-35....}
>>>>>>    2. *Gender* is a dimension with 3 distinct values F, M & Unknown
>>>>>>    3. *Country* is a  dimension with 200 possible values
>>>>>>    4. *Impressions* is a metric which is a long count (perhaps with
>>>>>>    Theta sketch or HLL)
>>>>>>    5. *Clicks* is a metric which is a long count (perhaps with Theta
>>>>>>    sketch or HLL)
>>>>>>    6. *Ad-Spend* is a metric which is a double which is a sum (*perhaps
>>>>>>    with quantile sketch??*)
>>>>>>
>>>>>>
>>>>>> The maximum possible number of entries in this table would be the
>>>>>> cross product of the cardinalities of the dimensions which is 10(Age_Group)
>>>>>> x 3(Gender) x 200(Country) = 6000. Now instead of storing 6000 records, I
>>>>>> could store only (10 + 3 + 200) * *3 sketches(**one each for
>>>>>> impression, clicks and Ad-Spend)* = 639 sketches and accomplish the
>>>>>> group by queries using set operations that sketches provides.
>>>>>>
>>>>>> For impression metric, one sketch for Gender=F, another sketch for
>>>>>> Gender=M, yet another for Gender=Unknown and so on for other dimensions as
>>>>>> well.
>>>>>> For click metric, one sketch for Gender=F, another sketch for
>>>>>> Gender=M, yet another for Gender=Unknown and so on for other dimensions as
>>>>>> well.
>>>>>> Question is what sketch would I need for ad-spend?
>>>>>>
>>>>>> On a side note, I have used Theta sketches in this manner and even
>>>>>> implemented ECLAT for frequent itemset computations and association rule
>>>>>> mining ... example from another project below with theta sketches used for
>>>>>> count, I do not know how to do the same for a metric like Ad-Spend.
>>>>>>
>>>>>> Level Conjunction Count
>>>>>> 2 H10 & MemberGender=F 74
>>>>>> 2 M15 & MemberGender=F 83
>>>>>> 2 G31 & MemberGender=F 59
>>>>>> 2 MemberGender=F & R13 66
>>>>>> *In the example above, H10, M15 etc are International Disease codes
>>>>>> for specific diseases. *https://www.aapc.com/codes/code-search/
>>>>>> <https://urldefense.com/v3/__https://www.aapc.com/codes/code-search/__;!!Op6eflyXZCqGR5I!HkafhezcasvOAcC_LolHeVdS4lQ6RK0hiMgY3wnIiCdOcLxN0a6Xl0vEMxfPXHDWKWlP4zU41o3xAojDzS2zhJnBPvqOMw$>
>>>>>>
>>>>>> Hope this is a clear representation of the problem.
>>>>>>
>>>>>> Regards
>>>>>> Vijay
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>

Re: [E] Re: Using Quantile sketches for additive metrics

Posted by Alexander Saydakov <sa...@yahooinc.com>.
This seems to be a convoluted way of computing the sum of all values. This
is an additive metric, easy to compute exactly, no sketches needed.

On Sun, May 22, 2022 at 5:56 AM vijay rajan <vi...@gmail.com>
wrote:

> Hi folks (And Lee),
>
> I think I have found what I was looking for in quantile sketches though I
> am not able to formulate error bounds for the same.
> I should have raised a PR request but I am going to write the code here.
> The code below estimates the volume of the quantile sketche based on the
> example
> https://datasketches.apache.org/docs/Quantiles/QuantilesJavaExample.html
> <https://urldefense.com/v3/__https://datasketches.apache.org/docs/Quantiles/QuantilesJavaExample.html__;!!Op6eflyXZCqGR5I!CL-lHnjU1bAay6Rk9U0ORoIKw-HDloG81s_Kve9k0arHZpZRyLYDHO0iJGTgIHqo-4SrlYoii5ErA1X6UVOv50QOcuzs$>
> .
>
> The assumption is that the values put in the sketch are all greater than
> 0. It kind of works like this.
> 1. Get 100 quantiles using getQuantiles.
> 2. Get Histogram percentage numbers
> 3. The average of the boundaries of the quantiles(getMinValue and
> getMaxValue for the extremities)  will be used with getN to estimate the
> volume.
>
> Code is straightforward. I think this was what I was looking for and
> honestly I do not care if the max and min values are really extreme
> compared to the rest of the numbers as my use case is supposed to remove
> outliers.
>
> private static double estimateVolume(DoublesSketch sketch) {
>     double retVal = 0.0;
>     int n = 100;
>
>     double [] quantiles = sketch.getQuantiles(n);
>     int totalItems = sketch.getRetainedItems();
>     double [] histogramCounts = new double[n];
>
>     //Arrays.sort(quantiles);
>     histogramCounts = sketch.getPMF(quantiles);
>
>     //compute the volume
>     retVal += histogramCounts[0] * sketch.getN() * ((sketch.getMinValue() +  quantiles[0]) / 2);
>     for (int i = 1; i < n-1; i++) {
>         retVal += histogramCounts[i] * sketch.getN() * (( quantiles[i-1]+  quantiles[i]) / 2);
>     }
>     retVal += histogramCounts[n-1] * sketch.getN() * ((sketch.getMaxValue() +  quantiles[n-1]) / 2);
>
>     return retVal;
> }
>
>
>
> Regards
> Vijay
>
>
>
>
>
> On Mon, May 2, 2022 at 10:10 PM Will Lauer <wl...@yahooinc.com> wrote:
>
>> OK, this is interesting. I've got some concerns and questions that I've
>> put inline below.
>>
>> Will
>>
>>
>>
>> Will Lauer
>>
>>
>> Senior Principal Architect, Audience & Advertising Reporting
>>
>> Data Platforms & Systems Engineering
>>
>>
>> M 508 561 6427
>>
>> Champaign Office
>>
>> 1908 S. First St
>>
>> Champaign, IL 61822
>>
>>
>>
>> On Mon, May 2, 2022 at 9:19 AM vijay rajan <vi...@gmail.com>
>> wrote:
>>
>>> Thanks Lee. Please find my answers inline below in blue. I think you
>>> will find my use case very interesting. My next endeavor would be to make a
>>> decision tree with entropy / gini impurity measures with sketches. I am
>>> amazed at some of the results I have gotten. You may find this quite
>>> interesting.
>>>
>>> On Sun, May 1, 2022 at 12:55 AM leerho <le...@gmail.com> wrote:
>>>
>>>> Vijay,
>>>> Sorry about the delay in getting back to you.
>>>>
>>>> There is some critical information missing from your description and
>>>> that is the domain of what you are sketching.
>>>> I presume that it is User-IDs, otherwise it doesn't make sense.
>>>> If this is the case I think the solution can be achieved in a couple of
>>>> ways.  Going forward with this assumption:
>>>>
>>>
>>>> Your raw data would look something like this:
>>>>
>>>>> {UserID, AgeGroup, Gender, Country, Impressions, Clicks, AdSpend}
>>>>
>>>>
>>>>
>>> Actually it would be an event_id instead of user_Id. The domain for the
>>> data is a hypothetical online advertising summary.
>>> {EventId, AgeGroup, Gender, Country, Impressions, Clicks, AdSpend}
>>>
>>> The problem that we faced in the past was that it was impossible to
>>> "join" a click to an impression. While the raw impression data would have a
>>> unique event_id the raw click data would also have its own event_id. The
>>> dimensions were the same for clicks and impressions. What we did back in
>>> thge day was aggregate the data for impressions separately and then
>>> aggregate for clicks separately and then join.
>>>
>>> You could assume the same here. Each dimension aggregate tuple
>>> {AgeGroup, Gender and Country in out case} would have a theta sketch for
>>> impressions and another for clicks.
>>>
>>
>> I don't see what a "theta sketch for impressions and another for clicks"
>> would mean here. Theta sketches are for doing distinct counts, so I could
>> understand a theta sketch for EventId, counting the number of distinct
>> event ids associated with the AgeGroup/Gender/Country combination. But that
>> doesn't seem to make sense for clicks or impressions. If your original data
>> was {EventId, AgeGroup, Gender, Country, Impressions, Clicks, AdSpend}, it
>> implies to me that clicks and impressions would always be 1 (that is, an
>> event is either a single click or single impression). When you drop the
>> EventId dimension, the natural aggregation for these (as well as for
>> AdSpend) is summation, giving you the total clicks, total impressions (and
>> total AdSpend) for the Age/Gender/Country combination. If you were to
>> generate a Theta sketch, it seems like it would be non-sensical, as you all
>> impressions or clicks inserted would be the same value and the theta sketch
>> would always return "1".
>>
>>>
>>>
>>>
>>>> A proposed solution would be to configure a TupleSketch with 3 fields:
>>>>  (Impressions, Clicks, AdSpend).
>>>> (This requires a little programming.  You would need to create classes
>>>> that implement the interfaces Summary,
>>>> SummaryFactory and SummarySetOperations.  There are examples of how to
>>>> do this on the website.)
>>>>
>>>>
>>> I will look into Tuple sketch and the SummaryFactory and Summary Set
>>> operations. I did try reading it but I cannot say I understood this
>>> completely. What type of sketch would we use for AdSpend? Would it be
>>> Quantile?
>>>
>>
>>>
>>>
>>>> Then you would process your raw data as follows:
>>>>
>>>>    - Partition your raw data into 213 streams defined by the three
>>>>    dimensions:
>>>>       - 10 X AgeGroup
>>>>       - 3 X Gender
>>>>       - 200 X Country
>>>>    - Each stream would feed a configured TupleSketch as above. (so you
>>>>    need only 213 sketches).
>>>>    - These sketches are in the form of a 3D hypercube.
>>>>    - Each input tuple would feed 3 sketches, one in each dimension.
>>>>
>>>>
>>> I believe I did something similar. Let me try explaining this with
>>> concrete examples
>>>
>>>    1. This is the raw data
>>>    <https://urldefense.com/v3/__https://drive.google.com/file/d/1zcwZB9KC7cASmM_8mvbPHPQaIeFJNrVw/view?usp=sharing__;!!Op6eflyXZCqGR5I!HkafhezcasvOAcC_LolHeVdS4lQ6RK0hiMgY3wnIiCdOcLxN0a6Xl0vEMxfPXHDWKWlP4zU41o3xAojDzS2zhJlequTMOw$>(please
>>>    click on the link https://tinyurl.com/3d2r7xjh
>>>    <https://urldefense.com/v3/__https://tinyurl.com/3d2r7xjh__;!!Op6eflyXZCqGR5I!HkafhezcasvOAcC_LolHeVdS4lQ6RK0hiMgY3wnIiCdOcLxN0a6Xl0vEMxfPXHDWKWlP4zU41o3xAojDzS2zhJmz6G7UcA$>
>>>    ) for a company in India that provides car drivers for hire. The data has
>>>    most columns as categorical variables except the first which is tripId. The
>>>    schema looks something like this
>>>       1. { *tripId*,source,estimated_usage_bins,city,zone,zone_demand_popularity,dayType,pickUpHourOfDay,sla,booking_type,
>>>       *tripOutcome* }
>>>       2. tripId the first column, is distinct per row which I put into
>>>       theta sketches as shown later
>>>       3. tripOutcome the last column, is a classification variable
>>>       which has values "bad or good"
>>>       4. The problem statement is to dod data analysis of combinations
>>>       of dimensions that produce an unfavourable result that is
>>>       "tripOutcome=bad/(tripOutcome=bad + tripOutcome=good)" is high
>>>    2. I was able to populate theta sketches for each dimension and
>>>    value pair (example "city=Mumbai")  and store them as my effective data
>>>    representation.
>>>       1. with 10 as base and hence nominal entries as 2^10 the data is
>>>       https://tinyurl.com/bkvz4xrh
>>>       <https://urldefense.com/v3/__https://tinyurl.com/bkvz4xrh__;!!Op6eflyXZCqGR5I!HkafhezcasvOAcC_LolHeVdS4lQ6RK0hiMgY3wnIiCdOcLxN0a6Xl0vEMxfPXHDWKWlP4zU41o3xAojDzS2zhJmA8ikIbg$>
>>>
>>>       2. with 12 as base and hence nominal entries as 2^12 the data is
>>>       https://tinyurl.com/f9mzm8ef
>>>       <https://urldefense.com/v3/__https://tinyurl.com/f9mzm8ef__;!!Op6eflyXZCqGR5I!HkafhezcasvOAcC_LolHeVdS4lQ6RK0hiMgY3wnIiCdOcLxN0a6Xl0vEMxfPXHDWKWlP4zU41o3xAojDzS2zhJm3OdhKSA$>
>>>
>>>       3. with 14 as base and hence nominal entries as 2^14 the data is
>>>       https://tinyurl.com/2p8kwj3m
>>>       <https://urldefense.com/v3/__https://tinyurl.com/2p8kwj3m__;!!Op6eflyXZCqGR5I!HkafhezcasvOAcC_LolHeVdS4lQ6RK0hiMgY3wnIiCdOcLxN0a6Xl0vEMxfPXHDWKWlP4zU41o3xAojDzS2zhJkD1YuIMA$>
>>>       4. The data looks like this
>>>          1. dimension=value, Base64 encoded Theta sketch formed from
>>>          unique tripId
>>>
>>> I'm not sure what you are sketching here. What is the entry being
>> inserted into the sketch? Basically, what are you counting? Normally, I
>> would think of sketching as taking your original data (  tripId,source,estimated_usage_bins,city,zone,zone_demand_popularity,dayType,pickUpHourOfDay,sla,booking_type,
>> tripOutcome ) and picking a column (or columns) that you want to count
>> the distinct values of and generating a new table that contains the
>> dimensions you care about plus that new sketch. In this case, if you are
>> counting trips, the resulting table would look like {source,
>> estimate_usage_bins, city, zone, zone_demand_popularity, dayType,
>> pickupHourOfDay, sla, booking_type, tripOutcome, sketch(tripId)}. You could
>> then report on a variety of combinations by dropping dimensions and
>> aggregating (merging) the sketches.
>>
>>>
>>>    1. Next I run an Apriori like Frequent Itemset algorithm [ECLAT
>>>    https://towardsdatascience.com/the-eclat-algorithm-8ae3276d2d17
>>>    <https://urldefense.com/v3/__https://towardsdatascience.com/the-eclat-algorithm-8ae3276d2d17__;!!Op6eflyXZCqGR5I!HkafhezcasvOAcC_LolHeVdS4lQ6RK0hiMgY3wnIiCdOcLxN0a6Xl0vEMxfPXHDWKWlP4zU41o3xAojDzS2zhJnr167ghw$>
>>>    ] to generate a combination of "dimension=value" with a minimum support
>>>    threshold. The input to this implementation takes the files above at point
>>>    2. https://tinyurl.com/2p8kwj3m
>>>    <https://urldefense.com/v3/__https://tinyurl.com/2p8kwj3m__;!!Op6eflyXZCqGR5I!HkafhezcasvOAcC_LolHeVdS4lQ6RK0hiMgY3wnIiCdOcLxN0a6Xl0vEMxfPXHDWKWlP4zU41o3xAojDzS2zhJkD1YuIMA$>
>>>    2. I produced frequent itemsets for both sides i.e tripOutcome=bad
>>>    and tripOutcome=good and then join the results. The answers I get are
>>>    listed below
>>>       1. with input having dimension=value and theta sketch with 2^10
>>>       nominal entries answers can be found here
>>>       https://tinyurl.com/yc3a3rde
>>>       <https://urldefense.com/v3/__https://tinyurl.com/yc3a3rde__;!!Op6eflyXZCqGR5I!HkafhezcasvOAcC_LolHeVdS4lQ6RK0hiMgY3wnIiCdOcLxN0a6Xl0vEMxfPXHDWKWlP4zU41o3xAojDzS2zhJnFg2WTnQ$>
>>>
>>>       2. with input having dimension=value and theta sketch with 2^12
>>>       nominal entries answers can be found here
>>>       https://tinyurl.com/5n7jjr32
>>>       <https://urldefense.com/v3/__https://tinyurl.com/5n7jjr32__;!!Op6eflyXZCqGR5I!HkafhezcasvOAcC_LolHeVdS4lQ6RK0hiMgY3wnIiCdOcLxN0a6Xl0vEMxfPXHDWKWlP4zU41o3xAojDzS2zhJnpRcNAoQ$>
>>>       3. with input having dimension=value and theta sketch with 2^14
>>>       nominal entries answers can be found here
>>>       https://tinyurl.com/3wrxp4a7
>>>       <https://urldefense.com/v3/__https://tinyurl.com/3wrxp4a7__;!!Op6eflyXZCqGR5I!HkafhezcasvOAcC_LolHeVdS4lQ6RK0hiMgY3wnIiCdOcLxN0a6Xl0vEMxfPXHDWKWlP4zU41o3xAojDzS2zhJkCF5fBKg$>
>>>    3.  The results show that 2^14 (16K) nominal sketches give
>>>    immaculate results. There are errors in the results from the other sketches
>>>    (RMSE).
>>>    4. Sample of results
>>>       1.
>>>
>>>       *level of combination, rule for combination, bad count, good
>>>       count, percentage of bad count*
>>>       2.
>>>
>>>       4,booking_type=one_way_trip & city=Pune & sla=Scheduled &
>>>       source=android,26.0,9.0,35.0,74.0
>>>
>>>       3,booking_type=one_way_trip & city=Pune &
>>>       source=android,26.0,10.0,36.0,72.0
>>>
>>>       5,city=Delhi NCR & pickUpHourOfDay=VERYLATE & sla=Scheduled &
>>>       source=ios & zone_demand_popularity=POPULARITY_INDEX_4,17.0,9.0,26.0,65.0
>>>
>>>       4,city=Delhi NCR & pickUpHourOfDay=VERYLATE & source=android &
>>>       zone_demand_popularity=POPULARITY_INDEX_5,15.0,9.0,24.0,63.0
>>>
>>>       5,booking_type=one_way_trip & city=Delhi NCR &
>>>       pickUpHourOfDay=VERYLATE & source=ios &
>>>       zone_demand_popularity=POPULARITY_INDEX_4,16.0,10.0,26.0,62.0
>>>
>>>       3,booking_type=one_way_trip & sla=Scheduled &
>>>       zone=205,14.0,9.0,23.0,61.0
>>>
>>> I am aware from one of Lee's recent youtube videos that theta sketches
>>> work like Sets(Bit sets) when large enough and the data is small. My use
>>> case is fine with a count of 100 being reported as 80 or 120 (20%) but not
>>> much higher. I could use the jaccard formula explained by Lee to get the
>>> margin of error.
>>>
>>>
>>> Sorry for the details but this should help you suggest if what I am
>>> doing is indeed what tuple sketches does on its own. I am using update
>>> sketches for merging using conjunction. Code is here *https://tinyurl.com/4pzfsj6v
>>> <https://urldefense.com/v3/__https://tinyurl.com/4pzfsj6v__;!!Op6eflyXZCqGR5I!HkafhezcasvOAcC_LolHeVdS4lQ6RK0hiMgY3wnIiCdOcLxN0a6Xl0vEMxfPXHDWKWlP4zU41o3xAojDzS2zhJlcvki1Mg$>*
>>>
>>>
>>>
>>>
>>>> A query would choose the desired coordinates of the 3 dimensions.
>>>> If the query selects more than one coordinate from any one dimension,
>>>> then you would need to first merge the corresponding coordinate
>>>> sketches of that dimension together.
>>>> A missing dimension would mean first merging all coordinate sketches of
>>>> that dimension together.
>>>> The final result sketch is an Intersection of the resulting 3
>>>> dimensional sketches.
>>>>
>>>
>>> Yes. the code here should serve as an example
>>> https://tinyurl.com/4pzfsj6v
>>> <https://urldefense.com/v3/__https://tinyurl.com/4pzfsj6v__;!!Op6eflyXZCqGR5I!HkafhezcasvOAcC_LolHeVdS4lQ6RK0hiMgY3wnIiCdOcLxN0a6Xl0vEMxfPXHDWKWlP4zU41o3xAojDzS2zhJlcvki1Mg$>
>>>
>>>
>>>
>>>>
>>>> With the resulting Tuple sketch you iterate over all the retained items
>>>> (Summaries) and sum each of the 3 fields.
>>>> Then you divide each sum by Theta.  The result is the estimate of
>>>> Impressions, Clicks, and AdSpend
>>>> for the population of users selected by the query.  The size of that
>>>> population would be obtained by getEstimate().
>>>>
>>>>
>>> My question remains. Which base sketch type should I use for an additive
>>> metric like Ad-spend? Would that be a *quantile sketch*?
>>>
>>
>> I still think this question doesn't make sense. What are you trying to
>> understand with the sketch? Are you trying to find out the distribution of
>> adspend for a particular set of dimension values? Then that sounds like a
>> quantile sketch, Are you trying to find out how many different values of
>> adspend there were? That's a distinct counting sketch. Are you trying to
>> find out the most common ad spend values? That would be a frequent items
>> sketch.
>>
>> But really, none of those actually seem to match what you described
>> above. What you described above sounded like you were simply trying to
>> calculate the total AdSpend, which just requires sum, not a sketch.
>>
>>>
>>>
>>>
>>>> Be aware that Intersections, by their nature, can increase error
>>>> significantly. (see the Website).
>>>> Try to avoid doing intersections with sketch coordinates that have very
>>>> small populations, if possible.
>>>>
>>>
>>> Yes. However if I use very high nominal entries like 2^20 shouldn't this
>>> get mitigated?
>>>
>>
>> The point of using a sketch is estimation. If you can't tolerate any
>> error, you shouldn't be using sketches in the first place. Don't just
>> assume that you SHOULD use a high nominal entries value. That said, when
>> using theta sketches and doing intersections, we do sometimes use values
>> like 2^20 for nominal entries, but often it is overkill. Remember, in the
>> case of intersections, you get high error when the intersection size is
>> small compared with the overall set sizes. The cases where you see high
>> relative error are the cases where you are looking at small intersections.
>> In most use cases, a relative error of 1,000% or even 10,000% on an
>> intersection size of 10 when comparing sets of a billion is still a
>> reasonable error given how small the _actual_ value is.
>>
>>>
>>>
>>>
>>>> You can get a sense of how good your results are by utilizing the
>>>> getUpperBound() and getLowerBound()
>>>> (with respect to the user population). The bigger that spread is, the
>>>> worse your estimates will be.
>>>>
>>>>
>>> Thanks. This is perfect. I do not need to use the Jaccard to get my
>>> estimate.
>>>
>>>
>>>
>>>> The other option would be to do the cross-product upfront and create
>>>> 6000 Tuple sketches.
>>>> In this case the input tuple would feed only one sketch and the query
>>>> would be to only one sketch.
>>>>
>>>
>>> Since I am not using SQL queries and have my own unique use case, my
>>> questions are quite different.
>>>
>>>
>>>> This can quickly get unwieldy with more dimensions or high cardinality
>>>> dimensions.
>>>> It is more accurate because it avoids the intersections.  It is also
>>>> possible to do a mix of
>>>> these two approaches, but you would need to think it through carefully.
>>>>
>>>> I hope this helps.
>>>>
>>>> Lee.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, Apr 28, 2022 at 1:35 AM vijay rajan <
>>>> vijay.sankar.rajan@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>> Just like theta sketches are used for distinct count metrics like
>>>>> impressions and clicks, is there a sketch (perhaps quantile?) that can be
>>>>> used for metrics like ad_spend? If so, what are the error bounds?
>>>>>
>>>>> There is a big opportunity that I see in storing very little data in
>>>>> sketches (which I use as a set) which makes retrieval of aggregate/analytic
>>>>> data very fast (although it is approximate).
>>>>>
>>>>> This question is best explained with an example. Say I have a fact
>>>>> table schema as follows
>>>>>
>>>>>
>>>>>    1. *Age_Groups* is a dimension with 10 bucketed distinct values
>>>>>    {less than 18, 19-25, 26-35....}
>>>>>    2. *Gender* is a dimension with 3 distinct values F, M & Unknown
>>>>>    3. *Country* is a  dimension with 200 possible values
>>>>>    4. *Impressions* is a metric which is a long count (perhaps with
>>>>>    Theta sketch or HLL)
>>>>>    5. *Clicks* is a metric which is a long count (perhaps with Theta
>>>>>    sketch or HLL)
>>>>>    6. *Ad-Spend* is a metric which is a double which is a sum (*perhaps
>>>>>    with quantile sketch??*)
>>>>>
>>>>>
>>>>> The maximum possible number of entries in this table would be the
>>>>> cross product of the cardinalities of the dimensions which is 10(Age_Group)
>>>>> x 3(Gender) x 200(Country) = 6000. Now instead of storing 6000 records, I
>>>>> could store only (10 + 3 + 200) * 3 sketches (one each for
>>>>> impressions, clicks, and Ad-Spend) = 639 sketches and accomplish the
>>>>> group-by queries using the set operations that sketches provide.
>>>>>
>>>>> For impression metric, one sketch for Gender=F, another sketch for
>>>>> Gender=M, yet another for Gender=Unknown and so on for other dimensions as
>>>>> well.
>>>>> For click metric, one sketch for Gender=F, another sketch for
>>>>> Gender=M, yet another for Gender=Unknown and so on for other dimensions as
>>>>> well.
>>>>> Question is what sketch would I need for ad-spend?
>>>>>
>>>>> On a side note, I have used Theta sketches in this manner and even
>>>>> implemented ECLAT for frequent itemset computations and association rule
>>>>> mining ... example from another project below with theta sketches used for
>>>>> count; I do not know how to do the same for a metric like Ad-Spend.
>>>>>
>>>>> Level Conjunction Count
>>>>> 2 H10 & MemberGender=F 74
>>>>> 2 M15 & MemberGender=F 83
>>>>> 2 G31 & MemberGender=F 59
>>>>> 2 MemberGender=F & R13 66
>>>>> In the example above, H10, M15 etc. are International Classification of
>>>>> Diseases (ICD) codes for specific diseases:
>>>>> https://www.aapc.com/codes/code-search/
>>>>>
>>>>> Hope this is a clear representation of the problem.
>>>>>
>>>>> Regards
>>>>> Vijay
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>

Re: [E] Re: Using Quantile sketches for additive metrics

Posted by vijay rajan <vi...@gmail.com>.
Hi folks (and Lee),

I think I have found what I was looking for in quantile sketches, though I
am not yet able to formulate error bounds for it.
I should have raised a PR, but I am going to write the code here.
The code below estimates the total (the "volume") of the values in a quantiles
sketch, based on the example
https://datasketches.apache.org/docs/Quantiles/QuantilesJavaExample.html .

The assumption is that the values put in the sketch are all greater than 0.
It works like this:
1. Get 100 evenly spaced quantile values using getQuantiles.
2. Use those values as split points for getPMF to get the mass of each
histogram bucket.
3. Multiply each bucket's mass by getN() and by the midpoint of the bucket's
boundaries (using getMinValue() and getMaxValue() at the extremities), then
sum the results to estimate the volume.

The code is straightforward. I think this is what I was looking for, and
honestly I do not care if the max and min values are really extreme
compared to the rest of the numbers, as my use case is supposed to remove
outliers.

private static double estimateVolume(DoublesSketch sketch) {
    final int n = 100;

    // The n evenly spaced quantile values serve as split points, so getPMF
    // returns n + 1 bucket masses: one below quantiles[0], n - 1 interior
    // buckets, and one at or above quantiles[n - 1]. getPMF requires the
    // split points to be strictly increasing, so the quantile values must
    // be distinct.
    final double[] quantiles = sketch.getQuantiles(n);
    final double[] histogramCounts = sketch.getPMF(quantiles);

    // Approximate each bucket's contribution to the total as
    // (bucket mass) * N * (midpoint of the bucket's boundaries).
    double retVal = histogramCounts[0] * sketch.getN()
            * ((sketch.getMinValue() + quantiles[0]) / 2);
    for (int i = 1; i < n; i++) {
        retVal += histogramCounts[i] * sketch.getN()
                * ((quantiles[i - 1] + quantiles[i]) / 2);
    }
    retVal += histogramCounts[n] * sketch.getN()
            * ((sketch.getMaxValue() + quantiles[n - 1]) / 2);

    return retVal;
}
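
For reference, here is a minimal way to exercise it. This is only an
illustrative sketch: the ad-spend stream below is made-up data, and it assumes
the datasketches-java quantiles classes are on the classpath.

import org.apache.datasketches.quantiles.DoublesSketch;
import org.apache.datasketches.quantiles.UpdateDoublesSketch;

// Feed hypothetical positive ad-spend values, then compare the volume
// estimate against the exact running sum.
UpdateDoublesSketch sketch = DoublesSketch.builder().setK(128).build();
double exactSum = 0.0;
for (int i = 0; i < 1_000_000; i++) {
    double adSpend = 0.01 + (i % 500) * 0.02;  // stand-in for a real spend stream
    sketch.update(adSpend);
    exactSum += adSpend;
}
System.out.println("estimated volume: " + estimateVolume(sketch));
System.out.println("exact sum:        " + exactSum);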



Regards
Vijay






Re: [E] Re: Using Quantile sketches for additive metrics

Posted by vijay rajan <vi...@gmail.com>.
Thanks Will. Please find my reply inline below.
Just to stay in line with my original question about a sketch for additive
metrics: the point is that I can use such a sketch for on-the-fly aggregation
by storing one such sketch per "dimension=value" pair, without having to go
back to the table for aggregation.

I would separately like to walk Lee, you, or whomever you nominate through
the experiments that I have performed. You may find my learnings very
interesting.

This is really hard to explain in words; perhaps a call would work. I am more
than happy to share code and the thought process.



On Mon, May 2, 2022 at 10:10 PM Will Lauer <wl...@yahooinc.com> wrote:

> OK, this is interesting. I've got some concerns and questions that I've
> put inline below.
>
> Will
>
>
>
> Will Lauer
>
>
> Senior Principal Architect, Audience & Advertising Reporting
>
> Data Platforms & Systems Engineering
>
>
> M 508 561 6427
>
> Champaign Office
>
> 1908 S. First St
>
> Champaign, IL 61822
>
>
>
> On Mon, May 2, 2022 at 9:19 AM vijay rajan <vi...@gmail.com>
> wrote:
>
>> Thanks Lee. Please find my answers inline below in blue. I think you
>> will find my use case very interesting. My next endeavor would be to make a
>> decision tree with entropy / gini impurity measures with sketches. I am
>> amazed at some of the results I have gotten. You may find this quite
>> interesting.
>>
>> On Sun, May 1, 2022 at 12:55 AM leerho <le...@gmail.com> wrote:
>>
>>> Vijay,
>>> Sorry about the delay in getting back to you.
>>>
>>> There is some critical information missing from your description and
>>> that is the domain of what you are sketching.
>>> I presume that it is User-IDs, otherwise it doesn't make sense.
>>> If this is the case I think the solution can be achieved in a couple of
>>> ways.  Going forward with this assumption:
>>>
>>
>>> Your raw data would look something like this:
>>>
>>>> {UserID, AgeGroup, Gender, Country, Impressions, Clicks, AdSpend}
>>>
>>>
>>>
>> Actually it would be an event_id instead of user_Id. The domain for the
>> data is a hypothetical online advertising summary.
>> {EventId, AgeGroup, Gender, Country, Impressions, Clicks, AdSpend}
>>
>> The problem that we faced in the past was that it was impossible to
>> "join" a click to an impression. While the raw impression data would have a
>> unique event_id the raw click data would also have its own event_id. The
>> dimensions were the same for clicks and impressions. What we did back in
>> thge day was aggregate the data for impressions separately and then
>> aggregate for clicks separately and then join.
>>
>> You could assume the same here. Each dimension aggregate tuple {AgeGroup,
>> Gender and Country in out case} would have a theta sketch for impressions
>> and another for clicks.
>>
>
> I don't see what a "theta sketch for impressions and another for clicks"
> would mean here. Theta sketches are for doing distinct counts, so I could
> understand a theta sketch for EventId, counting the number of distinct
> event ids associated with the AgeGroup/Gender/Country combination. But that
> doesn't seem to make sense for clicks or impressions.
>

Just as background, back in the day the only way to join aggregates for
clicks and aggregates for impressions was via the dimensions. What this
means is that a tuple in the impressions aggregation table with AgeGroup=30-40,
Gender=F, Country=UK would be joined with the tuple AgeGroup=30-40,
Gender=F, Country=UK in the clicks aggregation table for the same hour.
There was no way to find the exact impression and its corresponding click.

So the schema for the click stream would be {EventId, AgeGroup, Gender,
Country} while the schema for the impression stream would be identical
{EventId, AgeGroup, Gender, Country}. Note that the EventId in click has no
relationship with EventId of impressions.


> If your original data was {EventId, AgeGroup, Gender, Country,
> Impressions, Clicks, AdSpend}, it implies to me that clicks and
> impressions would always be 1 (that is, an event is either a single click
> or single impression). When you drop the EventId dimension, the natural
> aggregation for these (as well as for AdSpend) is summation, giving you the
> total clicks, total impressions (and total AdSpend) for the
> Age/Gender/Country combination. If you were to generate a Theta sketch, it
> seems like it would be non-sensical, as you all impressions or clicks
> inserted would be the same value and the theta sketch would always return
> "1".
>
>>
>>
>>
>
Let me try to explain it this way: the EventId is a distinct id for an
impression or a click. Look at it as a UUID of some sort. While building a
theta sketch for the impression stream for each dimension=value
combination, I use the eventId to update the sketch. Example: one theta
sketch for AgeGroup=x1, another theta sketch for AgeGroup=x2, and so on. If
you open this file https://tinyurl.com/2p8kwj3m it will make sense. Each line
is nothing but "dimension=value,Base64-serialized theta sketch".
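
To make the construction concrete, here is a minimal sketch of it, assuming
the datasketches-java theta classes; the Event type, its accessors, and the
sketchByDimValue map are hypothetical names for illustration, not from the
shared code:

import org.apache.datasketches.theta.UpdateSketch;
import java.util.HashMap;
import java.util.Map;

Map<String, UpdateSketch> sketchByDimValue = new HashMap<>();
for (Event e : events) {  // one row per impression (or click) event
    for (Map.Entry<String, String> dv : e.dimensions().entrySet()) {
        String key = dv.getKey() + "=" + dv.getValue();  // e.g. "city=Mumbai"
        sketchByDimValue
            .computeIfAbsent(key, k -> UpdateSketch.builder()
                .setNominalEntries(1 << 14).build())
            .update(e.eventId());  // the distinct event id feeds the sketch
    }
}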



>
>>> A proposed solution would be to configure a TupleSketch with 3 fields: (Impressions,
>>> Clicks, AdSpend).
>>> (This requires a little programming.  You would need to create classes
>>> that implement the interfaces Summary,
>>> SummaryFactory and SummarySetOperations.  There are examples of how to
>>> do this on the website.)
>>>
>>>
>> I will look into Tuple sketch and the SummaryFactory and Summary Set
>> operations. I did try reading it but I cannot say I understood this
>> completely. What type of sketch would we use for AdSpend? Would it be
>> Quantile?
>>
>
>>
>>
>>> Then you would process your raw data as follows:
>>>
>>>    - Partition your raw data into 213 streams defined by the three
>>>    dimensions:
>>>       - 10 X AgeGroup
>>>       - 3 X Gender
>>>       - 200 X Country
>>>    - Each stream would feed a configured TupleSketch as above. (so you
>>>    need only 213 sketches).
>>>    - These sketches are in the form of a 3D hypercube.
>>>    - Each input tuple would feed 3 sketches, one in each dimension.
>>>
>>>
>> I believe I did something similar. Let me try explaining this with
>> concrete examples
>>
>>    1. This is the raw data (please
>>    click on the link https://tinyurl.com/3d2r7xjh
>>    ) for a company in India that provides car drivers for hire. The data has
>>    most columns as categorical variables except the first which is tripId. The
>>    schema looks something like this
>>       1. { *tripId*,source,estimated_usage_bins,city,zone,zone_demand_popularity,dayType,pickUpHourOfDay,sla,booking_type,
>>       *tripOutcome* }
>>       2. tripId the first column, is distinct per row which I put into
>>       theta sketches as shown later
>>       3. tripOutcome the last column, is a classification variable which
>>       has values "bad or good"
>>       4. The problem statement is to do data analysis of combinations
>>       of dimensions that produce an unfavourable result, i.e. where
>>       "tripOutcome=bad/(tripOutcome=bad + tripOutcome=good)" is high
>>    2. I was able to populate theta sketches for each dimension and value
>>    pair (example "city=Mumbai")  and store them as my effective data
>>    representation.
>>       1. with 10 as base and hence nominal entries as 2^10 the data is
>>       https://tinyurl.com/bkvz4xrh
>>
>>       2. with 12 as base and hence nominal entries as 2^12 the data is
>>       https://tinyurl.com/f9mzm8ef
>>
>>       3. with 14 as base and hence nominal entries as 2^14 the data is
>>       https://tinyurl.com/2p8kwj3m
>>       4. The data looks like this
>>          1. dimension=value, Base64 encoded Theta sketch formed from
>>          unique tripId
>>
>>
> I'm not sure what you are sketching here. What is the entry being
> inserted into the sketch?
>
The distinct event id is inserted in the sketch.


> Basically, what are you counting?
>
I would like to estimate the distinct number of impressions for any
combination, for example "AgeGroup=x1 & Gender=M": how many impressions were
served.
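
As a minimal sketch of that query, assuming the datasketches-java theta set
operations and the hypothetical per-"dimension=value" sketch map from my
earlier reply:

import org.apache.datasketches.theta.CompactSketch;
import org.apache.datasketches.theta.Intersection;
import org.apache.datasketches.theta.SetOperation;

// Estimate distinct impressions for "AgeGroup=x1 & Gender=M" by intersecting
// the two per-value sketches.
Intersection intersection = SetOperation.builder().buildIntersection();
intersection.intersect(sketchByDimValue.get("AgeGroup=x1"));
intersection.intersect(sketchByDimValue.get("Gender=M"));
CompactSketch result = intersection.getResult();
double impressions = result.getEstimate();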


> Normally, I would think of sketching as taking your original data (
> tripId,source,estimated_usage_bins,city,zone,zone_demand_popularity,dayType,pickUpHourOfDay,sla,booking_type,
> tripOutcome ) and picking a column (or columns) that you want to count
> the distinct values of and generating a new table that contains the
> dimensions you care about plus that new sketch.
>

I am actually taking it a notch up. I am using a theta sketch as a BitVector
that holds an estimate of the number of clicks (or number of impressions),
and using it to do data mining and association rule mining. I would be
thrilled to get on a call and explain. Please feel free to schedule a call
from 10:00 pm to 12:00 pm IST on any weekday. I am in Bangalore.
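
For the association-rule side, the same intersections give approximate
support and confidence. A minimal sketch under the same assumptions, where
sketchA, sketchB, and totalEvents are hypothetical placeholders for two
"dimension=value" sketches and the total event count:

// support(A & B)     ~= count(A & B) / totalEvents
// confidence(A -> B) ~= count(A & B) / count(A)
Intersection inter = SetOperation.builder().buildIntersection();
inter.intersect(sketchA);
inter.intersect(sketchB);
double countAandB = inter.getResult().getEstimate();

double support = countAandB / totalEvents;
double confidence = countAandB / sketchA.getEstimate();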


> In this case, if you are counting trips, the resulting table would look
> like {source, estimate_usage_bins, city, zone, zone_demand_popularity,
> dayType, pickupHourOfDay, sla, booking_type, tripOutcome, sketch(tripId)}.
> You could then report on a variety of combinations by dropping dimensions
> and aggregating (merging) the sketches.
>

The best example of what I am trying to achieve is
https://tinyurl.com/2p8kwj3m .


>
>>    1. Next I run an Apriori like Frequent Itemset algorithm [ECLAT
>>    https://towardsdatascience.com/the-eclat-algorithm-8ae3276d2d17
>>    ] to generate a combination of "dimension=value" with a minimum support
>>    threshold. The input to this implementation takes the files above at point
>>    2. https://tinyurl.com/2p8kwj3m
>>    2. I produced frequent itemsets for both sides i.e tripOutcome=bad
>>    and tripOutcome=good and then join the results. The answers I get are
>>    listed below
>>       1. with input having dimension=value and theta sketch with 2^10
>>       nominal entries answers can be found here
>>       https://tinyurl.com/yc3a3rde
>>
>>       2. with input having dimension=value and theta sketch with 2^12
>>       nominal entries answers can be found here
>>       https://tinyurl.com/5n7jjr32
>>       3. with input having dimension=value and theta sketch with 2^14
>>       nominal entries answers can be found here
>>       https://tinyurl.com/3wrxp4a7
>>    3.  The results show that 2^14 (16K) nominal sketches give immaculate
>>    results. There are errors in the results from the other sketches (RMSE).
>>    4. Sample of results
>>       1.
>>
>>       *level of combination, rule for combination, bad count, good
>>       count, percentage of bad count*
>>       2.
>>
>>       4,booking_type=one_way_trip & city=Pune & sla=Scheduled &
>>       source=android,26.0,9.0,35.0,74.0
>>
>>       3,booking_type=one_way_trip & city=Pune &
>>       source=android,26.0,10.0,36.0,72.0
>>
>>       5,city=Delhi NCR & pickUpHourOfDay=VERYLATE & sla=Scheduled &
>>       source=ios & zone_demand_popularity=POPULARITY_INDEX_4,17.0,9.0,26.0,65.0
>>
>>       4,city=Delhi NCR & pickUpHourOfDay=VERYLATE & source=android &
>>       zone_demand_popularity=POPULARITY_INDEX_5,15.0,9.0,24.0,63.0
>>
>>       5,booking_type=one_way_trip & city=Delhi NCR &
>>       pickUpHourOfDay=VERYLATE & source=ios &
>>       zone_demand_popularity=POPULARITY_INDEX_4,16.0,10.0,26.0,62.0
>>
>>       3,booking_type=one_way_trip & sla=Scheduled &
>>       zone=205,14.0,9.0,23.0,61.0
>>
>> I am aware from one of Lee's recent YouTube videos that theta sketches
>> behave like sets (bit sets) when they are large enough and the data is
>> small. My use case is fine with a count of 100 being reported as 80 or 120
>> (20%) but not much higher. I could use the Jaccard formula explained by Lee
>> to get the margin of error.
>>
>>
>> Sorry for the details, but this should help you say whether what I am doing
>> is indeed what tuple sketches do on their own. I am using update sketches
>> and merging them by conjunction. Code is here https://tinyurl.com/4pzfsj6v
>>
>>
>>
>>
>>> A query would choose the desired coordinates of the 3 dimensions.
>>> If the query selects more than one coordinate from any one dimension,
>>> then you would need to first merge the corresponding coordinate sketches
>>> of that dimension together.
>>> A missing dimension would mean first merging all coordinate sketches of
>>> that dimension together.
>>> The final result sketch is an Intersection of the resulting 3
>>> dimensional sketches.
>>>
>>
>> Yes. The code here should serve as an example:
>> https://tinyurl.com/4pzfsj6v
>>
>>
>>
>>>
>>> With the resulting Tuple sketch you iterate over all the retained items
>>> (Summaries) and sum each of the 3 fields.
>>> Then you divide each sum by Theta.  The result is the estimate of
>>> Impressions, Clicks, and AdSpend
>>> for the population of users selected by the query.  The size of that
>>> population would be obtained by getEstimate().
>>>
>>>
>> My question remains. Which base sketch type should I use for an additive
>> metric like Ad-spend? Would that be a *quantile sketch*?
>>
>
> I still think this question doesn't make sense. What are you trying to
> understand with the sketch? Are you trying to find out the distribution of
> adspend for a particular set of dimension values? Then that sounds like a
> quantile sketch. Are you trying to find out how many different values of
> adspend there were? That's a distinct-counting sketch. Are you trying to
> find out the most common ad spend values? That would be a frequent items
> sketch.
>
> But really, none of those actually seem to match what you described above.
> What you described above sounded like you were simply trying to calculate
> the total AdSpend, which just requires sum, not a sketch.
>

In the big data world where one has millions of rows of event data, if we
could approximate impressions and clicks by having a sketch for every
dimension=value pair, we would make analytics very fast though approximate.




>
>>
>>
>>> Be aware that Intersections, by their nature, can increase error
>>> significantly. (see the Website).
>>> Try to avoid doing intersections with sketch coordinates that have very
>>> small populations, if possible.
>>>
>>
>> Yes. However if I use very high nominal entries like 2^20 shouldn't this
>> get mitigated?
>>
>
> The point of using a sketch is estimation. If you can't tolerate any
> error, you shouldn't be using sketches in the first place. Don't just
> assume that you SHOULD use a high nominal entries value. That said, when
> using theta sketches and doing intersections, we do sometimes use values
> like 2^20 for nominal entries, but often it is overkill. Remember, in the
> case of intersections, you get high error when the intersection size is
> small compared with the overall set sizes. The cases where you see high
> relative error are the cases where you are looking at small intersections.
> In most use cases, a relative error of 1,000% or even 10,000% on an
> intersection size of 10 when comparing sets of a billion is still a
> reasonable error given how small the _actual_ value is.
>

It's not a zero-or-one thing. I can tolerate a few frequent itemsets being
off by 5%, but not more than that. I have understood the LowerBound and
UpperBound mechanism explained by Lee.
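
Concretely, here is a minimal sketch of the check I have in mind, reusing the
result of an intersection as above; a total spread of 10% of the estimate
corresponds to roughly +/- 5% around it:

CompactSketch result = intersection.getResult();
double est = result.getEstimate();
double lb = result.getLowerBound(2);  // bounds at 2 standard deviations
double ub = result.getUpperBound(2);
boolean withinTolerance = est > 0 && (ub - lb) / est <= 0.10;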




Re: [E] Re: Using Quantile sketches for additive metrics

Posted by vijay rajan <vi...@gmail.com>.
Hi Will,

Thanks for your response. I will send my clarifications in a day or two.
Please do look at my detailed explanation and at the datasets and results
that I have shared; they should make clear what I am trying to do.

Essentially, an event_id is a UUID for an event. A click stream will have
an event id, as will an impression stream.


Just as background: when I was at Yahoo in 2014, I explained a problem
requiring an approximate distribution to Lee, who explained quantile
sketches.

I have used HLL and theta sketches for quite some time now. Changing the use
from distinct counts of user ids to treating the sketch as an "approximate
set" of event ids is what I am looking at (and asking about here).


Regarding ad spend, the idea is to use a sketch that captures the
distribution and supports approximations, including recovering an approximate
additive sum from the sketch.
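
A minimal sketch of what I mean, assuming the datasketches-java quantiles
classes, one DoublesSketch of ad-spend values per "dimension=value" (the
adSpendByDimValue map is a hypothetical name), and the estimateVolume method
from my earlier mail. A union of quantiles sketches concatenates their input
streams, so the estimated total stays additive across disjoint slices:

import org.apache.datasketches.quantiles.DoublesSketch;
import org.apache.datasketches.quantiles.DoublesUnion;

DoublesUnion union = DoublesUnion.builder().build();
union.update(adSpendByDimValue.get("Gender=F"));
union.update(adSpendByDimValue.get("Gender=M"));
DoublesSketch merged = union.getResult();
double approxTotalSpend = estimateVolume(merged);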

Happy to discuss further on a call.


Vijay


On Mon, May 2, 2022, 22:10 Will Lauer <wl...@yahooinc.com> wrote:

> OK, this is interesting. I've got some concerns and questions that I've
> put inline below.
>
> Will
>
>
>
> Will Lauer
>
>
> Senior Principal Architect, Audience & Advertising Reporting
>
> Data Platforms & Systems Engineering
>
>
> M 508 561 6427
>
> Champaign Office
>
> 1908 S. First St
>
> Champaign, IL 61822
>
>
>
> On Mon, May 2, 2022 at 9:19 AM vijay rajan <vi...@gmail.com>
> wrote:
>
>> Thanks Lee. Please find my answers inline below in blue. I think you
>> will find my use case very interesting. My next endeavor would be to make a
>> decision tree with entropy / gini impurity measures with sketches. I am
>> amazed at some of the results I have gotten. You may find this quite
>> interesting.
>>
>> On Sun, May 1, 2022 at 12:55 AM leerho <le...@gmail.com> wrote:
>>
>>> Vijay,
>>> Sorry about the delay in getting back to you.
>>>
>>> There is some critical information missing from your description and
>>> that is the domain of what you are sketching.
>>> I presume that it is User-IDs, otherwise it doesn't make sense.
>>> If this is the case I think the solution can be achieved in a couple of
>>> ways.  Going forward with this assumption:
>>>
>>
>>> Your raw data would look something like this:
>>>
>>>> {UserID, AgeGroup, Gender, Country, Impressions, Clicks, AdSpend}
>>>
>>>
>>>
>> Actually it would be an event_id instead of user_Id. The domain for the
>> data is a hypothetical online advertising summary.
>> {EventId, AgeGroup, Gender, Country, Impressions, Clicks, AdSpend}
>>
>> The problem that we faced in the past was that it was impossible to
>> "join" a click to an impression. While the raw impression data would have a
>> unique event_id the raw click data would also have its own event_id. The
>> dimensions were the same for clicks and impressions. What we did back in
>> thge day was aggregate the data for impressions separately and then
>> aggregate for clicks separately and then join.
>>
>> You could assume the same here. Each dimension aggregate tuple {AgeGroup,
>> Gender and Country in out case} would have a theta sketch for impressions
>> and another for clicks.
>>
>
> I don't see what a "theta sketch for impressions and another for clicks"
> would mean here. Theta sketches are for doing distinct counts, so I could
> understand a theta sketch for EventId, counting the number of distinct
> event ids associated with the AgeGroup/Gender/Country combination. But that
> doesn't seem to make sense for clicks or impressions. If your original data
> was {EventId, AgeGroup, Gender, Country, Impressions, Clicks, AdSpend}, it
> implies to me that clicks and impressions would always be 1 (that is, an
> event is either a single click or single impression). When you drop the
> EventId dimension, the natural aggregation for these (as well as for
> AdSpend) is summation, giving you the total clicks, total impressions (and
> total AdSpend) for the Age/Gender/Country combination. If you were to
> generate a Theta sketch, it seems like it would be non-sensical, as you all
> impressions or clicks inserted would be the same value and the theta sketch
> would always return "1".
>
>>
>>
>>
>>> A proposed solution would be to configure a TupleSketch with 3 fields:  (Impressions,
>>> Clicks, AdSpend).
>>> (This requires a little programming.  You would need to create classes
>>> that implement the interfaces Summary,
>>> SummaryFactory and SummarySetOperations.  There are examples of how to
>>> do this on the website.)
>>>
>>>
>> I will look into Tuple sketch and the SummaryFactory and Summary Set
>> operations. I did try reading it but I cannot say I understood this
>> completely. What type of sketch would we use for AdSpend? Would it be
>> Quantile?
>>
>
>>
>>
>>> Then you would process your raw data as follows:
>>>
>>>    - Partition your raw data into 213 streams defined by the three
>>>    dimensions:
>>>       - 10 X AgeGroup
>>>       - 3 X Gender
>>>       - 200 X Country
>>>    - Each stream would feed a configured TupleSketch as above. (so you
>>>    need only 213 sketches).
>>>    - These sketches are in the form of a 3D hypercube.
>>>    - Each input tuple would feed 3 sketches, one in each dimension.
>>>
>>>
>> I believe I did something similar. Let me try explaining this with
>> concrete examples
>>
>>    1. This is the raw data
>>    <https://urldefense.com/v3/__https://drive.google.com/file/d/1zcwZB9KC7cASmM_8mvbPHPQaIeFJNrVw/view?usp=sharing__;!!Op6eflyXZCqGR5I!HkafhezcasvOAcC_LolHeVdS4lQ6RK0hiMgY3wnIiCdOcLxN0a6Xl0vEMxfPXHDWKWlP4zU41o3xAojDzS2zhJlequTMOw$>(please
>>    click on the link https://tinyurl.com/3d2r7xjh
>>    <https://urldefense.com/v3/__https://tinyurl.com/3d2r7xjh__;!!Op6eflyXZCqGR5I!HkafhezcasvOAcC_LolHeVdS4lQ6RK0hiMgY3wnIiCdOcLxN0a6Xl0vEMxfPXHDWKWlP4zU41o3xAojDzS2zhJmz6G7UcA$>
>>    ) for a company in India that provides car drivers for hire. The data has
>>    most columns as categorical variables except the first which is tripId. The
>>    schema looks something like this
>>       1. { *tripId*,source,estimated_usage_bins,city,zone,zone_demand_popularity,dayType,pickUpHourOfDay,sla,booking_type,
>>       *tripOutcome* }
>>       2. tripId the first column, is distinct per row which I put into
>>       theta sketches as shown later
>>       3. tripOutcome the last column, is a classification variable which
>>       has values "bad or good"
>>       4. The problem statement is to dod data analysis of combinations
>>       of dimensions that produce an unfavourable result that is
>>       "tripOutcome=bad/(tripOutcome=bad + tripOutcome=good)" is high
>>    2. I was able to populate theta sketches for each dimension and value
>>    pair (example "city=Mumbai")  and store them as my effective data
>>    representation.
>>       1. with 10 as base and hence nominal entries as 2^10 the data is
>>       https://tinyurl.com/bkvz4xrh
>>       <https://urldefense.com/v3/__https://tinyurl.com/bkvz4xrh__;!!Op6eflyXZCqGR5I!HkafhezcasvOAcC_LolHeVdS4lQ6RK0hiMgY3wnIiCdOcLxN0a6Xl0vEMxfPXHDWKWlP4zU41o3xAojDzS2zhJmA8ikIbg$>
>>
>>       2. with 12 as base and hence nominal entries as 2^12 the data is
>>       https://tinyurl.com/f9mzm8ef
>>       <https://urldefense.com/v3/__https://tinyurl.com/f9mzm8ef__;!!Op6eflyXZCqGR5I!HkafhezcasvOAcC_LolHeVdS4lQ6RK0hiMgY3wnIiCdOcLxN0a6Xl0vEMxfPXHDWKWlP4zU41o3xAojDzS2zhJm3OdhKSA$>
>>
>>       3. with 14 as base and hence nominal entries as 2^14 the data is
>>       https://tinyurl.com/2p8kwj3m
>>       <https://urldefense.com/v3/__https://tinyurl.com/2p8kwj3m__;!!Op6eflyXZCqGR5I!HkafhezcasvOAcC_LolHeVdS4lQ6RK0hiMgY3wnIiCdOcLxN0a6Xl0vEMxfPXHDWKWlP4zU41o3xAojDzS2zhJkD1YuIMA$>
>>       4. The data looks like this
>>          1. dimension=value, Base64 encoded Theta sketch formed from
>>          unique tripId
>>
>> I'm not sure what you are sketching here. What is the entry being
> inserted into the sketch? Basically, what are you counting? Normally, I
> would think of sketching as taking your original data (  tripId,source,estimated_usage_bins,city,zone,zone_demand_popularity,dayType,pickUpHourOfDay,sla,booking_type,
> tripOutcome ) and picking a column (or columns) that you want to count
> the distinct values of and generating a new table that contains the
> dimensions you care about plus that new sketch. In this case, if you are
> counting trips, the resulting table would look like {source,
> estimate_usage_bins, city, zone, zone_demand_popularity, dayType,
> pickupHourOfDay, sla, booking_type, tripOutcome, sketch(tripId)}. You could
> then report on a variety of combinations by dropping dimensions and
> aggregating (merging) the sketches.
>
>>
>>    1. Next I run an Apriori like Frequent Itemset algorithm [ECLAT
>>    https://towardsdatascience.com/the-eclat-algorithm-8ae3276d2d17
>>    <https://urldefense.com/v3/__https://towardsdatascience.com/the-eclat-algorithm-8ae3276d2d17__;!!Op6eflyXZCqGR5I!HkafhezcasvOAcC_LolHeVdS4lQ6RK0hiMgY3wnIiCdOcLxN0a6Xl0vEMxfPXHDWKWlP4zU41o3xAojDzS2zhJnr167ghw$>
>>    ] to generate a combination of "dimension=value" with a minimum support
>>    threshold. The input to this implementation takes the files above at point
>>    2. https://tinyurl.com/2p8kwj3m
>>    <https://urldefense.com/v3/__https://tinyurl.com/2p8kwj3m__;!!Op6eflyXZCqGR5I!HkafhezcasvOAcC_LolHeVdS4lQ6RK0hiMgY3wnIiCdOcLxN0a6Xl0vEMxfPXHDWKWlP4zU41o3xAojDzS2zhJkD1YuIMA$>
>>    2. I produced frequent itemsets for both sides, i.e. tripOutcome=bad
>>    and tripOutcome=good, and then joined the results. The answers I get
>>    are listed below:
>>       1. with dimension=value input and theta sketches with 2^10 nominal
>>       entries, the answers are here: https://tinyurl.com/yc3a3rde
>>       2. with dimension=value input and theta sketches with 2^12 nominal
>>       entries, the answers are here: https://tinyurl.com/5n7jjr32
>>       3. with dimension=value input and theta sketches with 2^14 nominal
>>       entries, the answers are here: https://tinyurl.com/3wrxp4a7
>>    3. The results show that sketches with 2^14 (16K) nominal entries give
>>    essentially exact results; the smaller sketches show measurable error
>>    (measured as RMSE).
>>    4. Sample of results. Columns are: level of combination, rule for
>>       combination, bad count, good count, total count, percentage of
>>       bad count.
>>
>>       4,booking_type=one_way_trip & city=Pune & sla=Scheduled &
>>       source=android,26.0,9.0,35.0,74.0
>>
>>       3,booking_type=one_way_trip & city=Pune &
>>       source=android,26.0,10.0,36.0,72.0
>>
>>       5,city=Delhi NCR & pickUpHourOfDay=VERYLATE & sla=Scheduled &
>>       source=ios & zone_demand_popularity=POPULARITY_INDEX_4,17.0,9.0,26.0,65.0
>>
>>       4,city=Delhi NCR & pickUpHourOfDay=VERYLATE & source=android &
>>       zone_demand_popularity=POPULARITY_INDEX_5,15.0,9.0,24.0,63.0
>>
>>       5,booking_type=one_way_trip & city=Delhi NCR &
>>       pickUpHourOfDay=VERYLATE & source=ios &
>>       zone_demand_popularity=POPULARITY_INDEX_4,16.0,10.0,26.0,62.0
>>
>>       3,booking_type=one_way_trip & sla=Scheduled &
>>       zone=205,14.0,9.0,23.0,61.0
>>
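>> The support computation itself is just a theta intersection. As a minimal
>> sketch (sketchFor() is a hypothetical lookup into the dimension=value
>> sketch table):
>>
>>   import org.apache.datasketches.theta.CompactSketch;
>>   import org.apache.datasketches.theta.Intersection;
>>   import org.apache.datasketches.theta.SetOperation;
>>
>>   // Support of the 2-itemset {city=Pune, source=android} is the
>>   // estimated cardinality of the intersection of the two item sketches.
>>   Intersection support = SetOperation.builder().buildIntersection();
>>   support.intersect(sketchFor("city=Pune"));
>>   support.intersect(sketchFor("source=android"));
>>   CompactSketch result = support.getResult();
>>   double supportCount = result.getEstimate();
>>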
>> I am aware from one of Lee's recent YouTube videos that theta sketches
>> behave like exact sets (bit sets) when the nominal entries are large
>> relative to the data. My use case is fine with a count of 100 being
>> reported as 80 or 120 (20% error), but not much higher. I could use the
>> Jaccard approach explained by Lee to get the margin of error.
>>
>>
>> Sorry for the details, but this should help you judge whether what I am
>> doing is indeed what tuple sketches do on their own. I am using update
>> sketches and merging via intersection (conjunction). Code is here:
>> https://tinyurl.com/4pzfsj6v
>>
>>
>>
>>
>>> A query would choose the desired coordinates of the 3 dimensions.
>>> If the query selects more than one coordinate from any one dimension,
>>> then you would need to first merge the corresponding coordinate sketches
>>> of that dimension together.
>>> A missing dimension would mean first merging all coordinate sketches of
>>> that dimension together.
>>> The final result sketch is an Intersection of the resulting 3
>>> dimensional sketches.
>>>
>>
>> Yes, the code here should serve as an example:
>> https://tinyurl.com/4pzfsj6v
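>>
>> The gist of that query flow, as a minimal sketch against the Java theta
>> API (sketchFor() is a hypothetical lookup into my dimension=value sketch
>> table, and the coordinate names are illustrative):
>>
>>   import org.apache.datasketches.theta.CompactSketch;
>>   import org.apache.datasketches.theta.Intersection;
>>   import org.apache.datasketches.theta.SetOperation;
>>   import org.apache.datasketches.theta.Union;
>>
>>   // Dimension "Gender": the query selects two coordinates, so union them.
>>   Union gender = SetOperation.builder().buildUnion();
>>   gender.union(sketchFor("Gender=F"));
>>   gender.union(sketchFor("Gender=M"));
>>
>>   // Then intersect the per-dimension results across dimensions.
>>   Intersection query = SetOperation.builder().buildIntersection();
>>   query.intersect(gender.getResult());
>>   query.intersect(sketchFor("Country=IN"));
>>   CompactSketch answer = query.getResult();
>>   double estimate = answer.getEstimate();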
>>
>>
>>
>>>
>>> With the resulting Tuple sketch you iterate over all the retained items
>>> (Summaries) and sum each of the 3 fields.
>>> Then you divide each sum by Theta.  The result is the estimate of
>>> Impressions, Clicks, and AdSpend
>>> for the population of users selected by the query.  The size of that
>>> population would be obtained by getEstimate().
>>>
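>>> In code, that last step might look roughly like this (a minimal sketch;
>>> ImpClickSpendSummary and its getters are hypothetical stand-ins for the
>>> Summary class you would implement, and "result" is the tuple sketch
>>> produced by the query):
>>>
>>>   import org.apache.datasketches.tuple.Sketch;
>>>   import org.apache.datasketches.tuple.SketchIterator;
>>>
>>>   // Sketch<ImpClickSpendSummary> result = ... (from the query above)
>>>   double imps = 0, clicks = 0, spend = 0;
>>>   SketchIterator<ImpClickSpendSummary> it = result.iterator();
>>>   while (it.next()) {
>>>     ImpClickSpendSummary s = it.getSummary();
>>>     imps += s.getImpressions();
>>>     clicks += s.getClicks();
>>>     spend += s.getAdSpend();
>>>   }
>>>   // Divide each sum by theta to estimate the full-population totals.
>>>   double theta = result.getTheta();
>>>   double estImpressions = imps / theta;
>>>   double estClicks = clicks / theta;
>>>   double estAdSpend = spend / theta;
>>>   double estUsers = result.getEstimate();  // selected population size
>>>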
>>>
>> My question remains. Which base sketch type should I use for an additive
>> metric like Ad-spend? Would that be a *quantile sketch*?
>>
>
> I still think this question doesn't make sense. What are you trying to
> understand with the sketch? Are you trying to find out the distribution of
> adspend for a particular set of dimension values? Then that sounds like a
> quantile sketch. Are you trying to find out how many different values of
> adspend there were? That's a distinct-counting sketch. Are you trying to
> find out the most common adspend values? That would be a frequent items
> sketch.
>
> But really, none of those actually seems to match what you described
> above. What you described sounded like you were simply trying to calculate
> the total AdSpend, which just requires a sum, not a sketch.
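>
> If it is the distribution you are after, a minimal sketch using the KLL
> quantiles sketch from the same library would look like this (adSpendValues
> is a hypothetical collection of per-event spend values):
>
>   import org.apache.datasketches.kll.KllDoublesSketch;
>
>   // Estimates the distribution of AdSpend, not its total.
>   KllDoublesSketch spend = KllDoublesSketch.newHeapInstance();
>   for (double v : adSpendValues) {
>     spend.update(v);
>   }
>   double median = spend.getQuantile(0.5);
>   double p99 = spend.getQuantile(0.99);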
>
>>
>>
>>
>>> Be aware that Intersections, by their nature, can increase error
>>> significantly (see the website).
>>> Try to avoid doing intersections with sketch coordinates that have very
>>> small populations, if possible.
>>>
>>
>> Yes. However, if I use a very high nominal entries setting, like 2^20,
>> shouldn't this be mitigated?
>>
>
> The point of using a sketch is estimation. If you can't tolerate any
> error, you shouldn't be using sketches in the first place. Don't just
> assume that you SHOULD use high nominal entries. That said, when using
> theta sketches and doing intersections, we do sometimes use values like
> 2^20 for nominal entries, but often it is overkill. Remember, in the case
> of intersections, you get high error when the intersection size is small
> compared with the overall set sizes. The cases where you see high relative
> error are the cases where you are looking at small intersections. In most
> use cases, a relative error of 1,000% or even 10,000% on an intersection
> size of 10 when comparing sets of a billion is still a reasonable error
> given how small the _actual_ value is.
>
>>
>>
>>
>>> You can get a sense of how good your results are by utilizing
>>> getUpperBound() and getLowerBound() (with respect to the user
>>> population). The bigger that spread is, the worse your estimates will be.
>>>
>> Thanks. This is perfect. I do not need to use the Jaccard approach to get
>> my estimate.
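>>
>> For example (a minimal sketch; "answer" is the intersection result from
>> the query code above):
>>
>>   // 2 standard deviations is roughly a 95% confidence interval.
>>   double est = answer.getEstimate();
>>   double lb = answer.getLowerBound(2);
>>   double ub = answer.getUpperBound(2);
>>   // A wide spread relative to the estimate means the intersection
>>   // result should not be trusted very far.
>>   double relSpread = (ub - lb) / Math.max(est, 1.0);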
>>
>>
>>
>>> The other option would be to do the cross-product upfront and create
>>> 6000 Tuple sketches.
>>> In this case the input tuple would feed only one sketch and the query
>>> would be to only one sketch.
>>>
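>>> The cross-product layout might look roughly like this (a minimal sketch;
>>> the key format is just illustrative):
>>>
>>>   import java.util.HashMap;
>>>   import java.util.Map;
>>>   import org.apache.datasketches.theta.UpdateSketch;
>>>
>>>   // One sketch per (AgeGroup, Gender, Country) cell; each input tuple
>>>   // updates exactly one sketch, and a point query reads exactly one.
>>>   Map<String, UpdateSketch> cells = new HashMap<>();
>>>   String key = ageGroup + "|" + gender + "|" + country;
>>>   cells.computeIfAbsent(key, k -> UpdateSketch.builder().build())
>>>        .update(userId);
>>>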
>>
>> Since I am not using SQL queries and have my own unique use case, my
>> questions are quite different.
>>
>>
>>> This can quickly get unwieldy with more dimensions or high cardinality
>>> dimensions.
>>> It is more accurate because it avoids the intersections.  It is also
>>> possible to do a mix of
>>> these two approaches, but you would need to think it through carefully.
>>>
>>> I hope this helps.
>>>
>>> Lee.
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Thu, Apr 28, 2022 at 1:35 AM vijay rajan <
>>> vijay.sankar.rajan@gmail.com> wrote:
>>>
>>>> Hi,
>>>> Just like theta sketches are used for distinct count metrics like
>>>> impressions and clicks, is there a sketch (perhaps quantile?) that can be
>>>> used for metrics like ad_spend? If so, what are the error bounds?
>>>>
>>>> There is a big opportunity that I see in storing very little data in
>>>> sketches (which I use as sets), which makes retrieval of
>>>> aggregate/analytic data very fast (although approximate).
>>>>
>>>> This question is best explained with an example. Say I have a fact
>>>> table schema as follows
>>>>
>>>>
>>>>    1. *Age_Groups* is a dimension with 10 bucketed distinct values
>>>>    {less than 18, 19-25, 26-35....}
>>>>    2. *Gender* is a dimension with 3 distinct values F, M & Unknown
>>>>    3. *Country* is a  dimension with 200 possible values
>>>>    4. *Impressions* is a metric which is a long count (perhaps with
>>>>    Theta sketch or HLL)
>>>>    5. *Clicks* is a metric which is a long count (perhaps with Theta
>>>>    sketch or HLL)
>>>>    6. *Ad-Spend* is a metric which is a double which is a sum (*perhaps
>>>>    with quantile sketch??*)
>>>>
>>>>
>>>> The maximum possible number of entries in this table would be the cross
>>>> product of the cardinalities of the dimensions, which is 10 (Age_Group) x
>>>> 3 (Gender) x 200 (Country) = 6000. Now instead of storing 6000 records, I
>>>> could store only (10 + 3 + 200) x 3 sketches (one each for impressions,
>>>> clicks and Ad-Spend) = 639 sketches, and accomplish the group-by queries
>>>> using the set operations that sketches provide.
>>>>
>>>> For the impression metric: one sketch for Gender=F, another for
>>>> Gender=M, yet another for Gender=Unknown, and so on for the other
>>>> dimensions as well.
>>>> For the click metric: likewise, one sketch per dimension value.
>>>> The question is: what sketch would I need for ad-spend?
>>>>
>>>> On a side note, I have used Theta sketches in this manner and even
>>>> implemented ECLAT for frequent itemset computation and association rule
>>>> mining ... an example from another project is below, with theta sketches
>>>> used for counts. I do not know how to do the same for a metric like
>>>> Ad-Spend.
>>>>
>>>> Level Conjunction Count
>>>> 2 H10 & MemberGender=F 74
>>>> 2 M15 & MemberGender=F 83
>>>> 2 G31 & MemberGender=F 59
>>>> 2 MemberGender=F & R13 66
>>>> In the example above, H10, M15, etc. are ICD (International
>>>> Classification of Diseases) codes for specific diseases:
>>>> https://www.aapc.com/codes/code-search/
>>>>
>>>> Hope this is a clear representation of the problem.
>>>>
>>>> Regards
>>>> Vijay
>>>>
>>>>
>>>>
>>>>
>>>>