You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@datasketches.apache.org by Will Lauer <wl...@verizonmedia.com.INVALID> on 2021/05/20 14:09:28 UTC

Re: [E] Re: Proposal to get the NDV of a Range query through KMV

If you want to do something like this, you should be looking at tuple
sketches. Tuple sketches use the same KVM algorithm as Theta sketches, but
allow you to associate extra data with each entry. You can then apply
operations like filtering based on the tuple value, getting the estimate
for only those values that satisfy your tuple predicate. We use tuple
sketches to perform a very similar calculation to the one you are
describing.

Will

<http://www.verizonmedia.com>

Will Lauer

Senior Principal Architect, Audience & Advertising Reporting
Data Platforms & Systems Engineering

M 508 561 6427
1908 S. First St
Champaign, IL 61822

<http://www.facebook.com/verizonmedia>   <http://twitter.com/verizonmedia>
<https://www.linkedin.com/company/verizon-media/>
<http://www.instagram.com/verizonmedia>



On Thu, May 20, 2021 at 6:27 AM yiifeger wu <va...@gmail.com> wrote:

> Hi Lee,
>    Thanks for your reply at first.
>    As for what I proposed, let's take a look for a simple example
>
>    Question: get the distinct value of that satisfy (vi > 5 && vi < 7) in
> the below array
>    Input: V: [10,8,7,7,1,4,5,6,8,9]
>
>    First, if we wanna get the distinct value for the whole data through
> KMV, we should get the hash value for each value in the array V and keep
> the smallest K hashes。
>    In this example， let K =3。
>    V:           [10 , 8   ,  7 ,  7 , 1   , 4   , 5    ,6    ,8   ,9 ]
>    hash:         [0.2, 0.6, 0.3, 0.3, 0.25, 0.34, 0.21 ,0.21 ,0.23,0.12 ]
>    the smallest 3 hash : [0.12, 0.2, 0.21]
>    (Of course, the real hash function will be more random and evenly
> distributed)
>
>    Secondly, we estimate the distinct value for the whole data,
>     NDV = (3-1)/0.21 ~= 9.5
>
>    Then，if we wanna get the distinct value of that satisfy (vi > 5 && vi <
> 7) , we can realize it through storing extra 3 origin value for each
> smallest hash：
>     smallest 3 hash:   [0.12, 0.2, 0.21]
>     smallest 3 values: [9   , 10 , 6]
>     the value satisfy (vi > 5 && vi < 7) is [6]
>
>    At last, we can estimate the distinct value of that (vi > 5 && vi < 7):
>     NDV(vi > 5 && vi < 7) = 1/3 * 9.5 ~= 3.16
>   in which 1.3 represents the fraction of the smallest 3 hash, that
> satisfy (vi > 5 && vi < 7) .
>
>
> leerho <le...@gmail.com> 于2021年5月20日周四 上午7:50写道：
>
>> Yiifeger,
>> I'm sorry, but I'm having difficulty in understanding your question or
>> what you are trying to do.
>>
>> In our library, all of our Theta Sketches are derivatives of the KMV
>> sketch as explained on our website.  And, on our website under *Sketch
>> Families / Distinct Counting / Theta Sketches / Theta Sketch Theory* you
>> will find several documents that explain the math behind theta sketches. Of
>> particular relevance would be the original published paper on the Theta
>> Sketch Framework
>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__arxiv.org_abs_1510.01455v2&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=vGHo2vqhE2ZeS_hHdb4Y3eoJ4WjVKhEg5Xld1w9ptEQ&m=DWIYW6tKNxOsHfyk_oyTGrxFkinjW3XwCwQTdSCGc7A&s=0-Tqpi8xjR_17PMhGfORmBlIQZ70P5jrjd8F2ENuFP8&e=>.
>> A simplified development of the math can be found in the short paper Theta
>> Sketch Equations
>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_datasketches-2Dwebsite_blob_master_docs_pdf_ThetaSketchEquations.pdf&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=vGHo2vqhE2ZeS_hHdb4Y3eoJ4WjVKhEg5Xld1w9ptEQ&m=DWIYW6tKNxOsHfyk_oyTGrxFkinjW3XwCwQTdSCGc7A&s=WSAs5sqgluz5ZyA5pEe4doacA9Dcoc0BPD8aRgdhkmY&e=>,
>> also from the website.
>>
>> I hope this helps,
>>
>> Lee.
>>
>> On Wed, May 19, 2021 at 9:39 AM yiifeger wu <va...@gmail.com> wrote:
>>
>>> Hi all,
>>>      I recently learned about the DataSketch project that is so
>>> brilliant, but questions occurred when prepared to utilize it.
>>>      I want to get the count of distinct values for a range query in my
>>> project. After some study about the KMV algorithm according to the
>>> introduction in DataSketch project, we propose an adjusted KMV
>>> algorithm to solve it.
>>>       In origin KMV, it only stores K  hash_values and then computes the
>>> NDV through the average density. So what if we store extra origin values
>>> for which hash_value contained by the k -Minimum hash_values ?  So we can
>>> estimate the distinct value of the range query through
>>>
>>>>           *  ndv_in_the_range = ( ndv_in_range_for_k_minimum / k)  *
>>>> total_ndv*
>>>
>>>
>>>     So if the idea works and the Sketch does not  implement it, could
>>> you give some advice
>>> on how to implement it in this project (P.s prefer the java version).
>>>      Thanks for your help in advance!
>>>
>>>
>
> --
> Best Regards
> Yifei Wu
>

Re: [E] Re: Proposal to get the NDV of a Range query through KMV

Posted by yiifeger wu <va...@gmail.com>.

Thanks for your prompt reply and it helps a lot !

leerho <le...@gmail.com> 于2021年5月21日周五 上午3:17写道：

> OK, I stand corrected!  Will and Jon are correct.  This can be solved
> using the tuple sketch.  The resulting RSE, however, will be similar to the
> RSE obtained from an intersection.
>
> Lee.
>
>
> On Thu, May 20, 2021 at 11:04 AM leerho <le...@gmail.com> wrote:
>
>> If I understand the use case correctly, perhaps one way to handle this is
>> to filter the values prior to feeding them into a sketch.  If I have
>> millions of input values between 0.0 and 10.0 and only wish to uniquely
>> count the values > 5 and < 7, then do this:
>>
>> if (vi > 5 && vi < 7) { sketch.update(vi); }
>>
>>
>> There is no ordering or distance relationship between the hash of 5 and
>> the hash of 7, nor can one assume that the number of hash values between
>> the hash of 5 and the hash of 7 has any relationship to the number of
>> unique input values between 5 and 7.  So any attempt to perform filtering
>> on the hash values themselves will fail miserably.  And as Jon commented,
>> we strongly discourage users attempting to manipulate the hash values.
>>
>> I think the use-case that Will mentioned using Tuple Sketches is similar
>> to the Engagement Example
>> <https://datasketches.apache.org/docs/Tuple/TupleEngagementExample.html> (Will,
>> correct me if I'm wrong! ).  This case assumes that the input domain can be
>> neatly divided into a small finite number of regions (in the example, 30
>> days).  It is not clear that Yiifeger's use-case qualifies.  I think what
>> he wants is to be able to sketch the entire domain of continuous
>> numbers between 0.0 and 10.0, and then after the sketch is loaded, somehow
>> apply some query to the sketch to obtain the number of unique values
>> between (pi() and 2*pi()).  I don't think a Tuple sketch would work here,
>> nor can I think of a general solution to this problem.
>>
>> Lee.
>>
>> On Thu, May 20, 2021 at 9:55 AM Jon Malkin <jm...@apache.org> wrote:
>>
>>> Echoing what Will said, you should use a tuple sketch to store the raw
>>> values. You want to avoid any situation where you're storing hash values
>>> yourself.
>>>
>>>   jon
>>>
>>> On Thu, May 20, 2021 at 7:09 AM Will Lauer
>>> <wl...@verizonmedia.com.invalid> wrote:
>>>
>>>> If you want to do something like this, you should be looking at tuple
>>>> sketches. Tuple sketches use the same KVM algorithm as Theta sketches, but
>>>> allow you to associate extra data with each entry. You can then apply
>>>> operations like filtering based on the tuple value, getting the estimate
>>>> for only those values that satisfy your tuple predicate. We use tuple
>>>> sketches to perform a very similar calculation to the one you are
>>>> describing.
>>>>
>>>> Will
>>>>
>>>> <http://www.verizonmedia.com>
>>>>
>>>> Will Lauer
>>>>
>>>> Senior Principal Architect, Audience & Advertising Reporting
>>>> Data Platforms & Systems Engineering
>>>>
>>>> M 508 561 6427
>>>> 1908 S. First St
>>>> Champaign, IL 61822
>>>>
>>>> <http://www.facebook.com/verizonmedia>
>>>> <http://twitter.com/verizonmedia>
>>>> <https://www.linkedin.com/company/verizon-media/>
>>>> <http://www.instagram.com/verizonmedia>
>>>>
>>>>
>>>>
>>>> On Thu, May 20, 2021 at 6:27 AM yiifeger wu <va...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Lee,
>>>>>    Thanks for your reply at first.
>>>>>    As for what I proposed, let's take a look for a simple example
>>>>>
>>>>>    Question: get the distinct value of that satisfy (vi > 5 && vi < 7)
>>>>> in the below array
>>>>>    Input: V: [10,8,7,7,1,4,5,6,8,9]
>>>>>
>>>>>    First, if we wanna get the distinct value for the whole data
>>>>> through KMV, we should get the hash value for each value in the array V and
>>>>> keep the smallest K hashes。
>>>>>    In this example， let K =3。
>>>>>    V:           [10 , 8   ,  7 ,  7 , 1   , 4   , 5    ,6    ,8   ,9 ]
>>>>>    hash:         [0.2, 0.6, 0.3, 0.3, 0.25, 0.34, 0.21 ,0.21
>>>>> ,0.23,0.12 ]
>>>>>    the smallest 3 hash : [0.12, 0.2, 0.21]
>>>>>    (Of course, the real hash function will be more random and evenly
>>>>> distributed)
>>>>>
>>>>>    Secondly, we estimate the distinct value for the whole data,
>>>>>     NDV = (3-1)/0.21 ~= 9.5
>>>>>
>>>>>    Then，if we wanna get the distinct value of that satisfy (vi > 5 &&
>>>>> vi < 7) , we can realize it through storing extra 3 origin value for each
>>>>> smallest hash：
>>>>>     smallest 3 hash:   [0.12, 0.2, 0.21]
>>>>>     smallest 3 values: [9   , 10 , 6]
>>>>>     the value satisfy (vi > 5 && vi < 7) is [6]
>>>>>
>>>>>    At last, we can estimate the distinct value of that (vi > 5 && vi <
>>>>> 7):
>>>>>     NDV(vi > 5 && vi < 7) = 1/3 * 9.5 ~= 3.16
>>>>>   in which 1.3 represents the fraction of the smallest 3 hash, that
>>>>> satisfy (vi > 5 && vi < 7) .
>>>>>
>>>>>
>>>>> leerho <le...@gmail.com> 于2021年5月20日周四 上午7:50写道：
>>>>>
>>>>>> Yiifeger,
>>>>>> I'm sorry, but I'm having difficulty in understanding your question
>>>>>> or what you are trying to do.
>>>>>>
>>>>>> In our library, all of our Theta Sketches are derivatives of the KMV
>>>>>> sketch as explained on our website.  And, on our website under *Sketch
>>>>>> Families / Distinct Counting / Theta Sketches / Theta Sketch Theory*
>>>>>> you will find several documents that explain the math behind theta
>>>>>> sketches. Of particular relevance would be the original published paper on
>>>>>> the Theta Sketch Framework
>>>>>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__arxiv.org_abs_1510.01455v2&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=vGHo2vqhE2ZeS_hHdb4Y3eoJ4WjVKhEg5Xld1w9ptEQ&m=DWIYW6tKNxOsHfyk_oyTGrxFkinjW3XwCwQTdSCGc7A&s=0-Tqpi8xjR_17PMhGfORmBlIQZ70P5jrjd8F2ENuFP8&e=>.
>>>>>> A simplified development of the math can be found in the short paper Theta
>>>>>> Sketch Equations
>>>>>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_datasketches-2Dwebsite_blob_master_docs_pdf_ThetaSketchEquations.pdf&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=vGHo2vqhE2ZeS_hHdb4Y3eoJ4WjVKhEg5Xld1w9ptEQ&m=DWIYW6tKNxOsHfyk_oyTGrxFkinjW3XwCwQTdSCGc7A&s=WSAs5sqgluz5ZyA5pEe4doacA9Dcoc0BPD8aRgdhkmY&e=>,
>>>>>> also from the website.
>>>>>>
>>>>>> I hope this helps,
>>>>>>
>>>>>> Lee.
>>>>>>
>>>>>> On Wed, May 19, 2021 at 9:39 AM yiifeger wu <va...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>      I recently learned about the DataSketch project that is so
>>>>>>> brilliant, but questions occurred when prepared to utilize it.
>>>>>>>      I want to get the count of distinct values for a range query
>>>>>>> in my project. After some study about the KMV algorithm according to the
>>>>>>> introduction in DataSketch project, we propose an adjusted KMV
>>>>>>> algorithm to solve it.
>>>>>>>       In origin KMV, it only stores K  hash_values and then computes
>>>>>>> the NDV through the average density. So what if we store extra origin
>>>>>>> values for which hash_value contained by the k -Minimum hash_values ?  So
>>>>>>> we can estimate the distinct value of the range query through
>>>>>>>
>>>>>>>>           *  ndv_in_the_range = ( ndv_in_range_for_k_minimum / k)
>>>>>>>> *  total_ndv*
>>>>>>>
>>>>>>>
>>>>>>>     So if the idea works and the Sketch does not  implement it,
>>>>>>> could you give some advice
>>>>>>> on how to implement it in this project (P.s prefer the java version).
>>>>>>>      Thanks for your help in advance!
>>>>>>>
>>>>>>>
>>>>>
>>>>> --
>>>>> Best Regards
>>>>> Yifei Wu
>>>>>
>>>>

-- 
Best Regards
Yifei Wu

Re: [E] Re: Proposal to get the NDV of a Range query through KMV

Posted by leerho <le...@gmail.com>.

OK, I stand corrected!  Will and Jon are correct.  This can be solved using
the tuple sketch.  The resulting RSE, however, will be similar to the RSE
obtained from an intersection.

Lee.


On Thu, May 20, 2021 at 11:04 AM leerho <le...@gmail.com> wrote:

> If I understand the use case correctly, perhaps one way to handle this is
> to filter the values prior to feeding them into a sketch.  If I have
> millions of input values between 0.0 and 10.0 and only wish to uniquely
> count the values > 5 and < 7, then do this:
>
> if (vi > 5 && vi < 7) { sketch.update(vi); }
>
>
> There is no ordering or distance relationship between the hash of 5 and
> the hash of 7, nor can one assume that the number of hash values between
> the hash of 5 and the hash of 7 has any relationship to the number of
> unique input values between 5 and 7.  So any attempt to perform filtering
> on the hash values themselves will fail miserably.  And as Jon commented,
> we strongly discourage users attempting to manipulate the hash values.
>
> I think the use-case that Will mentioned using Tuple Sketches is similar
> to the Engagement Example
> <https://datasketches.apache.org/docs/Tuple/TupleEngagementExample.html> (Will,
> correct me if I'm wrong! ).  This case assumes that the input domain can be
> neatly divided into a small finite number of regions (in the example, 30
> days).  It is not clear that Yiifeger's use-case qualifies.  I think what
> he wants is to be able to sketch the entire domain of continuous
> numbers between 0.0 and 10.0, and then after the sketch is loaded, somehow
> apply some query to the sketch to obtain the number of unique values
> between (pi() and 2*pi()).  I don't think a Tuple sketch would work here,
> nor can I think of a general solution to this problem.
>
> Lee.
>
> On Thu, May 20, 2021 at 9:55 AM Jon Malkin <jm...@apache.org> wrote:
>
>> Echoing what Will said, you should use a tuple sketch to store the raw
>> values. You want to avoid any situation where you're storing hash values
>> yourself.
>>
>>   jon
>>
>> On Thu, May 20, 2021 at 7:09 AM Will Lauer
>> <wl...@verizonmedia.com.invalid> wrote:
>>
>>> If you want to do something like this, you should be looking at tuple
>>> sketches. Tuple sketches use the same KVM algorithm as Theta sketches, but
>>> allow you to associate extra data with each entry. You can then apply
>>> operations like filtering based on the tuple value, getting the estimate
>>> for only those values that satisfy your tuple predicate. We use tuple
>>> sketches to perform a very similar calculation to the one you are
>>> describing.
>>>
>>> Will
>>>
>>> <http://www.verizonmedia.com>
>>>
>>> Will Lauer
>>>
>>> Senior Principal Architect, Audience & Advertising Reporting
>>> Data Platforms & Systems Engineering
>>>
>>> M 508 561 6427
>>> 1908 S. First St
>>> Champaign, IL 61822
>>>
>>> <http://www.facebook.com/verizonmedia>
>>> <http://twitter.com/verizonmedia>
>>> <https://www.linkedin.com/company/verizon-media/>
>>> <http://www.instagram.com/verizonmedia>
>>>
>>>
>>>
>>> On Thu, May 20, 2021 at 6:27 AM yiifeger wu <va...@gmail.com>
>>> wrote:
>>>
>>>> Hi Lee,
>>>>    Thanks for your reply at first.
>>>>    As for what I proposed, let's take a look for a simple example
>>>>
>>>>    Question: get the distinct value of that satisfy (vi > 5 && vi < 7)
>>>> in the below array
>>>>    Input: V: [10,8,7,7,1,4,5,6,8,9]
>>>>
>>>>    First, if we wanna get the distinct value for the whole data through
>>>> KMV, we should get the hash value for each value in the array V and keep
>>>> the smallest K hashes。
>>>>    In this example， let K =3。
>>>>    V:           [10 , 8   ,  7 ,  7 , 1   , 4   , 5    ,6    ,8   ,9 ]
>>>>    hash:         [0.2, 0.6, 0.3, 0.3, 0.25, 0.34, 0.21 ,0.21 ,0.23,0.12
>>>> ]
>>>>    the smallest 3 hash : [0.12, 0.2, 0.21]
>>>>    (Of course, the real hash function will be more random and evenly
>>>> distributed)
>>>>
>>>>    Secondly, we estimate the distinct value for the whole data,
>>>>     NDV = (3-1)/0.21 ~= 9.5
>>>>
>>>>    Then，if we wanna get the distinct value of that satisfy (vi > 5 &&
>>>> vi < 7) , we can realize it through storing extra 3 origin value for each
>>>> smallest hash：
>>>>     smallest 3 hash:   [0.12, 0.2, 0.21]
>>>>     smallest 3 values: [9   , 10 , 6]
>>>>     the value satisfy (vi > 5 && vi < 7) is [6]
>>>>
>>>>    At last, we can estimate the distinct value of that (vi > 5 && vi <
>>>> 7):
>>>>     NDV(vi > 5 && vi < 7) = 1/3 * 9.5 ~= 3.16
>>>>   in which 1.3 represents the fraction of the smallest 3 hash, that
>>>> satisfy (vi > 5 && vi < 7) .
>>>>
>>>>
>>>> leerho <le...@gmail.com> 于2021年5月20日周四 上午7:50写道：
>>>>
>>>>> Yiifeger,
>>>>> I'm sorry, but I'm having difficulty in understanding your question
>>>>> or what you are trying to do.
>>>>>
>>>>> In our library, all of our Theta Sketches are derivatives of the KMV
>>>>> sketch as explained on our website.  And, on our website under *Sketch
>>>>> Families / Distinct Counting / Theta Sketches / Theta Sketch Theory*
>>>>> you will find several documents that explain the math behind theta
>>>>> sketches. Of particular relevance would be the original published paper on
>>>>> the Theta Sketch Framework
>>>>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__arxiv.org_abs_1510.01455v2&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=vGHo2vqhE2ZeS_hHdb4Y3eoJ4WjVKhEg5Xld1w9ptEQ&m=DWIYW6tKNxOsHfyk_oyTGrxFkinjW3XwCwQTdSCGc7A&s=0-Tqpi8xjR_17PMhGfORmBlIQZ70P5jrjd8F2ENuFP8&e=>.
>>>>> A simplified development of the math can be found in the short paper Theta
>>>>> Sketch Equations
>>>>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_datasketches-2Dwebsite_blob_master_docs_pdf_ThetaSketchEquations.pdf&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=vGHo2vqhE2ZeS_hHdb4Y3eoJ4WjVKhEg5Xld1w9ptEQ&m=DWIYW6tKNxOsHfyk_oyTGrxFkinjW3XwCwQTdSCGc7A&s=WSAs5sqgluz5ZyA5pEe4doacA9Dcoc0BPD8aRgdhkmY&e=>,
>>>>> also from the website.
>>>>>
>>>>> I hope this helps,
>>>>>
>>>>> Lee.
>>>>>
>>>>> On Wed, May 19, 2021 at 9:39 AM yiifeger wu <va...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>      I recently learned about the DataSketch project that is so
>>>>>> brilliant, but questions occurred when prepared to utilize it.
>>>>>>      I want to get the count of distinct values for a range query in
>>>>>> my project. After some study about the KMV algorithm according to the
>>>>>> introduction in DataSketch project, we propose an adjusted KMV
>>>>>> algorithm to solve it.
>>>>>>       In origin KMV, it only stores K  hash_values and then computes
>>>>>> the NDV through the average density. So what if we store extra origin
>>>>>> values for which hash_value contained by the k -Minimum hash_values ?  So
>>>>>> we can estimate the distinct value of the range query through
>>>>>>
>>>>>>>           *  ndv_in_the_range = ( ndv_in_range_for_k_minimum / k)
>>>>>>> *  total_ndv*
>>>>>>
>>>>>>
>>>>>>     So if the idea works and the Sketch does not  implement it, could
>>>>>> you give some advice
>>>>>> on how to implement it in this project (P.s prefer the java version).
>>>>>>      Thanks for your help in advance!
>>>>>>
>>>>>>
>>>>
>>>> --
>>>> Best Regards
>>>> Yifei Wu
>>>>
>>>

Re: [E] Re: Proposal to get the NDV of a Range query through KMV

Posted by leerho <le...@gmail.com>.

If I understand the use case correctly, perhaps one way to handle this is
to filter the values prior to feeding them into a sketch.  If I have
millions of input values between 0.0 and 10.0 and only wish to uniquely
count the values > 5 and < 7, then do this:

if (vi > 5 && vi < 7) { sketch.update(vi); }


There is no ordering or distance relationship between the hash of 5 and the
hash of 7, nor can one assume that the number of hash values between the
hash of 5 and the hash of 7 has any relationship to the number of unique
input values between 5 and 7.  So any attempt to perform filtering on the
hash values themselves will fail miserably.  And as Jon commented, we
strongly discourage users attempting to manipulate the hash values.

I think the use-case that Will mentioned using Tuple Sketches is similar to
the Engagement Example
<https://datasketches.apache.org/docs/Tuple/TupleEngagementExample.html> (Will,
correct me if I'm wrong! ).  This case assumes that the input domain can be
neatly divided into a small finite number of regions (in the example, 30
days).  It is not clear that Yiifeger's use-case qualifies.  I think what
he wants is to be able to sketch the entire domain of continuous
numbers between 0.0 and 10.0, and then after the sketch is loaded, somehow
apply some query to the sketch to obtain the number of unique values
between (pi() and 2*pi()).  I don't think a Tuple sketch would work here,
nor can I think of a general solution to this problem.

Lee.

On Thu, May 20, 2021 at 9:55 AM Jon Malkin <jm...@apache.org> wrote:

> Echoing what Will said, you should use a tuple sketch to store the raw
> values. You want to avoid any situation where you're storing hash values
> yourself.
>
>   jon
>
> On Thu, May 20, 2021 at 7:09 AM Will Lauer <wl...@verizonmedia.com.invalid>
> wrote:
>
>> If you want to do something like this, you should be looking at tuple
>> sketches. Tuple sketches use the same KVM algorithm as Theta sketches, but
>> allow you to associate extra data with each entry. You can then apply
>> operations like filtering based on the tuple value, getting the estimate
>> for only those values that satisfy your tuple predicate. We use tuple
>> sketches to perform a very similar calculation to the one you are
>> describing.
>>
>> Will
>>
>> <http://www.verizonmedia.com>
>>
>> Will Lauer
>>
>> Senior Principal Architect, Audience & Advertising Reporting
>> Data Platforms & Systems Engineering
>>
>> M 508 561 6427
>> 1908 S. First St
>> Champaign, IL 61822
>>
>> <http://www.facebook.com/verizonmedia>
>> <http://twitter.com/verizonmedia>
>> <https://www.linkedin.com/company/verizon-media/>
>> <http://www.instagram.com/verizonmedia>
>>
>>
>>
>> On Thu, May 20, 2021 at 6:27 AM yiifeger wu <va...@gmail.com> wrote:
>>
>>> Hi Lee,
>>>    Thanks for your reply at first.
>>>    As for what I proposed, let's take a look for a simple example
>>>
>>>    Question: get the distinct value of that satisfy (vi > 5 && vi < 7)
>>> in the below array
>>>    Input: V: [10,8,7,7,1,4,5,6,8,9]
>>>
>>>    First, if we wanna get the distinct value for the whole data through
>>> KMV, we should get the hash value for each value in the array V and keep
>>> the smallest K hashes。
>>>    In this example， let K =3。
>>>    V:           [10 , 8   ,  7 ,  7 , 1   , 4   , 5    ,6    ,8   ,9 ]
>>>    hash:         [0.2, 0.6, 0.3, 0.3, 0.25, 0.34, 0.21 ,0.21 ,0.23,0.12 ]
>>>    the smallest 3 hash : [0.12, 0.2, 0.21]
>>>    (Of course, the real hash function will be more random and evenly
>>> distributed)
>>>
>>>    Secondly, we estimate the distinct value for the whole data,
>>>     NDV = (3-1)/0.21 ~= 9.5
>>>
>>>    Then，if we wanna get the distinct value of that satisfy (vi > 5 && vi
>>> < 7) , we can realize it through storing extra 3 origin value for each
>>> smallest hash：
>>>     smallest 3 hash:   [0.12, 0.2, 0.21]
>>>     smallest 3 values: [9   , 10 , 6]
>>>     the value satisfy (vi > 5 && vi < 7) is [6]
>>>
>>>    At last, we can estimate the distinct value of that (vi > 5 && vi <
>>> 7):
>>>     NDV(vi > 5 && vi < 7) = 1/3 * 9.5 ~= 3.16
>>>   in which 1.3 represents the fraction of the smallest 3 hash, that
>>> satisfy (vi > 5 && vi < 7) .
>>>
>>>
>>> leerho <le...@gmail.com> 于2021年5月20日周四 上午7:50写道：
>>>
>>>> Yiifeger,
>>>> I'm sorry, but I'm having difficulty in understanding your question or
>>>> what you are trying to do.
>>>>
>>>> In our library, all of our Theta Sketches are derivatives of the KMV
>>>> sketch as explained on our website.  And, on our website under *Sketch
>>>> Families / Distinct Counting / Theta Sketches / Theta Sketch Theory*
>>>> you will find several documents that explain the math behind theta
>>>> sketches. Of particular relevance would be the original published paper on
>>>> the Theta Sketch Framework
>>>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__arxiv.org_abs_1510.01455v2&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=vGHo2vqhE2ZeS_hHdb4Y3eoJ4WjVKhEg5Xld1w9ptEQ&m=DWIYW6tKNxOsHfyk_oyTGrxFkinjW3XwCwQTdSCGc7A&s=0-Tqpi8xjR_17PMhGfORmBlIQZ70P5jrjd8F2ENuFP8&e=>.
>>>> A simplified development of the math can be found in the short paper Theta
>>>> Sketch Equations
>>>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_datasketches-2Dwebsite_blob_master_docs_pdf_ThetaSketchEquations.pdf&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=vGHo2vqhE2ZeS_hHdb4Y3eoJ4WjVKhEg5Xld1w9ptEQ&m=DWIYW6tKNxOsHfyk_oyTGrxFkinjW3XwCwQTdSCGc7A&s=WSAs5sqgluz5ZyA5pEe4doacA9Dcoc0BPD8aRgdhkmY&e=>,
>>>> also from the website.
>>>>
>>>> I hope this helps,
>>>>
>>>> Lee.
>>>>
>>>> On Wed, May 19, 2021 at 9:39 AM yiifeger wu <va...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>      I recently learned about the DataSketch project that is so
>>>>> brilliant, but questions occurred when prepared to utilize it.
>>>>>      I want to get the count of distinct values for a range query in
>>>>> my project. After some study about the KMV algorithm according to the
>>>>> introduction in DataSketch project, we propose an adjusted KMV
>>>>> algorithm to solve it.
>>>>>       In origin KMV, it only stores K  hash_values and then computes
>>>>> the NDV through the average density. So what if we store extra origin
>>>>> values for which hash_value contained by the k -Minimum hash_values ?  So
>>>>> we can estimate the distinct value of the range query through
>>>>>
>>>>>>           *  ndv_in_the_range = ( ndv_in_range_for_k_minimum / k)
>>>>>> *  total_ndv*
>>>>>
>>>>>
>>>>>     So if the idea works and the Sketch does not  implement it, could
>>>>> you give some advice
>>>>> on how to implement it in this project (P.s prefer the java version).
>>>>>      Thanks for your help in advance!
>>>>>
>>>>>
>>>
>>> --
>>> Best Regards
>>> Yifei Wu
>>>
>>

Re: [E] Re: Proposal to get the NDV of a Range query through KMV

Posted by Jon Malkin <jm...@apache.org>.

Echoing what Will said, you should use a tuple sketch to store the raw
values. You want to avoid any situation where you're storing hash values
yourself.

  jon

On Thu, May 20, 2021 at 7:09 AM Will Lauer <wl...@verizonmedia.com.invalid>
wrote:

> If you want to do something like this, you should be looking at tuple
> sketches. Tuple sketches use the same KVM algorithm as Theta sketches, but
> allow you to associate extra data with each entry. You can then apply
> operations like filtering based on the tuple value, getting the estimate
> for only those values that satisfy your tuple predicate. We use tuple
> sketches to perform a very similar calculation to the one you are
> describing.
>
> Will
>
> <http://www.verizonmedia.com>
>
> Will Lauer
>
> Senior Principal Architect, Audience & Advertising Reporting
> Data Platforms & Systems Engineering
>
> M 508 561 6427
> 1908 S. First St
> Champaign, IL 61822
>
> <http://www.facebook.com/verizonmedia>   <http://twitter.com/verizonmedia>
>    <https://www.linkedin.com/company/verizon-media/>
> <http://www.instagram.com/verizonmedia>
>
>
>
> On Thu, May 20, 2021 at 6:27 AM yiifeger wu <va...@gmail.com> wrote:
>
>> Hi Lee,
>>    Thanks for your reply at first.
>>    As for what I proposed, let's take a look for a simple example
>>
>>    Question: get the distinct value of that satisfy (vi > 5 && vi < 7) in
>> the below array
>>    Input: V: [10,8,7,7,1,4,5,6,8,9]
>>
>>    First, if we wanna get the distinct value for the whole data through
>> KMV, we should get the hash value for each value in the array V and keep
>> the smallest K hashes。
>>    In this example， let K =3。
>>    V:           [10 , 8   ,  7 ,  7 , 1   , 4   , 5    ,6    ,8   ,9 ]
>>    hash:         [0.2, 0.6, 0.3, 0.3, 0.25, 0.34, 0.21 ,0.21 ,0.23,0.12 ]
>>    the smallest 3 hash : [0.12, 0.2, 0.21]
>>    (Of course, the real hash function will be more random and evenly
>> distributed)
>>
>>    Secondly, we estimate the distinct value for the whole data,
>>     NDV = (3-1)/0.21 ~= 9.5
>>
>>    Then，if we wanna get the distinct value of that satisfy (vi > 5 && vi
>> < 7) , we can realize it through storing extra 3 origin value for each
>> smallest hash：
>>     smallest 3 hash:   [0.12, 0.2, 0.21]
>>     smallest 3 values: [9   , 10 , 6]
>>     the value satisfy (vi > 5 && vi < 7) is [6]
>>
>>    At last, we can estimate the distinct value of that (vi > 5 && vi < 7):
>>     NDV(vi > 5 && vi < 7) = 1/3 * 9.5 ~= 3.16
>>   in which 1.3 represents the fraction of the smallest 3 hash, that
>> satisfy (vi > 5 && vi < 7) .
>>
>>
>> leerho <le...@gmail.com> 于2021年5月20日周四 上午7:50写道：
>>
>>> Yiifeger,
>>> I'm sorry, but I'm having difficulty in understanding your question or
>>> what you are trying to do.
>>>
>>> In our library, all of our Theta Sketches are derivatives of the KMV
>>> sketch as explained on our website.  And, on our website under *Sketch
>>> Families / Distinct Counting / Theta Sketches / Theta Sketch Theory*
>>> you will find several documents that explain the math behind theta
>>> sketches. Of particular relevance would be the original published paper on
>>> the Theta Sketch Framework
>>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__arxiv.org_abs_1510.01455v2&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=vGHo2vqhE2ZeS_hHdb4Y3eoJ4WjVKhEg5Xld1w9ptEQ&m=DWIYW6tKNxOsHfyk_oyTGrxFkinjW3XwCwQTdSCGc7A&s=0-Tqpi8xjR_17PMhGfORmBlIQZ70P5jrjd8F2ENuFP8&e=>.
>>> A simplified development of the math can be found in the short paper Theta
>>> Sketch Equations
>>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_datasketches-2Dwebsite_blob_master_docs_pdf_ThetaSketchEquations.pdf&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=vGHo2vqhE2ZeS_hHdb4Y3eoJ4WjVKhEg5Xld1w9ptEQ&m=DWIYW6tKNxOsHfyk_oyTGrxFkinjW3XwCwQTdSCGc7A&s=WSAs5sqgluz5ZyA5pEe4doacA9Dcoc0BPD8aRgdhkmY&e=>,
>>> also from the website.
>>>
>>> I hope this helps,
>>>
>>> Lee.
>>>
>>> On Wed, May 19, 2021 at 9:39 AM yiifeger wu <va...@gmail.com>
>>> wrote:
>>>
>>>> Hi all,
>>>>      I recently learned about the DataSketch project that is so
>>>> brilliant, but questions occurred when prepared to utilize it.
>>>>      I want to get the count of distinct values for a range query in
>>>> my project. After some study about the KMV algorithm according to the
>>>> introduction in DataSketch project, we propose an adjusted KMV
>>>> algorithm to solve it.
>>>>       In origin KMV, it only stores K  hash_values and then computes
>>>> the NDV through the average density. So what if we store extra origin
>>>> values for which hash_value contained by the k -Minimum hash_values ?  So
>>>> we can estimate the distinct value of the range query through
>>>>
>>>>>           *  ndv_in_the_range = ( ndv_in_range_for_k_minimum / k)  *
>>>>> total_ndv*
>>>>
>>>>
>>>>     So if the idea works and the Sketch does not  implement it, could
>>>> you give some advice
>>>> on how to implement it in this project (P.s prefer the java version).
>>>>      Thanks for your help in advance!
>>>>
>>>>
>>
>> --
>> Best Regards
>> Yifei Wu
>>
>