Posted to users@datasketches.apache.org by Marko Mušnjak <ma...@gmail.com> on 2020/08/13 15:40:40 UTC

HLL Union and lgK config

Hi,

Could someone help me understand a behavior I see when trying to union some
HLL sketches?

I have 14 HLL sketches, and I know the exact unique counts for each of
them. All the individual sketches give estimates within 2% of the exact
counts.

When I try to create a union, using the default lgMaxK parameter results in
a total estimate that is way off (25% larger than the exact count).

However, reducing the lgMaxK parameter in the union to 4 or 5 gives results
that are within 2.5% of the exact counts.

Also, one particular sketch seems to cause the final estimate to jump - not
adding that sketch to the union keeps the result close to the exact count.

Am I just seeing a very bad random error, or is there anything I'm doing
wrong with the unions?

Running on Java, using version 1.3.0. Just in case, the sketches are in the
linked gist (hex encoded, one per line):
https://gist.github.com/mmusnjak/c00a72b3dfbc52e780c2980acfd98351
and the exact counts:
https://gist.github.com/mmusnjak/dcbff67101be6cfc28ba01e63e41f73c

Thank you!
Marko Musnjak

Re: [E] Re: HLL Union and lgK config

Posted by leerho <le...@gmail.com>.
Thanks for the update!

On Mon, Sep 14, 2020 at 12:31 AM Marko Mušnjak <ma...@gmail.com>
wrote:

> Hi,
> I just wanted to confirm that simply converting the strings to charArray
> worked fine - the sketches from the hive library merged with the kstreams
> sketches now produce correct results.
>
> Thanks again for the help!
>
> On Fri, 14 Aug 2020 at 22:51, Marko Mušnjak <ma...@gmail.com>
> wrote:
>
>> Hi,
>>
>> It does seem that the first two days (probably from Spark+Hive UDFs),
>> merged by themselves, closely match the exact count of 11034. The other
>> 12 days (built using Kafka Streams), taken together, also closely match
>> the exact count for the period.
>>
>> That would mean we have our cause here.
>>
>> Now to discover how strings are represented in Spark's input files and in
>> Avro records in Kafka... I see the
>> org.apache.datasketches.hive.hll.SketchState::update converts strings to
>> char array, while just updating with String in
>> org.apache.datasketches.hll.BaseHllSketch::update  first converts to UTF-8
>> and hashes the resulting byte array. Maybe trying with converting strings
>> in the Kafka Streams app to char[] will be a good first step.
>>
>> I'll give that a try and report back.
>>
>> Thanks everyone for your help in finding the source of this!
>>
>> Kind regards,
>> Marko
>>
>> On Fri, 14 Aug 2020 at 20:58, leerho <le...@gmail.com> wrote:
>>
>>> Hi Marko,
>>>
>>> As I stated before the first 2 sketches are the result of union
>>> operations, while the rest are not.  I get the following:
>>>
>>> All 14 sketches : 34530
>>> Without the first day : 27501; your count 24890; Error = 10.5%.  This
>>> is already way off. It represents an error of nearly 7 standard deviations,
>>> which is huge!
>>> Without the first and second day : 22919; your count 22989; Error =
>>> -0.3%.  This is well within the error bounds.
>>>
>>> I get the same results with Library versions 1.2.0 and 1.3.0 and we get
>>> the same results with our C++ library.  Also, the C++ library was
>>> redesigned from the ground up.  I think it is highly unlikely we would have
>>> such a serious bug in all three versions without it being detected
>>> elsewhere.
>>>
>>> I think Alex is on the right track.  If you encode the same input IDs
>>> differently in two different environments, they are effectively distinct
>>> from each other, which causes the unique count to go up.
>>>
>>> Please let us know what you find out.
>>>
>>> Cheers,
>>>
>>> Lee.
>>>
>>>
>>>
>>>
>>> On Fri, Aug 14, 2020 at 9:45 AM Alexander Saydakov <
>>> saydakov@verizonmedia.com> wrote:
>>>
>>>> Since you are mixing sketches built in different environments, have you
>>>> ever tested that the input strings are hashed the same way? There is a
>>>> chance that strings might be represented differently in Hive and Spark, and
>>>> therefore the resulting sketches might be disjoint while you might believe
>>>> that they should represent overlapping sets. The crucial part of these
>>>> sketches is the MurmurHash3 hash of the input. If the hashes are different,
>>>> the sketches are not compatible. They will represent disjoint sets.
>>>> I would suggest trying a simple test: build sketches from a few
>>>> predefined strings like "a", "b" and "c" in both systems and check that
>>>> the union of those sketches does not grow.
>>>>
>>>> On Fri, Aug 14, 2020 at 9:13 AM Marko Mušnjak <ma...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> The sketches are string-fed.
>>>>>
>>>>> Some of the sketches are built using Spark and the Hive functions from
>>>>> the datasketches library, while others are built using a kafka streams job.
>>>>> It's quite likely the covered period contains some sketches built by Spark
>>>>> and some by the streaming job, but I can't tell where the exact cutoff was.
>>>>> The Spark job is using
>>>>> org.apache.datasketches.hive.hll.DataToSketchUDAF
>>>>> The streaming job is building the sketches through Union objects
>>>>> (receives a stream of sketches, makes unions out of individual pairs,
>>>>> forwards the result as sketch).
>>>>>
>>>>> After some adjustments to the queries I'm running to get the exact
>>>>> counts (to take care of local times, etc.), these should be the correct
>>>>> values with excluded days:
>>>>> Without first day: 24890
>>>>> Without first and second day: 22989
>>>>>
>>>>> Thanks,
>>>>> Marko
>>>>>
>>>>>
>>>>> On Fri, 14 Aug 2020 at 17:08, leerho <le...@gmail.com> wrote:
>>>>>
>>>>>> Hi Marko,
>>>>>> I notice that the first two sketches are the result of union
>>>>>> operations, while the remaining sketches are pure streaming sketches.
>>>>>> Could you perform Jon's request again except excluding the first two
>>>>>> sketches?
>>>>>>
>>>>>> Just to cover the bases, could you explain the types of the
>>>>>> data items that are being fed to the sketches?  Are your identifiers
>>>>>> strings, longs or what?
>>>>>>
>>>>>> Thanks,
>>>>>> Lee.
>>>>>>
>>>>>> On Thu, Aug 13, 2020 at 11:57 PM Jon Malkin <jo...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks! We're investigating. We'll let you know if we have further
>>>>>>> questions.
>>>>>>>
>>>>>>>   jon
>>>>>>>
>>>>>>> On Thu, Aug 13, 2020, 11:40 PM Marko Mušnjak <
>>>>>>> marko.musnjak@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Jon,
>>>>>>>> The first sketch is the one where I see the jump. The exact count
>>>>>>>> without the first sketch is 24765.
>>>>>>>>
>>>>>>>> The result for lgK=12 without the first sketch is 11% off, lgK=5 is
>>>>>>>> within 2%.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Marko
>>>>>>>>
>>>>>>>> On Fri, 14 Aug 2020 at 00:24, Jon Malkin <jo...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Marko,
>>>>>>>>>
>>>>>>>>> Could you please let us know two more things:
>>>>>>>>> 1) Which is the one particular sketch that causes the estimate to
>>>>>>>>> jump?
>>>>>>>>> 2) What is the exact unique count of the others without that
>>>>>>>>> sketch?
>>>>>>>>>
>>>>>>>>> It sort of seems like the first sketch, but it's hard to know for
>>>>>>>>> sure since we don't know the true leave-one-out exact counts.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>   jon
>>>>>>>>>
>>>>>>>>> On Thu, Aug 13, 2020 at 8:41 AM Marko Mušnjak <
>>>>>>>>> marko.musnjak@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> Could someone help me understand a behavior I see when trying to
>>>>>>>>>> union some HLL sketches?
>>>>>>>>>>
>>>>>>>>>> I have 14 HLL sketches, and I know the exact unique counts for
>>>>>>>>>> each of them. All the individual sketches give estimates within 2% of the
>>>>>>>>>> exact counts.
>>>>>>>>>>
>>>>>>>>>> When I try to create a union, using the default lgMaxK parameter
>>>>>>>>>> results in a total estimate that is way off (25% larger than the
>>>>>>>>>> exact count).
>>>>>>>>>>
>>>>>>>>>> However, reducing the lgMaxK parameter in the union to 4 or 5
>>>>>>>>>> gives results that are within 2.5% of the exact counts.
>>>>>>>>>>
>>>>>>>>>> Also, one particular sketch seems to cause the final estimate to
>>>>>>>>>> jump - not adding that sketch to the union keeps the result close to the
>>>>>>>>>> exact count.
>>>>>>>>>>
>>>>>>>>>> Am I just seeing a very bad random error, or is there anything
>>>>>>>>>> I'm doing wrong with the unions?
>>>>>>>>>>
>>>>>>>>>> Running on Java, using version 1.3.0. Just in case, the sketches
>>>>>>>>>> are in the linked gist (hex encoded, one per line):
>>>>>>>>>> https://gist.github.com/mmusnjak/c00a72b3dfbc52e780c2980acfd98351
>>>>>>>>>> and the exact counts:
>>>>>>>>>> https://gist.github.com/mmusnjak/dcbff67101be6cfc28ba01e63e41f73c
>>>>>>>>>>
>>>>>>>>>> Thank you!
>>>>>>>>>> Marko Musnjak
>>>>>>>>>>
>>>>>>>>>>
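
The error figures in Lee's quoted analysis can be reproduced with a few lines of arithmetic. The sketch below recomputes the two relative errors from the numbers given in the thread and converts the first one into standard deviations. The relative standard error formula 1.04 / sqrt(2^lgK) is the textbook HyperLogLog figure and is an assumption here; the thread does not say which error model Lee used. The class and method names are illustrative only.

```java
// Worked arithmetic for the figures quoted in the thread.
// ASSUMPTION: the textbook HLL relative standard error, 1.04 / sqrt(k),
// with k = 2^lgK. The thread does not state which error model was used.
final class UnionErrorCheck {

    /** Relative error of an estimate against an exact count, as a fraction. */
    static double relativeError(double estimate, double exact) {
        return (estimate - exact) / exact;
    }

    /** Assumed relative standard error for a sketch with 2^lgK registers. */
    static double assumedRse(int lgK) {
        return 1.04 / Math.sqrt((double) (1L << lgK));
    }

    public static void main(String[] args) {
        // Estimates and exact counts as reported in the thread.
        double withoutDay1 = relativeError(27501, 24890);
        double withoutDays1And2 = relativeError(22919, 22989);

        System.out.printf("without first day:            %.1f%%%n", 100 * withoutDay1);
        System.out.printf("without first and second day: %.1f%%%n", 100 * withoutDays1And2);

        // How the assumed RSE changes with lgK (4 and 5 were tried in the thread).
        for (int lgK : new int[] {4, 5, 12}) {
            System.out.printf("lgK=%d -> assumed RSE %.2f%%%n", lgK, 100 * assumedRse(lgK));
        }

        // Size of the first error in units of the assumed RSE at lgK=12.
        System.out.printf("standard deviations at lgK=12: %.1f%n",
                withoutDay1 / assumedRse(12));
    }
}
```

Under this assumed model the first error comes out at roughly 6.5 standard deviations for lgK=12, the same order as the "nearly 7" quoted above. The same formula gives an RSE of about 26% at lgK=4 and about 18% at lgK=5, so one possible reading is that the close agreement seen at those settings was within the noise rather than evidence of a correct union.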

Re: [E] Re: HLL Union and lgK config

Posted by Marko Mušnjak <ma...@gmail.com>.
Hi,
I just wanted to confirm that simply converting the strings to charArray
worked fine - the sketches from the hive library merged with the kstreams
sketches now produce correct results.

Thanks again for the help!
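
The effect behind this fix, and the "a", "b", "c" check Alexander suggested, can be sketched without the library. The example below is a stand-in, not DataSketches code: exact sets of hash values play the role of sketches, Arrays.hashCode plays the role of MurmurHash3, and the little-endian byte layout of the char[] path is an illustrative choice. All class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Stand-in demo: the same IDs hashed from two different byte representations
// end up as disjoint sets, so their union is larger than either input.
final class EncodingMismatchDemo {

    /** Bytes seen by the hash when the string is first encoded as UTF-8. */
    static byte[] utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Bytes seen by the hash when the string is fed as char[] (2 bytes per UTF-16 unit). */
    static byte[] charArrayBytes(String s) {
        char[] chars = s.toCharArray();
        // ASSUMPTION: little-endian layout, chosen only for illustration.
        ByteBuffer buf = ByteBuffer.allocate(2 * chars.length).order(ByteOrder.LITTLE_ENDIAN);
        for (char c : chars) {
            buf.putChar(c);
        }
        return buf.array();
    }

    /** Stand-in for the library's hash; any byte-based hash shows the same effect. */
    static int standInHash(byte[] data) {
        return Arrays.hashCode(data);
    }

    /** Exact set of hashes, playing the role of a sketch built in one environment. */
    static Set<Integer> hashedSet(List<String> ids, boolean asCharArray) {
        Set<Integer> out = new HashSet<>();
        for (String id : ids) {
            out.add(standInHash(asCharArray ? charArrayBytes(id) : utf8Bytes(id)));
        }
        return out;
    }

    /** Number of distinct hashes after merging two sets. */
    static int unionSize(Set<Integer> a, Set<Integer> b) {
        Set<Integer> union = new HashSet<>(a);
        union.addAll(b);
        return union.size();
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("a", "b", "c");
        Set<Integer> utf8 = hashedSet(ids, false);
        Set<Integer> chars = hashedSet(ids, true);

        System.out.println("same encoding on both sides: " + unionSize(utf8, utf8));
        System.out.println("UTF-8 merged with char[]:     " + unionSize(utf8, chars));
    }
}
```

With matching encodings the union of the three IDs stays at three entries. With mismatched encodings it grows to six, which is the symptom seen in the merged sketches before the char[] change.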


Re: [E] Re: HLL Union and lgK config

Posted by leerho <le...@gmail.com>.
I have placed a [DISCUSS] thread on our dev@datasketches.apache.org list if
you wish to suggest some ideas! :)

On Fri, Aug 14, 2020 at 4:06 PM leerho <le...@gmail.com> wrote:

> The other option would be to deprecate the Hive SketchState update(...)
> method and create a "newUpdate(...)" method that has strings encoded with
> UTF-8.  And also document the reason why.   Any other ideas?
>
> On Fri, Aug 14, 2020 at 4:03 PM leerho <le...@gmail.com> wrote:
>
>> Yep!  It turns out that there is already an issue
>> <https://github.com/apache/incubator-datasketches-hive/issues/54> on
>> this that was reported 18 days ago. Changing this will be fraught with
>> problems as other Hive users may have a history of sketches created with
>> Strings encoded as char[].  I'm not sure I see an easy solution other than
>> documenting it & putting warnings everywhere.

Re: [E] Re: HLL Union and lgK config

Posted by leerho <le...@gmail.com>.
The other option would be to deprecate the Hive SketchState update(...)
method and create a "newUpdate(...)" method that has strings encoded with
UTF-8.  And also document the reason why.   Any other ideas?
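
One way to picture this proposal is a class that keeps the legacy char[] path under a deprecation marker and adds a UTF-8 path next to it. The sketch below is purely hypothetical: it is not the real org.apache.datasketches.hive.hll.SketchState, it records the input that would be hashed instead of updating a sketch, and only the method name newUpdate is taken from the message above.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// HYPOTHETICAL stand-in for illustration; not the Hive SketchState class.
final class HypotheticalSketchState {

    // Records what would be handed to the hash function, for inspection only.
    private final List<Object> hashedInputs = new ArrayList<>();

    /**
     * Legacy path: strings are fed as char[]. Kept so that new data can still
     * be merged with sketches that were built this way in the past.
     *
     * @deprecated strings hashed this way do not match String updates that
     *             encode to UTF-8 first; see {@link #newUpdate(String)}.
     */
    @Deprecated
    void update(String value) {
        hashedInputs.add(value.toCharArray());
    }

    /** Proposed path: strings are encoded as UTF-8 before hashing. */
    void newUpdate(String value) {
        hashedInputs.add(value.getBytes(StandardCharsets.UTF_8));
    }

    /** Inputs recorded so far, in call order. */
    List<Object> hashedInputs() {
        return hashedInputs;
    }

    public static void main(String[] args) {
        HypotheticalSketchState state = new HypotheticalSketchState();
        state.update("a");
        state.newUpdate("a");
        System.out.println("legacy input is char[]: " + (state.hashedInputs().get(0) instanceof char[]));
        System.out.println("new input is byte[]:    " + (state.hashedInputs().get(1) instanceof byte[]));
    }
}
```

Keeping both methods side by side is what makes the documentation step in the message matter: callers need to know which path produced their historical sketches before switching.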

On Fri, Aug 14, 2020 at 4:03 PM leerho <le...@gmail.com> wrote:

> Yep!  It turns out that there is already an issue
> <https://github.com/apache/incubator-datasketches-hive/issues/54> on this
> that was reported 18 days ago. Changing this will be fraught with problems
> as other Hive users may have a history of sketches created with Strings
> encoded as char[].  I'm not sure I see an easy solution other than
> documenting it & putting warnings everywhere.
>
> On Fri, Aug 14, 2020 at 1:51 PM Marko Mušnjak <ma...@gmail.com>
> wrote:
>
>> Hi,
>>
>> It does seem the first two days (probably from Spark+Hive UDFs) merged by
>> themselves, closely match the exact count of 11034. The other 12 days
>> (built using Kafka Streams) taken together also closely match the exact
>> count for the period.
>>
>> That would mean we have our cause here.
>>
>> Now to discover how strings are represented in Spark's input files and in
>> Avro records in Kafka... I see the
>> org.apache.datasketches.hive.hll.SketchState::update converts strings to
>> char array, while just updating with String in
>> org.apache.datasketches.hll.BaseHllSketch::update  first converts to UTF-8
>> and hashes the resulting byte array. Maybe trying with converting strings
>> in the Kafka Streams app to char[] will be a good first step.
>>
>> I'll give that a try and report back.
>>
>> Thanks everyone for your help in finding the source of this!
>>
>> Kind regards,
>> Marko
>>
>> On Fri, 14 Aug 2020 at 20:58, leerho <le...@gmail.com> wrote:
>>
>>> Hi Marko,
>>>
>>> As I stated before the first 2 sketches are the result of union
>>> operations, while the rest are not.  I get the following:
>>>
>>> All 14 sketches : 34530
>>> Without the first day : 27501; your count 24890;  Error = 10.5%   This
>>> is already way off. it represents an error of nearly 7 standard deviations,
>>> which is huge!
>>> Without the first and second day : 22919;  your count 22989; Error =
>>> -0.3%   This is well within the error bounds.
>>>
>>> I get the same results with Library versions 1.2.0 and 1.3.0 and we get
>>> the same results with our C++ library.  Also, the C++ library was
>>> redesigned from the ground up.  I think it is highly unlikely we would have
>>> such a serious bug in all three versions without it being detected
>>> elsewhere.
>>>
>>> I think Alex is on the right track.  If you encode the same input IDs
>>> differently in two different environments they are essentially distinct
>>> from each other causing the unique count to go up.
>>>
>>> Please let us know what you find out.
>>>
>>> Cheers,
>>>
>>> Lee.
>>>
>>>
>>>
>>>
>>> On Fri, Aug 14, 2020 at 9:45 AM Alexander Saydakov <
>>> saydakov@verizonmedia.com> wrote:
>>>
>>>> Since you are mixing sketches built in different environments, have you
>>>> ever tested that the input strings are hashed the same way? There is a
>>>> chance that strings might be represented differently in Hive and Spark, and
>>>> therefore the resulting sketches might be disjoint while you might believe
>>>> that they should represent overlapping sets. The crucial part of these
>>>> sketches is the MurMur3 hash of the input. If hashes are different,
>>>> the sketches are not compatible. They will represent disjoint sets.
>>>> I would suggest trying a simple test: build sketches from a few
>>>> predefined strings like "a", "b" and "c" in both systems and see if the
>>>> union of those sketches does not grow.
>>>>
>>>> On Fri, Aug 14, 2020 at 9:13 AM Marko Mušnjak <ma...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> The sketches are string-fed.
>>>>>
>>>>> Some of the sketches are built using Spark and the Hive functions from
>>>>> the datasketches library, while others are built using a kafka streams job.
>>>>> It's quite likely the covered period contains some sketches built by Spark
>>>>> and some by the streaming job, but I can't tell where the exact cutoff was.
>>>>> The Spark job is using
>>>>> org.apache.datasketches.hive.hll.DataToSketchUDAF
>>>>> The streaming job is building the sketches through Union objects
>>>>> (receives a stream of sketches, makes unions out of individual pairs,
>>>>> forwards the result as sketch).
>>>>>
>>>>> After some adjustments to the queries I'm running to get the exact
>>>>> counts, to take care of local times, etc..., these should be the correct
>>>>> values with excluded days:
>>>>> Without first day: 24890
>>>>> Without first and second day: 22989
>>>>>
>>>>> Thanks,
>>>>> Marko

Re: [E] Re: HLL Union and lgK config

Posted by leerho <le...@gmail.com>.
Yep!  It turns out that there is already an issue
<https://github.com/apache/incubator-datasketches-hive/issues/54> on this
that was reported 18 days ago. Changing this will be fraught with problems
as other Hive users may have a history of sketches created with Strings
encoded as char[].  I'm not sure I see an easy solution other than
documenting it & putting warnings everywhere.

On Fri, Aug 14, 2020 at 1:51 PM Marko Mušnjak <ma...@gmail.com>
wrote:

> Hi,
>
> It does seem that the first two days (probably from Spark+Hive UDFs),
> merged by themselves, closely match the exact count of 11034. The other
> 12 days (built using Kafka Streams), taken together, also closely match
> the exact count for the period.
>
> That would mean we have our cause here.
>
> Now to discover how strings are represented in Spark's input files and in
> Avro records in Kafka... I see that
> org.apache.datasketches.hive.hll.SketchState::update converts strings to
> a char array, while updating with a String in
> org.apache.datasketches.hll.BaseHllSketch::update first converts it to
> UTF-8 and hashes the resulting byte array. Converting the strings in the
> Kafka Streams app to char[] might be a good first step.
>
> I'll give that a try and report back.
>
> Thanks everyone for your help in finding the source of this!
>
> Kind regards,
> Marko

Re: [E] Re: HLL Union and lgK config

Posted by Marko Mušnjak <ma...@gmail.com>.
Hi,

It does seem that the first two days (probably from Spark+Hive UDFs),
merged by themselves, closely match the exact count of 11034. The other
12 days (built using Kafka Streams), taken together, also closely match
the exact count for the period.

That would mean we have our cause here.

Now to discover how strings are represented in Spark's input files and in
Avro records in Kafka... I see that
org.apache.datasketches.hive.hll.SketchState::update converts strings to
a char array, while updating with a String in
org.apache.datasketches.hll.BaseHllSketch::update first converts it to
UTF-8 and hashes the resulting byte array. Converting the strings in the
Kafka Streams app to char[] might be a good first step.
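
To make the mismatch concrete, here is a minimal sketch that uses only the
JDK (no DataSketches calls). It assumes, as described above, that one path
hashes the UTF-8 bytes of the string and the other hashes its 16-bit chars.
The class name, the sample ID and the little-endian byte order used to
display the chars are illustrative assumptions, not taken from the library.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class StringEncodingDemo {

    // UTF-8 bytes: what the thread says the String overload hashes.
    static byte[] utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // The 16-bit chars laid out as bytes. Little-endian order is an
    // assumption made only to show that the raw input differs.
    static byte[] charBytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        String id = "user-42"; // hypothetical identifier
        byte[] asUtf8 = utf8Bytes(id);
        byte[] asChars = charBytes(id);

        System.out.println(asUtf8.length);                  // 7
        System.out.println(asChars.length);                 // 14
        System.out.println(Arrays.equals(asUtf8, asChars)); // false
    }
}
```

Since the two inputs differ in both length and content, a hash function
will almost certainly map them to different values, so both sides have to
agree on a single representation before the sketches can be merged.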

I'll give that a try and report back.

Thanks everyone for your help in finding the source of this!

Kind regards,
Marko

On Fri, 14 Aug 2020 at 20:58, leerho <le...@gmail.com> wrote:

> Hi Marko,
>
> As I stated before, the first 2 sketches are the result of union
> operations, while the rest are not.  I get the following:
>
> All 14 sketches: 34530
> Without the first day: 27501; your count: 24890; error = 10.5%. This is
> already way off. It represents an error of nearly 7 standard deviations,
> which is huge!
> Without the first and second day: 22919; your count: 22989; error =
> -0.3%. This is well within the error bounds.
>
> I get the same results with Library versions 1.2.0 and 1.3.0 and we get
> the same results with our C++ library.  Also, the C++ library was
> redesigned from the ground up.  I think it is highly unlikely we would have
> such a serious bug in all three versions without it being detected
> elsewhere.
>
> I think Alex is on the right track.  If you encode the same input IDs
> differently in two different environments, they are effectively distinct
> from each other, which causes the unique count to go up.
>
> Please let us know what you find out.
>
> Cheers,
>
> Lee.

Re: [E] Re: HLL Union and lgK config

Posted by leerho <le...@gmail.com>.
Hi Marko,

As I stated before, the first 2 sketches are the result of union operations,
while the rest are not.  I get the following:

All 14 sketches: 34530
Without the first day: 27501; your count: 24890; error = 10.5%. This is
already way off. It represents an error of nearly 7 standard deviations,
which is huge!
Without the first and second day: 22919; your count: 22989; error = -0.3%.
This is well within the error bounds.
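
For reference, the error figures above can be reproduced with a few lines
of Java. The relative error is (estimate - exact) / exact. The conversion
to standard deviations is only a rough sketch: it assumes the textbook HLL
relative standard error of about 1.04 / sqrt(K) with K = 2^12 (the lgK=12
mentioned earlier in the thread), and the library's actual error bounds may
differ. The class and method names are hypothetical.

```java
final class EstimateErrorDemo {

    // Signed relative error of an estimate against the exact count.
    static double relativeError(double estimate, double exact) {
        return (estimate - exact) / exact;
    }

    // Assumption: textbook HLL relative standard error, 1.04 / sqrt(2^lgK).
    static double assumedRse(int lgK) {
        return 1.04 / Math.sqrt(1 << lgK);
    }

    public static void main(String[] args) {
        double rse = assumedRse(12);

        // Without the first day: about +10.5%, several standard deviations.
        double withoutFirst = relativeError(27501, 24890);
        System.out.println(100 * withoutFirst);
        System.out.println(withoutFirst / rse);

        // Without the first and second day: about -0.3%, well under one.
        double withoutFirstTwo = relativeError(22919, 22989);
        System.out.println(100 * withoutFirstTwo);
        System.out.println(withoutFirstTwo / rse);
    }
}
```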

I get the same results with Library versions 1.2.0 and 1.3.0 and we get the
same results with our C++ library.  Also, the C++ library was redesigned
from the ground up.  I think it is highly unlikely we would have such a
serious bug in all three versions without it being detected elsewhere.

I think Alex is on the right track.  If you encode the same input IDs
differently in two different environments, they are effectively distinct
from each other, which causes the unique count to go up.

Please let us know what you find out.

Cheers,

Lee.




On Fri, Aug 14, 2020 at 9:45 AM Alexander Saydakov <
saydakov@verizonmedia.com> wrote:

> Since you are mixing sketches built in different environments, have you
> ever tested that the input strings are hashed the same way? There is a
> chance that strings might be represented differently in Hive and Spark, and
> therefore the resulting sketches might be disjoint while you might believe
> that they should represent overlapping sets. The crucial part of these
> sketches is the MurmurHash3 hash of the input. If the hashes are different,
> the sketches are not compatible: they will represent disjoint sets.
> I would suggest trying a simple test: build sketches from a few predefined
> strings like "a", "b" and "c" in both systems and check that the union of
> those sketches does not grow.

Re: [E] Re: HLL Union and lgK config

Posted by Alexander Saydakov <sa...@verizonmedia.com>.
Since you are mixing sketches built in different environments, have you
ever tested that the input strings are hashed the same way? There is a
chance that strings might be represented differently in Hive and Spark, and
therefore the resulting sketches might be disjoint while you might believe
that they should represent overlapping sets. The crucial part of these
sketches is the MurmurHash3 hash of the input. If the hashes are different,
the sketches are not compatible: they will represent disjoint sets.
I would suggest trying a simple test: build sketches from a few predefined
strings like "a", "b" and "c" in both systems and check that the union of
those sketches does not grow.
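
As a rough, library-free illustration of why differently encoded inputs
behave like disjoint sets, the sketch below uses exact sets of byte
sequences as a stand-in for sketches. The class and method names are
hypothetical, and UTF-8 versus UTF-16LE are just two example encodings,
not necessarily what either system uses.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class DisjointUnionDemo {

    // Exact stand-in for a sketch: the set of distinct byte sequences
    // that would be fed to the hash function.
    static Set<String> distinctInputs(List<String> ids, Charset encoding) {
        Set<String> seen = new HashSet<>();
        for (String id : ids) {
            seen.add(Arrays.toString(id.getBytes(encoding)));
        }
        return seen;
    }

    // Exact stand-in for the union of two sketches.
    static int unionSize(Set<String> a, Set<String> b) {
        Set<String> merged = new HashSet<>(a);
        merged.addAll(b);
        return merged.size();
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("a", "b", "c");
        Set<String> systemA = distinctInputs(ids, StandardCharsets.UTF_8);
        Set<String> sameEncoding = distinctInputs(ids, StandardCharsets.UTF_8);
        Set<String> otherEncoding = distinctInputs(ids, StandardCharsets.UTF_16LE);

        System.out.println(unionSize(systemA, sameEncoding));  // 3
        System.out.println(unionSize(systemA, otherEncoding)); // 6
    }
}
```

With real sketches the same check applies to the estimates: if the union
of the two three-item sketches reports about 6 instead of about 3, the two
systems are not feeding identical bytes to the hash.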

On Fri, Aug 14, 2020 at 9:13 AM Marko Mušnjak <ma...@gmail.com>
wrote:

> Hi,
>
> The sketches are string-fed.
>
> Some of the sketches are built using Spark and the Hive functions from the
> datasketches library, while others are built using a Kafka Streams job.
> It's quite likely the covered period contains some sketches built by Spark
> and some by the streaming job, but I can't tell where the exact cutoff was.
> The Spark job is using org.apache.datasketches.hive.hll.DataToSketchUDAF.
> The streaming job is building the sketches through Union objects (receives
> a stream of sketches, makes unions out of individual pairs, forwards the
> result as a sketch).
>
> After some adjustments to the queries I'm running to get the exact counts
> (to take care of local times, etc.), these should be the correct values
> with days excluded:
> Without first day: 24890
> Without first and second day: 22989
>
> Thanks,
> Marko

Re: HLL Union and lgK config

Posted by Marko Mušnjak <ma...@gmail.com>.
Hi,

The sketches are string-fed.

Some of the sketches are built using Spark and the Hive functions from the
datasketches library, while others are built using a Kafka Streams job.
It's quite likely the covered period contains some sketches built by Spark
and some by the streaming job, but I can't tell where the exact cutoff was.
The Spark job is using org.apache.datasketches.hive.hll.DataToSketchUDAF.
The streaming job is building the sketches through Union objects (receives
a stream of sketches, makes unions out of individual pairs, forwards the
result as a sketch).

After some adjustments to the queries I'm running to get the exact counts
(to take care of local times, etc.), these should be the correct values
with days excluded:
Without first day: 24890
Without first and second day: 22989

Thanks,
Marko


On Fri, 14 Aug 2020 at 17:08, leerho <le...@gmail.com> wrote:

> Hi Marko,
> I notice that the first two sketches are the result of union operations,
> while the remaining sketches are pure streaming sketches.
> Could you perform Jon's request again, but excluding the first two
> sketches?
>
> Just to cover the bases, could you explain the types of the
> data items that are being fed to the sketches?  Are your identifiers
> strings, longs or what?
>
> Thanks,
> Lee.

Re: HLL Union and lgK config

Posted by leerho <le...@gmail.com>.
Hi Marko,
I notice that the first two sketches are the result of union operations,
while the remaining sketches are pure streaming sketches.
Could you perform Jon's request again except excluding the first two
sketches?

Just to cover the bases, could you explain the types of the data items that
are being fed to the sketches?  Are your identifiers strings, longs or what?

Thanks,
Lee.
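
The item type matters because a sketch hashes the bytes it is given, so the same identifier fed as a string in one pipeline and as a char[] or long in another can look like two different items to a union. A minimal illustration using only the JDK follows. The class name, the sample identifier, and the use of UTF-16LE as a stand-in for a char[] memory layout are assumptions for illustration, and this does not call the DataSketches API.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ItemEncodingDemo {
    // Bytes a hash function would see if the identifier is fed as UTF-8 text.
    static byte[] utf8Bytes(String id) {
        return id.getBytes(StandardCharsets.UTF_8);
    }

    // Bytes a hash function would see if the identifier is fed as 16-bit chars
    // (UTF-16LE is used here only as a stand-in for a char[] layout).
    static byte[] charArrayBytes(String id) {
        return id.getBytes(StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        String id = "user-42"; // hypothetical identifier
        byte[] a = utf8Bytes(id);
        byte[] b = charArrayBytes(id);
        System.out.println("UTF-8 length:  " + a.length);
        System.out.println("char[] length: " + b.length);
        System.out.println("same bytes:    " + Arrays.equals(a, b));
    }
}
```

If two producers disagree in this way, their sketches would treat shared identifiers as distinct, which could inflate a union estimate even though each sketch is accurate on its own.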

On Thu, Aug 13, 2020 at 11:57 PM Jon Malkin <jo...@gmail.com> wrote:

> Thanks! We're investigating. We'll let you know if we have further
> questions.
>
>   jon
>
> On Thu, Aug 13, 2020, 11:40 PM Marko Mušnjak <ma...@gmail.com>
> wrote:
>
>> Hi Jon,
>> The first sketch is the one where I see the jump. The exact count without
>> the first sketch is 24765.
>>
>> The result for lgK=12 without the first sketch is 11% off, lgK=5 is
>> within 2%.
>>
>> Thanks,
>> Marko
>>
>> On Fri, 14 Aug 2020 at 00:24, Jon Malkin <jo...@gmail.com> wrote:
>>
>>> Hi Marko,
>>>
>>> Could you please let us know two more things:
>>> 1) Which is the one particular sketch that causes the estimate to jump?
>>> 2) What is the exact unique count of the others without that sketch?
>>>
>>> It sort of seems like the first sketch, but it's hard to know for sure
>>> since we don't know the true leave-one-out exact counts.
>>>
>>> Thanks,
>>>   jon
>>>
>>> On Thu, Aug 13, 2020 at 8:41 AM Marko Mušnjak <ma...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> Could someone help me understand a behavior I see when trying to union
>>>> some HLL sketches?
>>>>
>>>> I have 14 HLL sketches, and I know the exact unique counts for each of
>>>> them. All the individual sketches give estimates within 2% of the exact
>>>> counts.
>>>>
>>>> When I try to create a union, using the default lgMaxK parameter
>>>> results in a total estimate that is way off (25% larger than the exact count).
>>>>
>>>> However, reducing the lgMaxK parameter in the union to 4 or 5 gives
>>>> results that are within 2.5% of the exact counts.
>>>>
>>>> Also, one particular sketch seems to cause the final estimate to jump -
>>>> not adding that sketch to the union keeps the result close to the exact
>>>> count.
>>>>
>>>> Am I just seeing a very bad random error, or is there anything I'm
>>>> doing wrong with the unions?
>>>>
>>>> Running on Java, using version 1.3.0. Just in case, the sketches are in
>>>> the linked gist (hex encoded, one per line):
>>>> https://gist.github.com/mmusnjak/c00a72b3dfbc52e780c2980acfd98351
>>>> and the exact counts:
>>>> https://gist.github.com/mmusnjak/dcbff67101be6cfc28ba01e63e41f73c
>>>>
>>>> Thank you!
>>>> Marko Musnjak
>>>>
>>>>

Re: HLL Union and lgK config

Posted by Jon Malkin <jo...@gmail.com>.
Thanks! We're investigating. We'll let you know if we have further
questions.

  jon

On Thu, Aug 13, 2020, 11:40 PM Marko Mušnjak <ma...@gmail.com>
wrote:

> Hi Jon,
> The first sketch is the one where I see the jump. The exact count without
> the first sketch is 24765.
>
> The result for lgK=12 without the first sketch is 11% off, lgK=5 is within
> 2%.
>
> Thanks,
> Marko
>
> On Fri, 14 Aug 2020 at 00:24, Jon Malkin <jo...@gmail.com> wrote:
>
>> Hi Marko,
>>
>> Could you please let us know two more things:
>> 1) Which is the one particular sketch that causes the estimate to jump?
>> 2) What is the exact unique count of the others without that sketch?
>>
>> It sort of seems like the first sketch, but it's hard to know for sure
>> since we don't know the true leave-one-out exact counts.
>>
>> Thanks,
>>   jon
>>
>> On Thu, Aug 13, 2020 at 8:41 AM Marko Mušnjak <ma...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> Could someone help me understand a behavior I see when trying to union
>>> some HLL sketches?
>>>
>>> I have 14 HLL sketches, and I know the exact unique counts for each of
>>> them. All the individual sketches give estimates within 2% of the exact
>>> counts.
>>>
>>> When I try to create a union, using the default lgMaxK parameter results
>>> in a total estimate that is way off (25% larger than the exact count).
>>>
>>> However, reducing the lgMaxK parameter in the union to 4 or 5 gives
>>> results that are within 2.5% of the exact counts.
>>>
>>> Also, one particular sketch seems to cause the final estimate to jump -
>>> not adding that sketch to the union keeps the result close to the exact
>>> count.
>>>
>>> Am I just seeing a very bad random error, or is there anything I'm doing
>>> wrong with the unions?
>>>
>>> Running on Java, using version 1.3.0. Just in case, the sketches are in
>>> the linked gist (hex encoded, one per line):
>>> https://gist.github.com/mmusnjak/c00a72b3dfbc52e780c2980acfd98351
>>> and the exact counts:
>>> https://gist.github.com/mmusnjak/dcbff67101be6cfc28ba01e63e41f73c
>>>
>>> Thank you!
>>> Marko Musnjak
>>>
>>>

Re: HLL Union and lgK config

Posted by Marko Mušnjak <ma...@gmail.com>.
Hi Jon,
The first sketch is the one where I see the jump. The exact count without
the first sketch is 24765.

The result for lgK=12 without the first sketch is 11% off, lgK=5 is within
2%.

Thanks,
Marko
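
For context on these numbers, the textbook HyperLogLog relative standard error is roughly 1.04/sqrt(K), with K = 2^lgK. The sketch below evaluates that rule of thumb. It is the classic bound, not the exact error model of the DataSketches estimators, and the class and method names are illustrative.

```java
class HllErrorDemo {
    // Textbook HyperLogLog relative standard error: 1.04 / sqrt(K), K = 2^lgK.
    static double relativeStandardError(int lgK) {
        return 1.04 / Math.sqrt(1L << lgK);
    }

    // Signed relative error of an estimate against a known exact count.
    static double relativeError(double estimate, double exact) {
        return (estimate - exact) / exact;
    }

    public static void main(String[] args) {
        for (int lgK : new int[] {4, 5, 12}) {
            System.out.printf("lgK=%d  K=%d  RSE=%.2f%%%n",
                lgK, 1L << lgK, 100 * relativeStandardError(lgK));
        }
    }
}
```

Under this rule of thumb, lgK=12 should be accurate to roughly 1.6%, so an 11% miss is far outside normal noise. At lgK=5 the expected error is about 18%, so landing within 2% there may be partly luck rather than a sign that the smaller setting is more correct.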

On Fri, 14 Aug 2020 at 00:24, Jon Malkin <jo...@gmail.com> wrote:

> Hi Marko,
>
> Could you please let us know two more things:
> 1) Which is the one particular sketch that causes the estimate to jump?
> 2) What is the exact unique count of the others without that sketch?
>
> It sort of seems like the first sketch, but it's hard to know for sure
> since we don't know the true leave-one-out exact counts.
>
> Thanks,
>   jon
>
> On Thu, Aug 13, 2020 at 8:41 AM Marko Mušnjak <ma...@gmail.com>
> wrote:
>
>> Hi,
>>
>> Could someone help me understand a behavior I see when trying to union
>> some HLL sketches?
>>
>> I have 14 HLL sketches, and I know the exact unique counts for each of
>> them. All the individual sketches give estimates within 2% of the exact
>> counts.
>>
>> When I try to create a union, using the default lgMaxK parameter results
>> in a total estimate that is way off (25% larger than the exact count).
>>
>> However, reducing the lgMaxK parameter in the union to 4 or 5 gives
>> results that are within 2.5% of the exact counts.
>>
>> Also, one particular sketch seems to cause the final estimate to jump -
>> not adding that sketch to the union keeps the result close to the exact
>> count.
>>
>> Am I just seeing a very bad random error, or is there anything I'm doing
>> wrong with the unions?
>>
>> Running on Java, using version 1.3.0. Just in case, the sketches are in
>> the linked gist (hex encoded, one per line):
>> https://gist.github.com/mmusnjak/c00a72b3dfbc52e780c2980acfd98351
>> and the exact counts:
>> https://gist.github.com/mmusnjak/dcbff67101be6cfc28ba01e63e41f73c
>>
>> Thank you!
>> Marko Musnjak
>>
>>

Re: HLL Union and lgK config

Posted by leerho <le...@gmail.com>.
Marko,
  We are working to understand this problem.  Thank you for sending us the
actual sketches. That helps us a great deal!

Cheers,

Lee.

On Thu, Aug 13, 2020 at 3:24 PM Jon Malkin <jo...@gmail.com> wrote:

> Hi Marko,
>
> Could you please let us know two more things:
> 1) Which is the one particular sketch that causes the estimate to jump?
> 2) What is the exact unique count of the others without that sketch?
>
> It sort of seems like the first sketch, but it's hard to know for sure
> since we don't know the true leave-one-out exact counts.
>
> Thanks,
>   jon
>
> On Thu, Aug 13, 2020 at 8:41 AM Marko Mušnjak <ma...@gmail.com>
> wrote:
>
>> Hi,
>>
>> Could someone help me understand a behavior I see when trying to union
>> some HLL sketches?
>>
>> I have 14 HLL sketches, and I know the exact unique counts for each of
>> them. All the individual sketches give estimates within 2% of the exact
>> counts.
>>
>> When I try to create a union, using the default lgMaxK parameter results
>> in a total estimate that is way off (25% larger than the exact count).
>>
>> However, reducing the lgMaxK parameter in the union to 4 or 5 gives
>> results that are within 2.5% of the exact counts.
>>
>> Also, one particular sketch seems to cause the final estimate to jump -
>> not adding that sketch to the union keeps the result close to the exact
>> count.
>>
>> Am I just seeing a very bad random error, or is there anything I'm doing
>> wrong with the unions?
>>
>> Running on Java, using version 1.3.0. Just in case, the sketches are in
>> the linked gist (hex encoded, one per line):
>> https://gist.github.com/mmusnjak/c00a72b3dfbc52e780c2980acfd98351
>> and the exact counts:
>> https://gist.github.com/mmusnjak/dcbff67101be6cfc28ba01e63e41f73c
>>
>> Thank you!
>> Marko Musnjak
>>
>>

Re: HLL Union and lgK config

Posted by Jon Malkin <jo...@gmail.com>.
Hi Marko,

Could you please let us know two more things:
1) Which is the one particular sketch that causes the estimate to jump?
2) What is the exact unique count of the others without that sketch?

It sort of seems like the first sketch, but it's hard to know for sure
since we don't know the true leave-one-out exact counts.

Thanks,
  jon
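
The "leave-one-out exact counts" asked for here can be sketched as follows, assuming the raw identifiers for each sketch are still available as sets. The class name, the helper, and the sample data are hypothetical; this uses plain JDK sets rather than sketches, so the counts are exact.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class LeaveOneOutDemo {
    // Small helper to build a set from literal items.
    static Set<String> setOf(String... items) {
        return new HashSet<>(Arrays.asList(items));
    }

    // For each index i, count the distinct items in the union of all sets except sets.get(i).
    static List<Integer> leaveOneOutCounts(List<Set<String>> sets) {
        List<Integer> counts = new ArrayList<>();
        for (int skip = 0; skip < sets.size(); skip++) {
            Set<String> union = new HashSet<>();
            for (int i = 0; i < sets.size(); i++) {
                if (i != skip) {
                    union.addAll(sets.get(i));
                }
            }
            counts.add(union.size());
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical per-day identifier sets.
        List<Set<String>> days = Arrays.asList(
            setOf("a", "b", "c"),
            setOf("b", "c", "d"),
            setOf("d", "e"));
        System.out.println(leaveOneOutCounts(days));
    }
}
```

Comparing each exact leave-one-out count with the estimate from a union built without the same sketch shows which input, if any, is responsible for the jump.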

On Thu, Aug 13, 2020 at 8:41 AM Marko Mušnjak <ma...@gmail.com>
wrote:

> Hi,
>
> Could someone help me understand a behavior I see when trying to union
> some HLL sketches?
>
> I have 14 HLL sketches, and I know the exact unique counts for each of
> them. All the individual sketches give estimates within 2% of the exact
> counts.
>
> When I try to create a union, using the default lgMaxK parameter results
> in a total estimate that is way off (25% larger than the exact count).
>
> However, reducing the lgMaxK parameter in the union to 4 or 5 gives
> results that are within 2.5% of the exact counts.
>
> Also, one particular sketch seems to cause the final estimate to jump -
> not adding that sketch to the union keeps the result close to the exact
> count.
>
> Am I just seeing a very bad random error, or is there anything I'm doing
> wrong with the unions?
>
> Running on Java, using version 1.3.0. Just in case, the sketches are in
> the linked gist (hex encoded, one per line):
> https://gist.github.com/mmusnjak/c00a72b3dfbc52e780c2980acfd98351
> and the exact counts:
> https://gist.github.com/mmusnjak/dcbff67101be6cfc28ba01e63e41f73c
>
> Thank you!
> Marko Musnjak
>
>