Posted to users@datasketches.apache.org by Marko Mušnjak <ma...@gmail.com> on 2020/09/14 07:31:07 UTC

Re: [E] Re: HLL Union and lgK config

Hi,
I just wanted to confirm that simply converting the strings to char[]
worked fine - the sketches from the Hive library merged with the Kafka Streams
sketches now produce correct results.

Thanks again for the help!
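
For readers of the archive, here is a minimal sketch of why the two update
paths discussed in this thread disagree. It uses only the Java standard
library. CRC32 is a stand-in hash chosen for illustration (the library uses a
Murmur3 hash, as noted further down the thread), and the little-endian layout
of the 16-bit code units is an assumption, as are the class and method names.
The point is only that a String encoded as UTF-8 and the same String taken as
a char[] present different bytes to any byte-oriented hash.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class EncodingMismatchDemo {

    // Stand-in hash for illustration only; NOT the hash used by the library.
    public static long hash(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }

    // Bytes a hash sees when the String is first encoded as UTF-8.
    public static byte[] utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Bytes a hash sees when the char[] is read as raw 16-bit code units.
    // Little-endian order is assumed here purely for illustration.
    public static byte[] charArrayBytes(String s) {
        char[] chars = s.toCharArray();
        ByteBuffer buf = ByteBuffer.allocate(chars.length * 2)
                                   .order(ByteOrder.LITTLE_ENDIAN);
        for (char c : chars) {
            buf.putChar(c);
        }
        return buf.array();
    }

    public static void main(String[] args) {
        String id = "a";
        byte[] utf8 = utf8Bytes(id);
        byte[] utf16 = charArrayBytes(id);
        System.out.println("UTF-8 byte count:  " + utf8.length);
        System.out.println("char[] byte count: " + utf16.length);
        System.out.println("same hash? " + (hash(utf8) == hash(utf16)));
    }
}
```

If the same identifier hashes differently in the two pipelines, the sketches
treat it as two distinct items and the union over-counts. The fix reported
above was to make the Kafka Streams side update with a char[] (for example
via String.toCharArray()), matching what the Hive SketchState does.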

On Fri, 14 Aug 2020 at 22:51, Marko Mušnjak <ma...@gmail.com> wrote:

> Hi,
>
> It does seem that the first two days (probably from Spark+Hive UDFs),
> merged by themselves, closely match the exact count of 11034. The other 12
> days (built using Kafka Streams), taken together, also closely match the
> exact count for the period.
>
> That would mean we have our cause here.
>
> Now to discover how strings are represented in Spark's input files and in
> Avro records in Kafka... I see the
> org.apache.datasketches.hive.hll.SketchState::update converts strings to
> char array, while just updating with String in
> org.apache.datasketches.hll.BaseHllSketch::update  first converts to UTF-8
> and hashes the resulting byte array. Converting strings in the Kafka
> Streams app to char[] might be a good first step.
>
> I'll give that a try and report back.
>
> Thanks everyone for your help in finding the source of this!
>
> Kind regards,
> Marko
>
> On Fri, 14 Aug 2020 at 20:58, leerho <le...@gmail.com> wrote:
>
>> Hi Marko,
>>
>> As I stated before, the first 2 sketches are the result of union
>> operations, while the rest are not.  I get the following:
>>
>> All 14 sketches : 34530
>> Without the first day : 27501; your count 24890;  Error = 10.5%   This is
>> already way off. It represents an error of nearly 7 standard deviations,
>> which is huge!
>> Without the first and second day : 22919;  your count 22989; Error =
>> -0.3%   This is well within the error bounds.
>>
>> I get the same results with Library versions 1.2.0 and 1.3.0 and we get
>> the same results with our C++ library.  Also, the C++ library was
>> redesigned from the ground up.  I think it is highly unlikely we would have
>> such a serious bug in all three versions without it being detected
>> elsewhere.
>>
>> I think Alex is on the right track.  If you encode the same input IDs
>> differently in two different environments, they are essentially distinct
>> from each other, causing the unique count to go up.
>>
>> Please let us know what you find out.
>>
>> Cheers,
>>
>> Lee.
>>
>>
>>
>>
>> On Fri, Aug 14, 2020 at 9:45 AM Alexander Saydakov <
>> saydakov@verizonmedia.com> wrote:
>>
>>> Since you are mixing sketches built in different environments, have you
>>> ever tested that the input strings are hashed the same way? There is a
>>> chance that strings might be represented differently in Hive and Spark, and
>>> therefore the resulting sketches might be disjoint while you might believe
>>> that they should represent overlapping sets. The crucial part of these
>>> sketches is the MurmurHash3 hash of the input. If the hashes are different,
>>> the sketches are not compatible. They will represent disjoint sets.
>>> I would suggest trying a simple test: build sketches from a few
>>> predefined strings like "a", "b" and "c" in both systems and check that the
>>> union of those sketches does not grow.
>>>
>>> On Fri, Aug 14, 2020 at 9:13 AM Marko Mušnjak <ma...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> The sketches are string-fed.
>>>>
>>>> Some of the sketches are built using Spark and the Hive functions from
>>>> the datasketches library, while others are built using a Kafka Streams job.
>>>> It's quite likely the covered period contains some sketches built by Spark
>>>> and some by the streaming job, but I can't tell where the exact cutoff was.
>>>> The Spark job is using org.apache.datasketches.hive.hll.DataToSketchUDAF
>>>> The streaming job is building the sketches through Union objects
>>>> (receives a stream of sketches, makes unions out of individual pairs,
>>>> forwards the result as sketch).
>>>>
>>>> After some adjustments to the queries I'm running to get the exact
>>>> counts, to take care of local times, etc., these should be the correct
>>>> values with excluded days:
>>>> Without first day: 24890
>>>> Without first and second day: 22989
>>>>
>>>> Thanks,
>>>> Marko
>>>>
>>>>
>>>> On Fri, 14 Aug 2020 at 17:08, leerho <le...@gmail.com> wrote:
>>>>
>>>>> Hi Marko,
>>>>> I notice that the first two sketches are the result of union
>>>>> operations, while the remaining sketches are pure streaming sketches.
>>>>> Could you perform Jon's request again except excluding the first two
>>>>> sketches?
>>>>>
>>>>> Just to cover the bases, could you explain the types of the
>>>>> data items that are being fed to the sketches?  Are your identifiers
>>>>> strings, longs or what?
>>>>>
>>>>> Thanks,
>>>>> Lee.
>>>>>
>>>>> On Thu, Aug 13, 2020 at 11:57 PM Jon Malkin <jo...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Thanks! We're investigating. We'll let you know if we have further
>>>>>> questions.
>>>>>>
>>>>>>   jon
>>>>>>
>>>>>> On Thu, Aug 13, 2020, 11:40 PM Marko Mušnjak <ma...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Jon,
>>>>>>> The first sketch is the one where I see the jump. The exact count
>>>>>>> without the first sketch is 24765.
>>>>>>>
>>>>>>> The result for lgK=12 without the first sketch is 11% off, lgK=5 is
>>>>>>> within 2%.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Marko
>>>>>>>
>>>>>>> On Fri, 14 Aug 2020 at 00:24, Jon Malkin <jo...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Marko,
>>>>>>>>
>>>>>>>> Could you please let us know two more things:
>>>>>>>> 1) Which is the one particular sketch that causes the estimate to
>>>>>>>> jump?
>>>>>>>> 2) What is the exact unique count of the others without that sketch?
>>>>>>>>
>>>>>>>> It sort of seems like the first sketch, but it's hard to know for
>>>>>>>> sure since we don't know the true leave-one-out exact counts.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>   jon
>>>>>>>>
>>>>>>>> On Thu, Aug 13, 2020 at 8:41 AM Marko Mušnjak <
>>>>>>>> marko.musnjak@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> Could someone help me understand a behavior I see when trying to
>>>>>>>>> union some HLL sketches?
>>>>>>>>>
>>>>>>>>> I have 14 HLL sketches, and I know the exact unique counts for
>>>>>>>>> each of them. All the individual sketches give estimates within 2% of the
>>>>>>>>> exact counts.
>>>>>>>>>
>>>>>>>>> When I try to create a union, using the default lgMaxK parameter
>>>>>>>>> results in a total estimate that is way off (25% larger than the exact count).
>>>>>>>>>
>>>>>>>>> However, reducing the lgMaxK parameter in the union to 4 or 5
>>>>>>>>> gives results that are within 2.5% of the exact counts.
>>>>>>>>>
>>>>>>>>> Also, one particular sketch seems to cause the final estimate to
>>>>>>>>> jump - not adding that sketch to the union keeps the result close to the
>>>>>>>>> exact count.
>>>>>>>>>
>>>>>>>>> Am I just seeing a very bad random error, or is there anything I'm
>>>>>>>>> doing wrong with the unions?
>>>>>>>>>
>>>>>>>>> Running on Java, using version 1.3.0. Just in case, the sketches
>>>>>>>>> are in the linked gist (hex encoded, one per line):
>>>>>>>>> https://gist.github.com/mmusnjak/c00a72b3dfbc52e780c2980acfd98351
>>>>>>>>> and the exact counts:
>>>>>>>>> https://gist.github.com/mmusnjak/dcbff67101be6cfc28ba01e63e41f73c
>>>>>>>>>
>>>>>>>>> Thank you!
>>>>>>>>> Marko Musnjak
>>>>>>>>>
>>>>>>>>>

Re: [E] Re: HLL Union and lgK config

Posted by leerho <le...@gmail.com>.
Thanks for the update!
