Posted to dev@datasketches.apache.org by leerho <le...@gmail.com> on 2020/09/09 00:19:11 UTC

Re: Java code for Relative Error Quantiles Sketch

I am making good progress on the new Relative Error Quantiles
<https://github.com/apache/incubator-datasketches-java/tree/RelativeErrorQuantiles>
sketch.  This is in the "RelativeErrorQuantiles" branch, not "master". It
is not ready to be checked in yet, but it is getting close.  It would be
wonderful if someone could give it a whirl and/or take a look at the code.
:)

Here is what is implemented so far:

   - At sketch construction, it can be configured to prioritize accuracy
   for either the high ranks or the low ranks.
   - The default item comparison criterion is "<", which is consistent with
   our other quantile sketches. However, the user can change the criterion to
   "<=" by simply calling *setLtEq(true)*.  This will allow comparison with
   other quantile functions that may use that criterion.  (A small
   illustrative sketch follows this list.)
   - I implemented extensive debug and "pretty printing" functions, which
   not only are useful for debugging, but should help anyone visually see what
   is going on internally.  (If anyone is interested in seeing this in action,
   let me know, and I'll send out some instructions  :)   ).  I developed a
   custom "signaling" interface to minimize clutter in the main code yet allow
   for deep analysis of what the code is doing.
   - I custom crafted two versions of an *O(n)* merge sort depending on the
   setting of highRankAccuracy (true/false).  Merge sorting is used wherever
   possible.
   - I also custom crafted a binary search algorithm to automatically
   accommodate the two comparison criteria.
   - I think most of the javadocs are complete, but I'm sure there are
   always ways to improve the documentation.
   - All the license headers are in place.
   - Unit test coverage is hovering around 98% (according to my Clover).
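
Here is a tiny sketch of the comparison-criterion bullet above.  The class
and method names are invented for illustration; they are not the API of the
RelativeErrorQuantiles branch:

public final class RankCriterionDemo {

  // Returns the normalized rank in [0,1] of 'value' within 'sorted', counting
  // the items that satisfy the chosen criterion ("<" vs "<=").
  static double rank(final float[] sorted, final float value, final boolean ltEq) {
    int count = 0;
    for (final float item : sorted) {
      if (ltEq ? (item <= value) : (item < value)) { count++; }
    }
    return (double) count / sorted.length;
  }

  public static void main(final String[] args) {
    final float[] data = {1, 2, 2, 2, 3};
    System.out.println(rank(data, 2, false)); // "<"  criterion: 0.2
    System.out.println(rank(data, 2, true));  // "<=" criterion: 0.8
  }
}

With duplicate values the two criteria can return very different ranks, which
is why the criterion must match whatever quantile function you are comparing
against.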

What remains to be completed:

   - The design of the binary storage format
   - Serialization / Deserialization.
   - Comprehensive characterization of accuracy, speed, and stored size as
   functions of *k* and *N*.
   - I could really use some help with code reviews and feedback!  :) :)


If anyone needs some help in getting it booted up so you can play with it,
let me know.  Sorry, this is not Python; it does take a few more steps :)

Also, if any of you are not signed up for our *dev@* list just send a blank
email to dev-subscribe@datasketches.apache.org.  The traffic is really low
so it won't swamp your inbox!

Cheers,

Lee.

On Mon, Sep 7, 2020 at 5:57 AM Pavel Vesely <Pa...@warwick.ac.uk>
wrote:

> Hi Lee,
>
> okay, let's talk on Thursday. I'll look at your code by then.
>
> Cheers,
> Pavel
>
> On 07.09.2020 at 3:40 leerho wrote:
> > Yes, let's meet on Thursday 9:00AM PDT (UTC -7).
> >
> > 1. I am uncomfortable with changing the getMaximumRSE(k) method by a
> factor of
> > 8 (you suggested 1 or 0.9) without some rigorous testing or a proof.   (
> We
> > would lose our ability to say that our bounds are mathematically proven
> :) )
> > We should discuss this.
> >
> > 2. Just as a comment, we always return ranks as normalized ranks [0,1]
> not as
> > counts.  Not a big deal, but when you look at my code you will need to
> know
> > this  :)
> >
> > Please take a look at my code at:
> >
> https://github.com/apache/incubator-datasketches-java/tree/RelativeErrorQuantiles
> >
> > Cheers,
> >
> > Lee.
> >
> >
> >
> > On Thu, Sep 3, 2020 at 7:40 AM Vesely, Pavel <Pavel.Vesely@warwick.ac.uk> wrote:
> >
> >     Hi Lee,
> >
> >     I finally got back to you. Regarding your summary below, it's right
> that
> >     we don't have a mathematical analysis supporting Strategy 4, and
> there
> >     might be none (although constructing a specific example might be
> tricky).
> >
> >     I've added the remaining functions that you need, namely:
> >
> >     - getMaximumRSE(k): Returns an a priori estimate of relative standard
> >     error (RSE, expressed as a number in [0,1]), calculated as sqrt(Var)
> /
> >     rank (note that neither n nor rank is needed). I'm using a formula
> for
> >     Var from the paper, but it seems too pessimistic (by a factor of 3),
> >     according to some experiments that I've done.
> >
> >     - getUpperBound(numStdDev) and getLowerBound(numStdDev) that provide
> a
> >     confidence interval by adding/subtracting a specified multiple of
> stdDev
> >     from the estimate
> >
> >     Let me know if you have any questions. I can also join a call next
> week
> >     (Wednesday or Thursday would be the best).
> >
> >     Cheers,
> >     Pavel
> >
> >
> >     On 26.08.2020 at 21:18 leerho wrote:
> >>     Hi Pavel,
> >>     Thanks for your analysis.  My summary of your analysis:
> >>
> >>       * Strategy 1: This would work but performs too many compactions.
> It has
> >>         the simplest prediction of sketch size.  The math obviously
> works so
> >>         a priori prediction of sketch error should be relatively
> straightforward.
> >>       * Strategy 2: The math works and has far fewer compactions than
> >>         Strategy 1. You didn't comment on the difficulty of a priori
> >>         prediction of sketch size.  Because it is anticipated in the
> paper,
> >>         the a priori prediction of sketch error should also be
> straightforward.
> >>       * Strategy 3:  Comment: your restatement of Region 1 is
> equivalent to
> >>         what I called Region 1 and Region 2. I only separated out
> Region 2 to
> >>         emphasize that that is the size of the overflow. Major
> >>         disadvantages are: the math analysis doesn't support this
> strategy,
> >>         it would have as many compactions as Strategy 1, and may be
> >>         vulnerable to pathological sequences.  So Strategy 3 is
> definitely out!
> >>       * Strategy 4: (If I understood your comments...) the math does not
> >>         support this strategy.  So let's leave this one out as well.
> >>
> >>     That leaves only Strategy 2, which I will modify the Java code to
> >>     implement.  This certainly simplifies testing, which I am all in
> >>     favor of!
> >>
> >>     Lee.
> >>
> >>
> >>     On Wed, Aug 26, 2020 at 9:22 AM Pavel Vesely <Pavel.Vesely@warwick.ac.uk> wrote:
> >>
> >>         Hi Lee!
> >>
> >>         Thanks for sending the links. I've done some renaming according
> to your
> >>         suggestions and added two new functions to my code:
> >>
> >>         - quantile(q): given quantile query q in [0,1], it returns an
> approx.
> >>         q-quantile
> >>
> >>         - getMaxStoredItems(k, n): this is a static method that
> estimates the
> >>         (max.
> >>         nominal) size of the sketch with parameter k on input of size
> n. It
> >>         seems to be
> >>         correct up to a 10% error (at least, when the input size and k
> are
> >>         sufficiently
> >>         large, such as k >= 16). I believe a more precise function
> would need to
> >>         simulate constructing the whole sketch ...
> >>
> >>         Regarding the strategies, I have a few comments (some of them I
> said
> >>         during the
> >>         call):
> >>
> >>         *Strategy 1:* Indeed, this strategy performs unnecessarily many
> >>         compaction
> >>         operations. The only advantage is an easy predictability of the
> final
> >>         size of
> >>         the sketch, i.e., possibly a more precise getMaxStoredItems
> function.
> >>
> >>         *Strategy 2:* its advantage is that our mathematical analysis
> works
> >>         for sure
> >>         (strictly speaking, the proof works for Strategy 1, but there
> are no
> >>         issues
> >>         introduced by laziness of Strategy 2, and actually, I think the
> more
> >>         general
> >>         analysis of Appendix D applies to Strategy 2 directly).
> >>
> >>         *Strategy 3:* I remember it slightly differently:
> >>
> >>         - Region 1 has Length - numSections * sectionSize items and is
> never
> >>         compacted;
> >>         it possibly contains some part of the "overflow" part (above
> nomCap)
> >>         - Region 2 has numSections * sectionSize largest items, split
> into
> >>         sections of
> >>         size sectionSize, and we choose some number of sections to
> compact
> >>
> >>         This has the disadvantage of compacting just a few items when
> >>         sectionSize is
> >>         small, leading to a larger number of compactions (similarly as
> for
> >>         Strategy 1).
> >>         Furthermore, our mathematical analysis does not go through
> directly
> >>         and there
> >>         may be some (contrived) adversarial instances on which it has a
> large
> >>         error.
> >>
> >>         The advantage is that we protect (i.e., not compact) more items
> than
> >>         in the
> >>         previous strategies when the buffer overflow is large.
> >>
> >>         *Strategy 4:* I think Region 1 has size Length / 2, instead of
> nomCap
> >>         / 2. The
> >>         advantage is that when the buffer overflow is large we compact
> more
> >>         than in
> >>         Strategy 3 (although not as many items as in Strategy 2), while
> we
> >>         also protect
> >>         more items than in Strategy 2.
> >>
> >>         The disadvantage is that getMaxStoredItems function would be
> least
> >>         precise
> >>         because sectionSize varies over time (at least, that's my
> intuition).
> >>         As for
> >>         Strategy 3, our mathematical analysis does not go through
> directly
> >>         and there may
> >>         be some adversarial instances on which it has a large error,
> perhaps
> >>         even more
> >>         contrived than in the previous case.
> >>
> >>         So far, I have no clear recommendation, and haven't thought
> much about
> >>         calculating the confidence bounds (will get to it next week). I
> would
> >>         only
> >>         exclude Strategy 1 and Strategy 3.
> >>
> >>         Best,
> >>         Pavel
> >>
> >>         On 25.08.2020 at 23:40 leerho wrote:
> >>         > I forgot to add the links:
> >>         >
> >>         > Main Code:
> >>         >
> >>
> https://github.com/apache/incubator-datasketches-java/tree/RelativeErrorQuantiles/src/main/java/org/apache/datasketches/req
> >>         >
> >>         > Test Code:
> >>         >
> >>
> https://github.com/apache/incubator-datasketches-java/tree/RelativeErrorQuantiles/src/test/java/org/apache/datasketches/req
> >>         >
> >>         > On Tue, Aug 25, 2020 at 2:10 PM leerho <leerho@gmail.com> wrote:
> >>         >
> >>         >     Folks,
> >>         >
> >>         >     Here are the links to the Java main code and the Java
> test code
> >>         for the
> >>         >     new REQ sketch.  This code is in very initial state and
> is in
> >>         the branch
> >>         >     "RelativeErrorQuantiles".  It has a lot of work left
> before we
> >>         merge it
> >>         >     into master, so recognize it will be changing a lot.
> >>         >
> >>         >     Pavel and I had a long discussion about what the best ( &
> most
> >>         practical)
> >>         >     compaction strategy would be. Whatever strategy we
> choose, we
> >>         need to be
> >>         >     confident that we have mathematics that will support the
> >>         implementation
> >>         >     and that we will be able to obtain reasonable confidence
> bounds
> >>         on the
> >>         >     estimates produced by the sketch.  All of these strategies
> >>         assume that the
> >>         >     low ranks are the most accurate. (This of course can be
> changed.)
> >>         >
> >>         >     *Strategy 1.* My understanding of the paper is that it
> proposes
> >>         this
> >>         >     strategy:
> >>         >     The compactor has a maximum size given by a Capacity = 2 *
> >>         numSections *
> >>         >     sectionSize.  As soon as the compactor reaches capacity,
> the
> >>         compaction
> >>         >     process starts.
> >>         >
> >>         >       * Region1: The bottom half, of size Capacity / 2, is
> >>         never compacted.
> >>         >       * Region2: The top half consists of /m/ sections of
> size /k/
> >>         (these can
> >>         >         change as a f(numCompactions) but ignore for this
> description)
> >>         >           o The top-most of the m sections is always
> compacted,
> >>         leaving m-1
> >>         >             sections to be potentially compacted.
> >>         >           o Of the m-1 sections, choose the top m' of these
> based
> >>         on the
> >>         >             distribution created from the number of trailing
> ones
> >>         in the
> >>         >             number of compactions that have already occurred
> in
> >>         this compactor.
> >>         >
> >>         >     This will create the largest number of compactions and the
> most
> >>         levels per
> >>         >     value of N.
> >>         >
> >>         >     *Strategy 2.* This strategy is what Pavel suggested in his
> >>         Python code:
> >>         >     The compactor has a Nominal Capacity and an actual Length
> that
> >>         can exceed
> >>         >     the nomCap. The nomCap is defined as above, where nomCap
> = 2 *
> >>         numSections
> >>         >     * sectionSize for each compactor. The sketch has a
> maxSize =
> >>         >     sum(compactor[i].nomCap() ).  When this maxSize is
> exceeded the
> >>         >     compression process starts which chooses the lowest
> numbered
> >>         compactor
> >>         >     that exceeds its nomCap to compact.  If this turns out to
> be
> >>         the top
> >>         >     compactor, then a new compactor is added at the top.
> >>         >
> >>         >     By allowing a compactor to exceed its nomCap, we keep more
> >>         values at a
> >>         >     lower level improving accuracy.  But the issue is what to
> do
> >>         with the
> >>         >     "overflow" items during compaction.  This is addressed
> >>         differently by this
> >>         >     strategy and the next two.
> >>         >
> >>         >       * This compactor has 3 regions:
> >>         >           o Region1: of size nomCap / 2 is never compacted.
> >>         >           o Region2: of size numSections * sectionSize
> >>         >           o Region3: of size Length - (Region1 + Region2);  #
> this
> >>         is the
> >>         >             "overflow" and is always compacted.
> >>         >       * In this strategy all of Region3 plus the randomly
> chosen
> >>         top m'
> >>         >         sections of Region 2 will be compacted.
> >>         >
> >>         >
> >>         >     *Strategy 3.*  This strategy is what I have currently
> >>         implemented in Java
> >>         >     (and my version of Pavel's Python) (because I didn't fully
> >>         understand what
> >>         >     Pavel had in mind :) ):
> >>         >     This strategy is similar to Strategy 2, and also allows
> for
> >>         overflow. The
> >>         >     way the regions are treated is different:
> >>         >
> >>         >       * This compactor also has 3 regions:
> >>         >           o Region 1 of size nomCap / 2 is never compacted
> >>         >           o Region 2 is the overflow region of size Length -
> >>         nomCap. It is
> >>         >             also not compacted.
> >>         >           o Region 3 of size numSections * sectionSize.  The
> random m'
> >>         >             sections will be compacted based on the
> >>         f(numCompactions) as above.
> >>         >
> >>         >     This keeps more values behind and would maintain better
> >>         accuracy longer,
> >>         >     but perhaps causes more compactions than #2.
> >>         >
> >>         >     *Strategy 4.*  This strategy is similar to #2 and #3, but
> again
> >>         different
> >>         >     in how the overflow values are treated:
> >>         >
> >>         >       * This compactor has 2 regions:
> >>         >           o Region 1 of size nomCap / 2 is never compacted
> >>         >           o Region 2 of size numSections * sectionSize',
> where the
> >>         >             sectionSize' is dynamically adjusted so that
> numSections *
> >>         >             sectionSize' consumes most of Length - Region1.
> The
> >>         number m' is
> >>         >             chosen as before to determine what sections to
> compact.
> >>         >
> >>         >     All of these strategies are implementable and have
> different
> >>         tradeoffs
> >>         >     with respect to accuracy, size and speed.
> >>         >
> >>         >     *Questions:*
> >>         >
> >>         >     1. Which of these strategies can provide:
> >>         >
> >>         >       * An a priori calculation of the sketch size as a f(k,
> N).
> >>         >       * An a priori calculation of estimation error with
> confidence
> >>         intervals.
> >>         >       * An a posteriori calculation of confidence intervals
> based
> >>         on the state
> >>         >         of the sketch.
> >>         >       * A defense that claims that the sketch implementation
> is
> >>         backed by
> >>         >         mathematical proofs of correct operation with provable
> >>         error properties.
> >>         >
> >>         >     2.  What is your recommendation?
> >>         >
> >>         >     Note: I may implement all 4 strategies, so that we can
> fully
> >>         characterize
> >>         >     them, but I would prefer if we could limit these options
> to one
> >>         (or two)
> >>         >     to put into production.  :)
> >>         >
> >>         >     Your help with these questions would be appreciated.
> >>         >
> >>         >     Lee.
> >>         >
> >>
>
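
A minimal sketch of the Strategy 2 mechanics quoted above, for readers who
prefer code.  All names here (Compactor, nomCap, trailingOnes) are invented
for illustration and are not the classes in the RelativeErrorQuantiles
branch; the compact() step itself is elided:

import java.util.ArrayList;
import java.util.List;

final class Strategy2Sketch {
  static final class Compactor {
    int numSections = 3;
    int sectionSize;                       // the dynamic "k" for this level
    final List<Float> buf = new ArrayList<>();
    Compactor(final int k) { sectionSize = k; }
    int nomCap() { return 2 * numSections * sectionSize; }
  }

  final List<Compactor> levels = new ArrayList<>();
  long numCompactions = 0;

  // Strategy 2 compacts lazily: only when the total retained items exceed
  // the sum of nominal capacities, pick the lowest level over its own nomCap.
  void maybeCompress() {
    int retained = 0;
    int maxSize = 0;
    for (final Compactor c : levels) {
      retained += c.buf.size();
      maxSize += c.nomCap();
    }
    if (retained <= maxSize) { return; }   // overflow is allowed until now
    for (int lvl = 0; lvl < levels.size(); lvl++) {
      final Compactor c = levels.get(lvl);
      if (c.buf.size() > c.nomCap()) {
        if (lvl == levels.size() - 1) {    // compacting the top: grow first
          levels.add(new Compactor(c.sectionSize));
        }
        // A real compact() would merge-sort c.buf, keep the protected
        // Region1, choose the top m' sections using trailingOnes(...), and
        // promote every other item of the compacted region to level lvl+1.
        numCompactions++;
        return;
      }
    }
  }

  // The schedule for how many sections to compact is driven by the number
  // of trailing one-bits in the count of prior compactions.
  static int trailingOnes(final long n) {
    return Long.numberOfTrailingZeros(~n);
  }
}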

Re: Java code for Relative Error Quantiles Sketch

Posted by "Vesely, Pavel" <Pa...@warwick.ac.uk>.
Well, in the graph you sent, the lines of LB and UB are basically bounds on r[i] - tr[i], whereas the flat curves are relative to the rank, i.e., (r[i] - tr[i]) / tr[i]. Then it's actually no surprise that their behaviors don't match.

As for the case of small K: I'm aware that the getMaxRSE function returns a value bigger than 1 in that case. This is because the formula for std. deviation is taken from the paper and has a constant-factor slack, especially for small K. I'd suggest calculating MaxRSE based on some experiments, say, for K <= 10 (not sure if we can provide a better math formula for that case, but I'll think about it).
Cheers,
Pavel

On 2020-09-13 22:19, leerho wrote:
Still, the calculated a priori error bounds for the LRA case look strange.


On Sun, Sep 13, 2020 at 1:09 PM leerho <le...@gmail.com> wrote:
Pavel,
Yes, I was doing something wrong! Sorry! I was confusing the predicted bounds on a rank vs the error shown in the graphs, where the absolute rank is subtracted out, showing just the difference.

Nonetheless, we do need functions that do not go negative for small K.

Lee.

On Sun, Sep 13, 2020 at 12:50 PM leerho <le...@gmail.com> wrote:
Pavel,
I have looked very carefully at your p-code for getRankUB and getRankLB, and they both produce nonsensical values.

  *   The getMaximumRSE(k) with K=4 (the smallest legal value of K), multiplied by 3 SDs, gives 1.23, which is > 1.  It should never produce values > 1 for any legal value of K; this will cause the LB to go negative.
  *   Given a K and SD, the function (1 +/- SD * RSE) is a constant.  Multiplied by the rank, this produces linearly increasing bounds from the lowest rank to the highest, which does not model the error properties of the HRA case (decreasing error for the high ranks) or the LRA case (error that is pretty flat over most of the range).
  *   We need equations that not only predict the actual error behavior but also, for the lower bound, never produce values < 0 for any legal value of K, which these equations will.
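
For concreteness, the rule being critiqued amounts to the following (a hedged
sketch with invented names, not code from the branch; the clipping into [0,1]
is one way to keep the LB non-negative):

final class RankBounds {
  // LB/UB = estRank * (1 -/+ numStdDev * maxRSE), clipped into [0,1].
  static double rankLB(final double estRank, final double maxRSE, final int numStdDev) {
    return Math.max(0.0, estRank * (1.0 - (numStdDev * maxRSE)));
  }
  static double rankUB(final double estRank, final double maxRSE, final int numStdDev) {
    return Math.min(1.0, estRank * (1.0 + (numStdDev * maxRSE)));
  }
}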

Comments:

  *   This is not a bug, but a user design principle.  I don't think it is a good idea to have the user specify a value to obtain the rank error.  We don't want to give the user the idea that the error bounds are related to values at all, only ranks.  I realize we can translate the given value to an estimated rank, but let's have the user do that and document to the user the caveats about doing that.
  *   We have been training our users to become accustomed to specifying and obtaining ranks as normalized [0,1]. That is how our other quantile sketches work.  Most users think in terms of percentiles anyway so normalized ranks are pretty natural.

Perhaps I am still doing something wrong?

Cheers,

Lee.

On Sat, Sep 12, 2020 at 11:26 AM leerho <le...@gmail.com> wrote:
Hi Pavel,
I think we are in agreement.  I take it as a positive that you haven't found any flaw in the implementation of the relative error quantiles algorithm.  What we are discussing now is how to define rank, especially with respect to HRA and LRA, and, philosophically, what kind of error distribution as a function of rank users will want and find easiest to specify and explain.

My intuition is that all plots should look roughly the same, up to mirroring
along the Y axis.

This statement is a little puzzling, as it will only be true if we choose the definition of rank appropriately when the user selects LRA or HRA.  As my plots reveal, if we keep the definition of rank the same, switching from HRA to LRA produces a dramatically different error distribution as a f(rank).  I agree with your formulas for relative error, except that all of our ranks are already normalized by SL, so I would replace SL in your formulas by 1.0.

I still want to add to these plots your a priori calculation of error to see where it lies compared to the measured error.

I gather from your comments that what you had in mind when writing the paper was a rank error distribution that looks like plots 3 and 4 above, not plots 1 and 2.  However, I feel that the rank error distribution shown in plots 1 and 2 would be easier for users to understand.  We will probably leave the decision up to the user as to what they prefer; however, this will require more documentation to explain all the different error behaviors  :( .

Lee.

Re: Java code for Relative Error Quantiles Sketch

Posted by leerho <le...@gmail.com>.
Still, the calculated a priori error bounds for the LRA case look
strange.


On Sun, Sep 13, 2020 at 1:09 PM leerho <le...@gmail.com> wrote:

> Pavel,
> Yes, I was doing something wrong! Sorry! I was confusing the predicted
> bounds on a rank vs the error shown in the graphs, where the absolute rank
> is subtracted out, showing just the difference.
>
> Nonetheless, we do need functions that do not go negative for small K.
>
> Lee.
>
> On Sun, Sep 13, 2020 at 12:50 PM leerho <le...@gmail.com> wrote:
>
>> Pavel,
>> I have looked very carefully at your p-code for getRankUB and getRankLB,
>> and they both produce nonsensical values.
>>
>>    - The getMaximumRSE(k) with K=4 (the smallest legal value of K),
>>    multiplied by 3 SDs, gives 1.23, which is > 1.  It should never produce
>>    values > 1 for any legal value of K; this will cause the LB to go negative.
>>    - Given a K and SD, the function (1 +/- SD * RSE) is a constant.
>>    Multiplied by the rank, this produces linearly increasing bounds from
>>    the lowest rank to the highest, which does not model the error
>>    properties of the HRA case (decreasing error for the high ranks) or the
>>    LRA case (error that is pretty flat over most of the range).
>>    - We need equations that not only predict the actual error behavior
>>    but also, for the lower bound, never produce values < 0 for any legal
>>    value of K, which these equations will.
>>
>> Comments:
>>
>>    - This is not a bug, but a user design principle.  I don't think it
>>    is a good idea to have the user specify a value to obtain the rank error.
>>    We don't want to give the user the idea that the error bounds are related
>>    to values at all, only ranks.  I realize we can translate the given value
>>    to an estimated rank, but let's have the user do that and document to the
>>    user the caveats about doing that.
>>    - We have been training our users to become accustomed to specifying
>>    and obtaining ranks as normalized [0,1]. That is how our other quantile
>>    sketches work.  Most users think in terms of percentiles anyway so
>>    normalized ranks are pretty natural.
>>
>> Perhaps I am still doing something wrong?
>>
>> Cheers,
>>
>> Lee.
>>
>> On Sat, Sep 12, 2020 at 11:26 AM leerho <le...@gmail.com> wrote:
>>
>>> Hi Pavel,
>>> I think we are in agreement.  I take it as a positive that you haven't
>>> found any flaw in the implementation of the relative error quantiles
>>> algorithm.  What we are discussing now is how to define rank, especially
>>> with respect to HRA and LRA, and, philosophically, what kind of error
>>> distribution as a function of rank users will want and find easiest to
>>> specify and explain.
>>>
>>> My intuition is that all plots should look roughly the same, up to
>>>> mirroring
>>>> along the Y axis.
>>>
>>>
>>> This statement is a little puzzling, as it will only be true if we choose
>>> the definition of rank appropriately when the user selects LRA or HRA.  As
>>> my plots reveal, if we keep the definition of rank the same, switching from
>>> HRA to LRA produces a dramatically different error distribution as a
>>> f(rank).  I agree with your formulas for relative error, except that all of
>>> our ranks are already normalized by SL, so I would replace SL in your
>>> formulas by 1.0.
>>>
>>> I still want to add to these plots your a priori calculation of error to
>>> see where it lies compared to the measured error.
>>>
>>> I gather from your comments that what you had in mind when writing the
>>> paper was a rank error distribution that looks like plots 3 and 4 above,
>>> not plots 1 and 2.  However, I feel that the rank error distribution shown
>>> in plots 1 and 2 would be easier for users to understand.  We will probably
>>> leave the decision up to the user as to what they prefer; however, this
>>> will require more documentation to explain all the different error
>>> behaviors  :( .
>>>
>>> Lee.
>>>

Re: Java code for Relative Error Quantiles Sketch

Posted by "Vesely, Pavel" <Pa...@warwick.ac.uk>.
Hi all!

FYI, I'm sending some statistics on the size of our sketch RE-KLL (produced by my Python implementation), depending on the stream length N and accuracy parameter k. More precisely, I'm plotting the maximal nominal capacity (MaxNomSize) of the sketch, which is very close to the true number of stored items (up to O(k) items). I've tried all N = 10^L for L = 3, 4, ..., 11.

Even though the size seems to be linear in log(N) at first sight, when you look at the derivative, it's actually slowly increasing as N grows. I think this confirms the log^{1.5}(N) in the space bound in the paper. However, it seems strange that the derivative is not monotone.
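
As a quick sanity check of that reading (assuming the paper's bound): if the size grows as S(N) = c * log^{1.5}(N), then dS/d(log N) = 1.5 * c * log^{0.5}(N), i.e., the slope against log(N) should itself grow like sqrt(log(N)), slowly increasing, exactly as observed.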

(The test for N = 10^11 and k = 25 is still running and should finish in a few days. I can run tests for N = 10^12 if you'd like; the results will be produced by Christmas or so :-) Perhaps better to use the Java implementation then.)

Cheers,
Pavel

Re: Java code for Relative Error Quantiles Sketch

Posted by leerho <le...@gmail.com>.
This is all good feedback.  Thank you all.  The plots that I have sent out
were just my first attempt at characterization; they are by no means cast
in stone!  I will definitely try Pavel's redefinition of error as well as
increasing the density of plot points in the tail.

Lee



On Mon, Sep 14, 2020 at 7:55 AM Vesely, Pavel <Pa...@warwick.ac.uk>
wrote:

> Hi Justin,
> regarding K and epsilon: K is actually an initial setting of section size
> and it should be equal to Theta(1 / epsilon), regardless of the stream
> length SL. This is because SL is not known in advance and we maintain the
> correct section size dynamically. That is, each compactor starts assuming
> that SL is small (i.e. SL = O(K)), so it has just 3 sections of size K, and
> every time the number of sections needs to be increased because (# of
> compactions) > 2^(# of sections), we decrease K by a factor of sqrt(2) for
> that compactor and double the number of sections.
>
> So, it seems to me that in the implementation, the relation between K and
> \varepsilon is quite clear, up to a constant factor :-) The relation of K
> and epsilon is only complicated in the paper, where we involve SL in the
> definition of K.
>
> Pavel
>
> PS: I agree that we should have more samples of the error distribution
> towards the tail that we care about
>
> On 2020-09-14 14:32, Justin Thaler wrote:
>
> > that's fine, but 64 points over the entire range means we haven't
> actually confirmed that it's working properly as you approach the tail.
> It's not sampled well enough at the tails to observe that behavior.
>
> I'll second that the primary motivation behind studying relative-error
> quantiles is to get really good accuracy in the tail. So while we should,
> of course, carefully characterize the error behavior over the whole range
> of ranks, a key component of the characterization should be to understand
> in practice how the error is behaving in the tail. This may require
> significantly more fine-grained characterization testing in the tail than
> we are used to.
>
> In particular, a lot of the ongoing discussion may just stem from the fact
> that the quantiles being reported are just the integer multiples of 1/64.
> My initial reaction is that the benefit of the relative-error sketch will
> be more apparent if the characterization tests cut up the interval [0,
> 1/64] much more finely.
>
> For example, the theory guarantees that for Low-Rank-Accuracy, items of
> (absolute) rank 0, 1, ..., 1/\varepsilon have exact counts, items of rank
> (1/\varepsilon)+1, (1/\varepsilon)+2, ..., 2/\varepsilon have absolute error
> at most 1, and so on.
>
> In terms of relative ranks (normalized by stream length SL), each interval
> of 1/\varepsilon items referenced above has length just (1/\varepsilon)/SL,
> i.e., we only guarantee exact counts for items of relative rank [0,
> (1/\varepsilon)/SL].
>
> If \varepsilon is, say, 0.01, and SL is 2^{17}, then we are talking about
> intervals of length less than 0.001. This means that I am not surprised to
> see that we do not get _exact_ counts out of the sketch in practice even in
> the tail if we are effectively defining the tail to be all items of
> relative rank [0, 1/64]~=[0, 0.015].
>
> It's a bit tricky to try to figure out what setting of \varepsilon is
> (implicitly) being used in the current characterization code since the
> parameter provided is K as opposed to \varepsilon.
>
> I think K here means each buffer of size ~2B is viewed by the algorithm as
> consisting of 2B/K blocks of length K. In which case, if I recall
> correctly, the paper wants to set K to
> Theta((1/\varepsilon)/\sqrt{log_2(\varepsilon *SL)}) (with complications
> arising because SL is not known in advance, and with sizeable constants
> hidden by the Big-Theta-notation), see Equation 16 of
> https://arxiv.org/pdf/2004.01668.pdf. So in principle, once you tell me
> SL, I can probably derive the value of \varepsilon corresponding to any
> value of K, but this is complicated in practice since the expression
> defining K in terms of \varepsilon and SL is complicated.
>
> This is not any sort of criticism of the information that Lee has provided
> :-). It is just an inherently annoying thing to relate K to the \varepsilon
> parameter appearing in the theoretical analysis of the algorithm.
>
>
> On Mon, Sep 14, 2020 at 3:19 AM Jon Malkin <jo...@gmail.com> wrote:
>
>> If you want to also test the whole range, that's fine, but 64 points over
>> the entire range means we haven't actually confirmed that it's working
>> properly as you approach the tail. It's not sampled well enough at the
>> tails to observe that behavior.
>>
>> Ok, first off, this is what the original quantiles sketch does to solve
>> equations going above 1.0/below 0.0:
>>
>> https://github.com/apache/incubator-datasketches-java/blob/master/src/main/java/org/apache/datasketches/quantiles/DoublesSketch.java#L240
>>
>> https://github.com/apache/incubator-datasketches-java/blob/master/src/main/java/org/apache/datasketches/quantiles/DoublesSketch.java#L251
>> In other words, we just clip the result to stay within the known range.
>> Because the raw formula doesn't care about such limits and will happily
>> return something outside that range otherwise.
>>
>> And our existing bounds on the classic quantiles sketch are too
>> pessimistic most of the time. I mean take this simple code example:
>>
>> public void rankErrorTest() {
>>   int n = 1 << 20;
>>   int k = 1 << 5;
>>
>>   UpdateDoublesSketch sketch = DoublesSketch.builder().setK(k).build();
>>   for (int i = 0; i < n-1; i++)
>>     sketch.update(rand.nextGaussian());
>>   System.err.println("Retained: " + sketch.getRetainedItems() + "\tNormRakErr: " + sketch.getNormalizedRankError(false));
>>
>>   sketch.update(rand.nextGaussian());
>>   System.err.println("Retained: " + sketch.getRetainedItems() + "\tNormRakErr: " + sketch.getNormalizedRankError(false));
>> }
>>
>> I insert n-1 items, then print retained items and rank error. Then I
>> insert 1 more and do the same. The output is:
>> Retained: 511 NormRakErr: 0.05415609535374165
>> Retained: 32 NormRakErr: 0.05415609535374165
>> And that makes sense since our rank error function doesn't take into
>> account n at all. As a result, we have no better an error estimate when
>> retaining 511 items compared to 32. The error bound is clearly being overly
>> conservative, not taking advantage of how many items we're actually
>> retaining. It's the 99th percentile plot, but we see this in our
>> documentation here:
>> http://datasketches.apache.org/docs/img/quantiles/qds-7-compact-accuracy-1k-99pct-20180224.png
>>
>> With the error guarantee in the paper of |\hat{R}(y) - R(y)| <=
>> \varepsilon R(y), the linear bounds seem correct. In fact, anything with
>> error increasing exponentially seems suspiciously odd. But perhaps I didn't
>> think through the details of the rank definition enough.
>>
>> Anyway, maybe I'm just confused by what exactly you mean by "predict the
>> actual error behavior"?
>>
>>   jon
>>
>>
>> On Sun, Sep 13, 2020, 10:02 PM leerho <le...@gmail.com> wrote:
>>
>>> I disagree.  The bounds on our current sketches do a pretty good job of
>>> predicting the actual error behavior.  That is why we have gone to the
>>> trouble of allowing the user to choose the size of the confidence interval
>>> and in the case of the theta Sketches we actually track the growth of the
>>> error Contour in the early stages of Estimation mode.   The better we can
>>> specify the error, the more confident the user will be in the result.
>>>
>>> For these early characterization tests it is critical to understand the
>>> error behavior over the entire range of ranks.  How we ultimately specify
>>> and document how to properly use this sketch we will do later.
>>>
>>> Lee.
>>>
>>>
>>> On Sun, Sep 13, 2020 at 1:45 PM Jon Malkin <jo...@gmail.com> wrote:
>>>
>>>> Our existing bounds, especially for the original quantiles sketch,
>>>> don't necessarily predict the actual error behavior. They're an upper bound
>>>> on it. Especially right before one of the major cascading compaction
>>>> rounds, our error may be substantially better than the bound.
>>>>
>>>> I also feel like characterization plots going from 0.0-1.0 kind of
>>>> misses the point of the relative error quantiles sketch. It's kind of
>>>> useful to know what it's doing everywhere but if you care about that whole
>>>> range you should be using KLL, not this sketch. If you care about accuracy
>>>> over the whole range and also high precision at the tail, you should be
>>>> using both sketches together. The interesting thing with this sketch would
>>>> be to focus only on error in, say, the outer 5% at most. Maybe even a log
>>>> scale plot approaching the edge of the range.
>>>>
>>>>   jon
>>>>
>>> --
>>> From my cell phone.
>>>
>>
>

Re: Java code for Relative Error Quantiles Sketch

Posted by "Vesely, Pavel" <Pa...@warwick.ac.uk>.
Hi Justin,
regarding K and epsilon: K is actually an initial setting of section size and it should be equal to Theta(1 / epsilon), regardless of the stream length SL. This is because SL is not known in advance and we maintain the correct section size dynamically. That is, each compactor starts assuming that SL is small (i.e. SL = O(K)), so it has just 3 sections of size K, and every time the number of sections needs to be increased because (# of compactions) > 2^(# of sections), we decrease K by a factor of sqrt(2) for that compactor and double the number of sections.

So, it seems to me that in the implementation, the relation between K and \varepsilon is quite clear, up to a constant factor :-) The relation of K and epsilon is only complicated in the paper, where we involve SL in the definition of K.

Pavel

PS: I agree that we should have more samples of the error distribution towards the tail that we care about
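
A minimal sketch of that adaptation rule, with invented field and method names (this is neither the Python nor the Java implementation):

final class SectionAdaptation {
  int numSections = 3;        // each compactor starts with 3 sections
  double sectionSizeF;        // kept fractional so the sqrt(2) steps compose
  long numCompactions = 0;

  SectionAdaptation(final int k) { sectionSizeF = k; }

  // When (# of compactions) > 2^(# of sections), shrink the section size
  // by a factor of sqrt(2) and double the number of sections.
  void onCompaction() {
    numCompactions++;
    if (numCompactions > (1L << numSections)) {
      sectionSizeF /= Math.sqrt(2.0);
      numSections *= 2;
    }
  }

  int sectionSize() { return (int) Math.round(sectionSizeF); }
}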

On 2020-09-14 14:32, Justin Thaler wrote:
> that's fine, but 64 points over the entire range means we haven't actually confirmed that it's working properly as you approach the tail. It's not sampled well enough at the tails to observe that behavior.

I'll second that the primary motivation behind studying relative-error quantiles is to get really good accuracy in the tail. So while we should, of course, carefully characterize the error behavior over the whole range of ranks, a key component of the characterization should be to understand in practice how the error is behaving in the tail. This may require significantly more fine-grained characterization testing in the tail than we are used to.

In particular, a lot of the ongoing discussion may just stem from the fact that the quantiles being reported are just the integer multiples of 1/64. My initial reaction is that the benefit of the relative-error sketch will be more apparent if the characterization tests cut up the interval [0, 1/64] much more finely.

For example, the theory guarantees that for Low-Rank-Accuracy, items of (absolute) rank 0, 1, ..., 1/\varepsilon have exact counts, items of rank (1/\varepsilon)+1, (1/\varepsilon)+2, ..., 2/\varepsilon have absolute error at most 1, and so on.

In terms of relative ranks (normalized by stream length SL), each interval of 1/\varepsilon items referenced above has length just (1/\varepsilon)/SL, i.e., we only guarantee exact counts for items of relative rank [0, (1/\varepsilon)/SL].

If \varepsilon is, say, 0.01, and SL is 2^{17}, then we are talking about intervals of length less than 0.001. This means that I am not surprised to see that we do not get _exact_ counts out of the sketch in practice even in the tail if we are effectively defining the tail to be all items of relative rank [0, 1/64]~=[0, 0.015].

It's a bit tricky to try to figure out what setting of \varepsilon is (implicitly) being used in the current characterization code since the parameter provided is K as opposed to \varepsilon.

I think K here means each buffer of size ~2B is viewed by the algorithm as consisting of 2B/K blocks of length K. In which case, if I recall correctly, the paper wants to set K to Theta((1/\varepsilon)/\sqrt{log_2(\varepsilon *SL)}) (with complications arising because SL is not known in advance, and with sizeable constants hidden by the Big-Theta-notation), see Equation 16 of https://arxiv.org/pdf/2004.01668.pdf. So in principle, once you tell me SL, I can probably derive the value of \varepsilon corresponding to any value of K, but this is complicated in practice since the expression defining K in terms of \varepsilon and SL is complicated.

This is not any sort of criticism of the information that Lee has provided :-). It is just an inherently annoying thing to relate K to the \varepsilon parameter appearing in the theoretical analysis of the algorithm.


On Mon, Sep 14, 2020 at 3:19 AM Jon Malkin <jo...@gmail.com> wrote:
If you want to also test the whole range, that's fine, but 64 points over the entire range means we haven't actually confirmed that it's working properly as you approach the tail. It's not sampled well enough at the tails to observe that behavior.

Ok, first off, this is what the original quantiles sketch does to keep the equations from going above 1.0/below 0.0:
https://github.com/apache/incubator-datasketches-java/blob/master/src/main/java/org/apache/datasketches/quantiles/DoublesSketch.java#L240
https://github.com/apache/incubator-datasketches-java/blob/master/src/main/java/org/apache/datasketches/quantiles/DoublesSketch.java#L251
In other words, we just clip the result to stay within the known range. Because the raw formula doesn't care about such limits and will happily return something outside that range otherwise.

And our existing bounds on the classic quantiles sketch are too pessimistic most of the time. I mean take this simple code example:

public void rankErrorTest() {
  int n = 1 << 20;
  int k = 1 << 5;

  UpdateDoublesSketch sketch = DoublesSketch.builder().setK(k).build();
  for (int i = 0; i < n-1; i++)
    sketch.update(rand.nextGaussian());
  System.err.println("Retained: " + sketch.getRetainedItems() + "\tNormRakErr: " + sketch.getNormalizedRankError(false));

  sketch.update(rand.nextGaussian());
  System.err.println("Retained: " + sketch.getRetainedItems() + "\tNormRakErr: " + sketch.getNormalizedRankError(false));
}

I insert n-1 items, then print retained items and rank error. Then I insert 1 more and do the same. The output is:
Retained: 511 NormRakErr: 0.05415609535374165
Retained: 32 NormRakErr: 0.05415609535374165
And that makes sense since our rank error function doesn't take into account n at all. As a result, we have no better an error estimate when retaining 511 items compared to 32. The error bound is clearly being overly conservative, not taking advantage of how many items we're actually retaining. It's the 99th percentile plot, but we see this in our documentation here: http://datasketches.apache.org/docs/img/quantiles/qds-7-compact-accuracy-1k-99pct-20180224.png

With the error guarantee in the paper of |\hat{R}(y) - R(y)| <= \varepsilon R(y), the linear bounds seem correct. In fact, anything with error increasing exponentially seems suspiciously odd. But perhaps I didn't think through the details of the rank definition enough.

Anyway, maybe I'm just confused by what exactly you mean by "predict the actual error behavior"?

  jon


On Sun, Sep 13, 2020, 10:02 PM leerho <le...@gmail.com> wrote:
I disagree.  The bounds on our current sketches do a pretty good job of predicting the actual error behavior.  That is why we have gone to the trouble of allowing the user to choose the size of the confidence interval and in the case of the theta Sketches we actually track the growth of the error Contour in the early stages of Estimation mode.   The better we can specify the error, the more confident the user will be in the result.

For these early characterization tests it is critical to understand the error behavior over the entire range of ranks.  How we ultimately specify and document how to properly use this sketch we will do later.

Lee.


On Sun, Sep 13, 2020 at 1:45 PM Jon Malkin <jo...@gmail.com> wrote:
Our existing bounds, especially for the original quantiles sketch, don't necessarily predict the actual error behavior. They're an upper bound on it. Especially right before one of the major cascading compaction rounds, our error may be substantially better than the bound.

I also feel like characterization plots going from 0.0-1.0 kind of misses the point of the relative error quantiles sketch. It's kind of useful to know what it's doing everywhere but if you care about that whole range you should be using KLL, not this sketch. If you care about accuracy over the whole range and also high precision at the tail, you should be using both sketches together. The interesting thing with this sketch would be to focus only on error in, say, the outer 5% at most. Maybe even a log scale plot approaching the edge of the range.

  jon
--
From my cell phone.


Re: Java code for Relative Error Quantiles Sketch

Posted by Justin Thaler <ju...@gmail.com>.
> that's fine, but 64 points over the entire range means we haven't
actually confirmed that it's working properly as you approach the tail.
It's not sampled well enough at the tails to observe that behavior.

I'll second that the primary motivation behind studying relative-error
quantiles is to get really good accuracy in the tail. So while we should,
of course, carefully characterize the error behavior over the whole range
of ranks, a key component of the characterization should be to understand
in practice how the error is behaving in the tail. This may require
significantly more fine-grained characterization testing in the tail than
we are used to.

In particular, a lot of the ongoing discussion may just stem from the fact
that the quantiles being reported are just the integer multiples of 1/64.
My initial reaction is that the benefit of the relative-error sketch will
be more apparent if the characterization tests cut up the interval [0,
1/64] much more finely.

For example, the theory guarantees that for Low-Rank-Accuracy, items of
(absolute) rank 0, 1, ..., 1/\varepsilon have exact counts, items of rank
(1/\varepsilon)+1, (1/\varepsilon)+2, ..., 2/\varepsilon have absolute error
at most 1, and so on.

In terms of relative ranks (normalized by stream length SL), each interval
of 1/\varepsilon items referenced above has length just (1/\varepsilon)/SL,
i.e., we only guarantee exact counts for items of relative rank [0,
(1/\varepsilon)/SL].

If \varepsilon is, say, 0.01, and SL is 2^{17}, then we are talking about
intervals of length less than 0.001. This means that I am not surprised to
see that we do not get _exact_ counts out of the sketch in practice even in
the tail if we are effectively defining the tail to be all items of
relative rank [0, 1/64]~=[0, 0.015].
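
Stated compactly, the block-by-block guarantee above amounts to (for LRA; this is a paraphrase, not a formula from the paper): |\hat{R}(y) - R(y)| <= ceil(\varepsilon * R(y)) - 1, so every item with R(y) <= 1/\varepsilon is counted exactly, and the allowed absolute error grows by one with each further block of 1/\varepsilon ranks.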

It's a bit tricky to try to figure out what setting of \varepsilon is
(implicitly) being used in the current characterization code since the
parameter provided is K as opposed to \varepsilon.

I think K here means each buffer of size ~2B is viewed by the algorithm as
consisting of 2B/K blocks of length K. In which case, if I recall
correctly, the paper wants to set K to
Theta((1/\varepsilon)/\sqrt{log_2(\varepsilon *SL)}) (with complications
arising because SL is not known in advance, and with sizeable constants
hidden by the Big-Theta-notation), see Equation 16 of
https://arxiv.org/pdf/2004.01668.pdf. So in principle, once you tell me SL,
I can probably derive the value of \varepsilon corresponding to any value
of K, but this is complicated in practice since the expression defining K
in terms of \varepsilon and SL is complicated.

This is not any sort of criticism of the information that Lee has provided
:-). It is just an inherently annoying thing to relate K to the \varepsilon
parameter appearing in the theoretical analysis of the algorithm.
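
For what it's worth, the inversion can be done numerically. The sketch below
fixed-point iterates K = (1/eps)/sqrt(log2(eps * SL)) to recover the eps
implied by a given K and SL; it assumes the constant hidden in the Theta is
1, purely for illustration:

final class EpsFromK {
  // Iterate eps = 1 / (K * sqrt(log2(eps * SL))); converges quickly when
  // eps * SL >> 1.
  static double epsFor(final double k, final double sl) {
    double eps = 1.0 / k;                  // start by ignoring the sqrt term
    for (int i = 0; i < 50; i++) {
      final double log2 = Math.log(eps * sl) / Math.log(2.0);
      eps = 1.0 / (k * Math.sqrt(log2));
    }
    return eps;
  }

  public static void main(final String[] args) {
    // e.g., the eps implied by K = 32 at SL = 2^17, under these assumptions
    System.out.println(epsFor(32, 1 << 17));
  }
}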


On Mon, Sep 14, 2020 at 3:19 AM Jon Malkin <jo...@gmail.com> wrote:

> If you want to also test the whole range, that's fine, but 64 points over
> the entire range means we haven't actually confirmed that it's working
> properly as you approach the tail. It's not sampled well enough at the
> tails to observe that behavior.
>
> Ok, first off, this is what the original quantiles sketch does to keep the
> equations from going above 1.0/below 0.0:
>
> https://github.com/apache/incubator-datasketches-java/blob/master/src/main/java/org/apache/datasketches/quantiles/DoublesSketch.java#L240
>
> https://github.com/apache/incubator-datasketches-java/blob/master/src/main/java/org/apache/datasketches/quantiles/DoublesSketch.java#L251
> In other words, we just clip the result to stay within the known range.
> Because the raw formula doesn't care about such limits and will happily
> return something outside that range otherwise.
>
> And our existing bounds on the classic quantiles sketch are too
> pessimistic most of the time. I mean take this simple code example:
>
> public void rankErrorTest() {
>   int n = 1 << 20;
>   int k = 1 << 5;
>
>   UpdateDoublesSketch sketch = DoublesSketch.builder().setK(k).build();
>   for (int i = 0; i < n-1; i++)
>     sketch.update(rand.nextGaussian());
>   System.err.println("Retained: " + sketch.getRetainedItems() + "\tNormRakErr: " + sketch.getNormalizedRankError(false));
>
>   sketch.update(rand.nextGaussian());
>   System.err.println("Retained: " + sketch.getRetainedItems() + "\tNormRakErr: " + sketch.getNormalizedRankError(false));
> }
>
> I insert n-1 items, then print retained items and rank error. Then I
> insert 1 more and do the same. The output is:
> Retained: 511 NormRakErr: 0.05415609535374165
> Retained: 32 NormRakErr: 0.05415609535374165
> And that makes sense since our rank error function doesn't take into
> account n at all. As a result, we have no better an error estimate when
> retaining 511 items compared to 32. The error bound is clearly being overly
> conservative, not taking advantage of how many items we're actually
> retaining. It's the 99th percentile plot, but we see this in our
> documentation here:
> http://datasketches.apache.org/docs/img/quantiles/qds-7-compact-accuracy-1k-99pct-20180224.png
>
> With the error guarantee in the paper of |\hat{R}(y) - R(y)| <=
> \varepsilon R(y), the linear bounds seem correct. In fact, anything with
> error increasing exponentially seems suspiciously odd. But perhaps I didn't
> think through the details of the rank definition enough.
>
> Anyway, maybe I'm just confused by what exactly you mean by "predict the
> actual error behavior"?
>
>   jon
>
>
> On Sun, Sep 13, 2020, 10:02 PM leerho <le...@gmail.com> wrote:
>
>> I disagree.  The bounds on our current sketches do a pretty good job of
>> predicting the actual error behavior.  That is why we have gone to the
>> trouble of allowing the user to choose the size of the confidence interval
>> and in the case of the theta Sketches we actually track the growth of the
>> error Contour in the early stages of Estimation mode.   The better we can
>> specify the error, the more confident the user will be in the result.
>>
>> For these early characterization tests it is critical to understand the
>> error behavior over the entire range of ranks.  How we ultimately specify
>> and document how to properly use this sketch we will do later.
>>
>> Lee.
>>
>>
>> On Sun, Sep 13, 2020 at 1:45 PM Jon Malkin <jo...@gmail.com> wrote:
>>
>>> Our existing bounds, especially for the original quantiles sketch,
>>> don't necessarily predict the actual error behavior. They're an upper bound
>>> on it. Especially right before one of the major cascading compaction
>>> rounds, our error may be substantially better than the bound.
>>>
>>> I also feel like characterization plots going from 0.0-1.0 kind of
>>> misses the point of the relative error quantiles sketch. It's kind of
>>> useful to know what it's doing everywhere but if you care about that whole
>>> range you should be using KLL, not this sketch. If you care about accuracy
>>> over the whole range and also high precision at the tail, you should be
>>> using both sketches together. The interesting thing with this sketch would
>>> be to focus only on error in, say, the outer 5% at most. Maybe even a log
>>> scale plot approaching the edge of the range.
>>>
>>>   jon
>>>
>> --
>> From my cell phone.
>>
>

Re: Java code for Relative Error Quantiles Sketch

Posted by Jon Malkin <jo...@gmail.com>.
If you want to also test the whole range, that's fine, but 64 points over
the entire range means we haven't actually confirmed that it's working
properly as you approach the tail. It's not sampled well enough at the
tails to observe that behavior.

Ok, first off, this is what the original quantiles sketch does to keep the
equations from going above 1.0/below 0.0:
https://github.com/apache/incubator-datasketches-java/blob/master/src/main/java/org/apache/datasketches/quantiles/DoublesSketch.java#L240
https://github.com/apache/incubator-datasketches-java/blob/master/src/main/java/org/apache/datasketches/quantiles/DoublesSketch.java#L251
In other words, we just clip the result to stay within the known range.
Because the raw formula doesn't care about such limits and will happily
return something outside that range otherwise.
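
The clip itself is a one-liner (a paraphrase of what those two linked methods do, not their exact code):

static double clip(final double rawValue, final double lo, final double hi) {
  return Math.max(lo, Math.min(hi, rawValue));  // e.g., force a rank into [0,1]
}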

And our existing bounds on the classic quantiles sketch are too pessimistic
most of the time. I mean take this simple code example:

public void rankErrorTest() {
  int n = 1 << 20;
  int k = 1 << 5;

  UpdateDoublesSketch sketch = DoublesSketch.builder().setK(k).build();
  for (int i = 0; i < n-1; i++)
    sketch.update(rand.nextGaussian());
  System.err.println("Retained: " + sketch.getRetainedItems() +
"\tNormRakErr: " + sketch.getNormalizedRankError(false));

  sketch.update(rand.nextGaussian());
  System.err.println("Retained: " + sketch.getRetainedItems() +
"\tNormRakErr: " + sketch.getNormalizedRankError(false));
}

I insert n-1 items, then print retained items and rank error. Then I insert
1 more and do the same. The output is:
Retained: 511 NormRakErr: 0.05415609535374165
Retained: 32 NormRakErr: 0.05415609535374165
And that makes sense since our rank error function doesn't take into
account n at all. As a result, we have no better an error estimate when
retaining 511 items compared to 32. The error bound is clearly being overly
conservative, not taking advantage of how many items we're actually
retaining. It's the 99th percentile plot, but we see this in our
documentation here:
http://datasketches.apache.org/docs/img/quantiles/qds-7-compact-accuracy-1k-99pct-20180224.png

With the error guarantee in the paper of |\hat{R}(y) - R(y)| <= \varepsilon
R(y), the linear bounds seem correct. In fact, anything with error
increasing exponentially seems suspiciously odd. But perhaps I didn't think
through the details of the rank definition enough.
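
Spelling that out (my rearrangement of the paper's guarantee): |\hat{R}(y) - R(y)| <= \varepsilon R(y) is the same as (1 - \varepsilon) R(y) <= \hat{R}(y) <= (1 + \varepsilon) R(y), so for a fixed \varepsilon both bounds are straight lines through the origin as a function of the true rank R(y).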

Anyway, maybe I'm just confused by what exactly you mean by "predict the
actual error behavior"?

  jon


On Sun, Sep 13, 2020, 10:02 PM leerho <le...@gmail.com> wrote:

> I disagree.  The bounds on our current sketches do a pretty good job of
> predicting the actual error behavior.  That is why we have gone to the
> trouble of allowing the user to choose the size of the confidence interval
> and in the case of the theta Sketches we actually track the growth of the
> error Contour in the early stages of Estimation mode.   The better we can
> specify the error, the more confident the user will be in the result.
>
> For these early characterization tests it is critical to understand the
> error behavior over the entire range of ranks.  How we ultimately specify
> and document how to properly use this sketch we will do later.
>
> Lee.
>
>
> On Sun, Sep 13, 2020 at 1:45 PM Jon Malkin <jo...@gmail.com> wrote:
>
>> Our existing bounds, especially for the original quantiles sketch, don't
>> necessarily predict the actual error behavior. They're an upper bound on
>> it. Especially right before one of the major cascading compaction rounds,
>> our error may be substantially better than the bound.
>>
>> I also feel like characterization plots going from 0.0-1.0 kind of misses
>> the point of the relative error quantiles sketch. It's kind of useful to
>> know what it's doing everywhere but if you care about that whole range you
>> should be using KLL, not this sketch. If you care about accuracy over the
>> whole range and also high precision at the tail, you should be using both
>> sketches together. The interesting thing with this sketch would be to focus
>> only on error in, say, the outer 5% at most. Maybe even a log scale plot
>> approaching the edge of the range.
>>
>>   jon
>>
> --
> From my cell phone.
>

Re: Java code for Relative Error Quantiles Sketch

Posted by leerho <le...@gmail.com>.
I disagree.  The bounds on our current sketches do a pretty good job of
predicting the actual error behavior.  That is why we have gone to the
trouble of allowing the user to choose the size of the confidence interval,
and in the case of the theta sketches we actually track the growth of the
error contour in the early stages of estimation mode.  The better we can
specify the error, the more confident the user will be in the result.

For these early characterization tests it is critical to understand the
error behavior over the entire range of ranks.  How we ultimately specify
and document the proper use of this sketch is something we will do later.

Lee.


-- 
From my cell phone.

Re: Java code for Relative Error Quantiles Sketch

Posted by Jon Malkin <jo...@gmail.com>.
Our existing bounds, especially for the original quantiles sketch, don't
necessarily predict the actual error behavior. They're an upper bound on
it. Especially right before one of the major cascading compaction rounds,
our error may be substantially better than the bound.

I also feel like characterization plots going from 0.0-1.0 kind of miss
the point of the relative error quantiles sketch. It's useful to know what
the sketch is doing everywhere, but if you care about that whole range you
should be using KLL, not this sketch. If you care about accuracy over the
whole range and also high precision at the tail, you should be using both
sketches together. The interesting thing with this sketch would be to focus
only on the error in, say, the outer 5% at most. Maybe even a log-scale
plot approaching the edge of the range; one way to generate such points is
sketched below.
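
For concreteness, a minimal way to generate evaluation ranks that crowd
toward the top of the range at log-spaced distances (this helper is purely
illustrative; logSpacedTailRanks is not part of the library):

    // Hypothetical helper: ranks approaching 1.0 at log-spaced distances,
    // e.g. 0.95, 0.995, 0.9995, ... for plotting tail accuracy on a log scale.
    static double[] logSpacedTailRanks(final int count, final double maxDist) {
      final double[] ranks = new double[count];
      for (int i = 0; i < count; i++) {
        ranks[i] = 1.0 - (maxDist * Math.pow(10.0, -i)); // maxDist, maxDist/10, ...
      }
      return ranks;
    }

Called as logSpacedTailRanks(4, 0.05), this yields {0.95, 0.995, 0.9995,
0.99995}: evenly spaced points when plotted against log(1 - rank).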

  jon


Re: Java code for Relative Error Quantiles Sketch

Posted by leerho <le...@gmail.com>.
Pavel,
Yes, I was doing something wrong! Sorry! I was confusing the predicted
bounds on a rank with the error shown in the graphs, where the absolute
rank is subtracted out, showing just the difference.

Nonetheless, we do need bound functions that do not go negative for small K.

Lee.


Re: Java code for Relative Error Quantiles Sketch

Posted by leerho <le...@gmail.com>.
Pavel,
I have looked very carefully at your p-code for getRankUB and getRankLB,
and they both produce nonsensical values.

   - getMaximumRSE(k) with K=4 (the smallest legal value of K), multiplied
   by 3 SD, gives 1.23, which is > 1.  The result should never exceed 1 for
   any legal value of K; when it does, the LB goes negative.
   - Given a K and SD, the factor (1 +/- SD * RSE) is a constant.  Multiplied
   by the rank, it produces bounds that increase linearly from the lowest
   rank to the highest.  This does not model the error properties of the HRA
   case, which has decreasing error at the high ranks, or of the LRA case,
   where the error is pretty flat over most of the range.
   - We need equations that not only predict the actual error behavior but
   also, for the lower bound, never produce values < 0 (for all legal values
   of K), which these equations will.  (See the sketch below.)
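
To make the failure mode concrete, here is my reading of that p-code,
reconstructed in Java (this is a reconstruction, not the actual
implementation; the constant 0.41 is just getMaximumRSE(4) as used above):

    // Reconstruction of the proposed bounds: each is the rank scaled by a
    // constant, hence linear in rank from the lowest rank to the highest.
    static double getRankLB(final double rank, final int numStdDev, final double rse) {
      return rank * (1.0 - (numStdDev * rse)); // negative when numStdDev * rse > 1
    }

    static double getRankUB(final double rank, final int numStdDev, final double rse) {
      return rank * (1.0 + (numStdDev * rse)); // can likewise exceed 1.0
    }

    // With K=4: 3 SD * getMaximumRSE(4) = 1.23, so getRankLB(r, 3, 0.41) < 0
    // for every positive rank.  Clamping with Math.max(0.0, lb) would hide
    // the symptom but still would not model the actual error behavior.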

Comments:

   - This is not a bug, but a user design principle.  I don't think it is a
   good idea to have the user specify a value to obtain the rank error.  We
   don't want to give the user the idea that the error bounds are related to
   values at all, only ranks.  I realize we can translate the given value to
   an estimated rank, but let's have the user do that and document to the user
   the caveats about doing that.
   - We have been training our users to become accustomed to specifying and
   obtaining ranks as normalized [0,1]. That is how our other quantile
   sketches work.  Most users think in terms of percentiles anyway, so
   normalized ranks are pretty natural.

Perhaps I am still doing something wrong?

Cheers,

Lee.


Re: Java code for Relative Error Quantiles Sketch

Posted by "Vesely, Pavel" <Pa...@warwick.ac.uk>.
Hi Lee,
regarding how to depict the error distribution in a way that's easy to explain and that shows the differences between LRA and HRA, and between relative and uniform guarantees:
I suggest simply using re[i] = r[i] - tr[i]. This is actually a uniform error, and for our sketch the plots should look roughly like a linear function (decreasing for HRA and increasing for LRA). So the plots shouldn't look like any of the four plots you sent on Friday.
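
For reference, the two error definitions in play in this thread, side by side (just transcribing what is proposed here and in my message below):

    re[i] = r[i] - tr[i]        -- uniform (additive) error, same scale at every rank
    re[i] = r[i] / tr[i] - 1    -- relative error, scales with the true rank tr[i]
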
Cheers,
Pavel



Re: Java code for Relative Error Quantiles Sketch

Posted by leerho <le...@gmail.com>.
Hi Pavel,
I think we are in agreement.  I take it as a positive that you haven't
found any flaw in the implementation of the relative error quantiles
algorithm.  What we are discussing now is how to define rank, especially
with respect to HRA and LRA, and, philosophically, what kind of error
distribution as a function of rank users will want and will find easiest
to specify and explain.

> My intuition is that all plots should look roughly the same, up to
> mirroring along the Y axis.


This statement is a little puzzling, as it will only be true if we choose
the definition of rank appropriately depending on whether the user selects
LRA or HRA.  As my plots reveal, if we keep the definition of rank the
same, switching from HRA to LRA produces a dramatically different error
distribution as a f(rank).  I agree with your formulas for relative error,
except that all of our ranks are already normalized by SL, so I would
replace SL in your formulas by 1.0.
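
(Concretely, with SL normalized to 1.0, your two definitions from the
message below read, in my transcription:

    re[i] = r[i] / tr[i] - 1              -- LRA with <= ranks; HRA with >= ranks
    re[i] = (r[i] - tr[i]) / (1 - tr[i])  -- HRA with <= ranks; LRA with >= ranks )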

I still want to add to these plots your a priori calculation of error to
see where it lies compared to the measured error.

I gather from your comments that what you had in mind when writing the
paper was a rank error distribution that looks like plots 3 and 4 above and
not plots 1 and 2.  However, I feel that the rank error distribution shown
in plots 1 and 2 would be easier for users to understand.  We will probably
leave the decision totally up to the user as to what they prefer; however,
this will require more documentation to explain all the different error
behaviors  :( .

Lee.

Re: Java code for Relative Error Quantiles Sketch

Posted by Pavel Vesely <Pa...@warwick.ac.uk>.
Hi Lee,

my intuition is that all plots should look roughly the same, up to mirroring
along the Y axis. In particular, the LRA case with ranks defined by >= should be
the same as the HRA case with <= ranks. (Actually, does "LRA with >= ranks" mean
that we want to be more precise for low values that nevertheless have high ranks
due to the changed definition of rank?)

I actually think that the behavior for LRA (with <= or < ranks) is correct, as
your definition of the relative error is correct. In other words, the curves
should be approximately flat, but of course there's 0 error for very low ranks
with probability 1, and the maximum value is also correct deterministically. So
the LRA plot is not a big surprise for me :-)

For HRA, I suggest using re[i] = (r[i] - tr[i]) / (SL - tr[i]), and then you get
the same behavior as for LRA (only mirrored). I also think you need to make the
same modification of re[i] for LRA with >= ranks. So, my suggestion for defining
re[i] is:
- re[i] = r[i] / tr[i] - 1 -- for LRA with <= ranks and for HRA with >= ranks,
- re[i] = (r[i] - tr[i]) / (SL - tr[i]) -- for HRA with <= ranks and for LRA with >= ranks.

Does it make sense to you?

Cheers,
Pavel


Re: Java code for Relative Error Quantiles Sketch

Posted by leerho <le...@gmail.com>.
Folks,
Here are some of the first characterization results.

Test Design:

   - Stream Lengths (SL), 2^11, 2^14, 2^17, 2^20
   - Stream values, natural numbers from 1 to SL, expressed as floats (32
   bits).
   - X-axis Trial Points: 64 evenly spaced values between 0 and SL, where
   the top value is always SL. (tp[])
   - Generate the true ranks of the 64 values: (tr[])
   - Num Trials: 2^12 = 4096.
   - Run the 4096 Trials; For each Trial:
      - Shuffle the stream
      - Feed the stream into the ReqSketch (sk)
      - perform a bulk query: r[] = sk.getRanks(tp[]);
      - for each rank compute the relative error: re[i] = r[i] / tr[i] - 1;
      - feed the re[] into an array of 64 error quantiles sketches (KLL or
      the older Quantiles sketch):  Qerr[].
   - End of Trials
   - From each of the 64 Qerr[] do getQuantiles(G[]), where G[] is an
   array of 6 Gaussian ranks corresponding to +/-1SD, +/-2SD, Median, and
   1.0.  These values are plotted for all 64 tr[] values along the x-axis.
   (A rough code sketch of one trial follows below.)
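
To make the per-trial loop concrete, here is a rough Java sketch of one
trial under the design above.  Only update() and the bulk getRanks() query
are taken from the description; the method name runTrial and the exact
signatures are my assumptions, and the Qerr[] accumulators are the older
quantiles sketches:

    // One trial: feed a shuffled stream, bulk-query the normalized ranks at
    // the 64 trial points, accumulate relative errors into per-point sketches.
    static void runTrial(final float[] shuffledStream, final float[] tp,
        final double[] tr, final ReqSketch sk, final UpdateDoublesSketch[] qErr) {
      for (final float v : shuffledStream) { sk.update(v); }
      final double[] r = sk.getRanks(tp);         // normalized ranks in [0,1]
      for (int i = 0; i < tp.length; i++) {
        qErr[i].update((r[i] / tr[i]) - 1.0);     // re[i] = r[i] / tr[i] - 1
      }
    }

At the end, calling getQuantiles(G[]) on each Qerr[i] with
G ~= {0.0228, 0.1587, 0.5, 0.8413, 0.9772, 1.0} (the standard normal CDF at
-2 SD, -1 SD, the median, +1 SD, +2 SD, plus the max) yields the plotted
contours.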

First I configured the ReqSketch with High Rank Accuracy, the ">=" comparison
criterion, and K=50. It produced the following plot:
[image: ReqErrorK50SL11HRA.png]
This is actually what I expected: a very smooth transition from the high
error of the low ranks to the excellent error for the high ranks.  For the
very low ranks the error shoots off, ultimately to infinity, as those values
may have disappeared from the sketch.
The 3 curves above zero represent +1 sd, +2 sd, and 1.0.  The 2 curves
below are -1 sd and -2 sd.

What is nice is that +/-2 sd represents a 95.4% CI and that starts at ranks
0.2 and above.  Running this same test with 2^20 values produces the next
plot, which behaves very much the same way:
[image: ReqErrorK50SL20HRA.png]
Next I ran the tests with Low Rank Accuracy; everything else is the same.
Note that the vertical scale is different!
This at first was a big surprise:
[image: ReqErrorK50SL11LRA.png]
But after thinking about it I realized that this is due to the way we have
defined rank as <= (or <).  If we define rank as >= (or >), we should get
a similar shape to the earlier plots.  Running the LRA test with SL=2^17
(I didn't want to wait an hour for the 2^20 plot!) generated the following:
[image: ReqErrorK50SL17LRA.png]
Note that the error doesn't drop to zero except for the +/-1sd contours.
This is because with SL=2^17 the first of the 64 evenly placed points (at
2^17 / 64 = 2048 items) is already larger than the number of values that
are never promoted.

I'm not sure what was intended, but if these results are correct, this
behavior for the LRA is not very desirable.  The confidence intervals are
almost flat (actually getting smaller at the high ranks, which is
counterintuitive), and the only region that is truly "highly accurate" is a
very tiny range at the very low end, with a very sudden drop of accuracy
into that range.  Yes, it is true that the error is better than 1% over the
entire range.  But I think in practice this behavior might be problematic.

What I wonder is whether we should be defining rank for the LRA case to be
">=" (or ">")?  Either way, we would need to define our error behavior
very carefully!

Cheers,

Lee.







>> > On Thu, Sep 3, 2020 at 7:40 AM Vesely, Pavel <Pavel.Vesely@warwick.ac.uk> wrote:
>> >
>> >     Hi Lee,
>> >
>> >     I finally got back to you. Regarding your summary below, it's right
>> that
>> >     we don't have a mathematical analysis supporting Strategy 4, and
>> there
>> >     might be none (although constructing a specific example might be
>> tricky).
>> >
>> >     I've added the remaining functions that you need, namely:
>> >
>> >     - getMaximumRSE(k): Returns an a priori estimate of relative
>> standard
>> >     error (RSE, expressed as a number in [0,1]), calculated as
>> sqrt(Var) /
>> >     rank (note that neither n, nor rank are needed). I'm using a
>> formula for
>> >     Var from the paper, but it seems too pessimistic (by a factor of 3),
>> >     according to some experiments that I've done.
>> >
>> >     - getUpperBound(numStdDev) and getLowerBound(numStdDev) that
>> provide a
>> >     confidence interval by adding/subtracting a specified multiple of
>> stdDev
>> >     from the estimate
>> >
>> >     Let me know if you have any questions. I can also join a call next
>> week
>> >     (Wednesday or Thursday would be the best).
>> >
>> >     Cheers,
>> >     Pavel
>> >
>> >
>> >     On 26.08.2020 at 21:18 leerho wrote:
>> >>     Hi Pavel,
>> >>     Thanks for your analysis.  My summary of your analysis:
>> >>
>> >>       * Strategy 1: This would work but performs too many compactions.
>> It has
>> >>         the simplest prediction of sketch size.  The math obviously
>> works so
>> >>         a priori prediction of sketch error should be relatively
>> straightforward.
>> >>       * Strategy 2: The math works and has far fewer compactions than
>> >>         Strategy1. You didn't comment on the difficulty in a priori
>> >>         prediction of sketch size.  Because it is anticipated in the
>> paper,
>> >>         the a priori prediction of sketch error should also be
>> straightforward.
>> >>       * Strategy 3:  Comment: your restatement of Region 1 is
>> equivalent to
>> >>         what I called Region 1 and Region 2. I only separated out
>> Region 2 to
>> >>         emphasize that that is the size of the overflow. Major
>> >>         disadvantages are: the math analysis doesn't support this
>> strategy,
>> >>         it would have as many compactions as Strategy 1, and may be
>> >>         vulnerable to pathological sequences.  So Strategy 3 is
>> definitely out!
>> >>       * Strategy 4: (If I understood your comments...) the math does
>> not
>> >>         support this strategy.  So let's leave this one out as well.
>> >>
>> >>     That leaves only Strategy 2.  Which I will modify the java code to
>> >>     implement.  This certainly simplifies testing which I am all in
>> favor of!
>> >>
>> >>     Lee.
>> >>
>> >>
>> >>     On Wed, Aug 26, 2020 at 9:22 AM Pavel Vesely <
>> Pavel.Vesely@warwick.ac.uk
>> >>     <ma...@warwick.ac.uk>> wrote:
>> >>
>> >>         Hi Lee!
>> >>
>> >>         Thanks for sending the links. I've done some renaming
>> according to your
>> >>         suggestions and added two new functions to my code:
>> >>
>> >>         - quantile(q): given quantile query q in [0,1], it returns an
>> approx.
>> >>         q-quantile
>> >>
>> >>         - getMaxStoredItems(k, n): this is a static method that
>> estimates the
>> >>         (max.
>> >>         nominal) size of the sketch with parameter k on input of size
>> n. It
>> >>         seems to be
>> >>         correct up to a 10% error (at least, when the input size and k
>> are
>> >>         sufficiently
>> >>         large, such as k >= 16). I believe a more precise function
>> would need to
>> >>         simulate constructing the whole sketch ...
>> >>
>> >>         Regarding the strategies, I have a few comments (some of them
>> I said
>> >>         during the
>> >>         call):
>> >>
>> >>         *Strategy 1:* Indeed, this strategy performs unnecessarily many
>> >>         compaction
>> >>         operations. The only advantage is an easy predictability of
>> the final
>> >>         size of
>> >>         the sketch, i.e., possibly a more precise getMaxStoredItems
>> function.
>> >>
>> >>         *Strategy 2:* its advantage is that our mathematical analysis
>> works
>> >>         for sure
>> >>         (strictly speaking, the proof works for Strategy 1, but there
>> are no
>> >>         issues
>> >>         introduced by laziness of Strategy 2, and actually, I think
>> the more
>> >>         general
>> >>         analysis of Appendix D applies to Strategy 2 directly).
>> >>
>> >>         *Strategy 3:* I remember it slightly differently:
>> >>
>> >>         - Region 1 has Length - numSection * sectionSize items and is
>> never
>> >>         compacted;
>> >>         it possibly contains some part of the "overflow" part (above
>> nomCap)
>> >>         - Region 2 has numSections * sectionSize largest items, split
>> into
>> >>         sections of
>> >>         size sectionSize, and we choose some number of sections to
>> compact
>> >>
>> >>         This has the disadvantage of compacting just a few items when
>> >>         sectionSize is
>> >>         small, leading to a larger number of compactions (similarly as
>> for
>> >>         Strategy 1).
>> >>         Furthermore, our mathematical analysis does not go through
>> directly
>> >>         and there
>> >>         may be some (contrived) adversarial instances on which it has
>> a large
>> >>         error.
>> >>
>> >>         The advantage is that we protect (i.e., not compact) more
>> items than
>> >>         in the
>> >>         previous strategies when the buffer overflow is large.
>> >>
>> >>         *Strategy 4:* I think Region 1 has size Length / 2, instead of
>> >>         nomCap / 2. The advantage is that when the buffer overflow is
>> >>         large we compact more than in Strategy 3 (although not as many
>> >>         items as in Strategy 2), while we also protect more items than
>> >>         in Strategy 2.
>> >>
>> >>         The disadvantage is that the getMaxStoredItems function would
>> >>         be the least precise because sectionSize varies over time (at
>> >>         least, that's my intuition). As for Strategy 3, our
>> >>         mathematical analysis does not go through directly and there
>> >>         may be some adversarial instances on which it has a large
>> >>         error, perhaps even more contrived than in the previous case.
>> >>
>> >>         So far, I have no clear recommendation, and haven't thought
>> >>         much about calculating the confidence bounds (will get to it
>> >>         next week). I would only exclude Strategy 1 and Strategy 3.
>> >>
>> >>         Best,
>> >>         Pavel
>> >>
>> >>         On 25.08.2020 at 23:40 leerho wrote:
>> >>         > I forgot to add the links:
>> >>         >
>> >>         > Main Code:
>> >>         > https://github.com/apache/incubator-datasketches-java/tree/RelativeErrorQuantiles/src/main/java/org/apache/datasketches/req
>> >>         >
>> >>         > Test Code:
>> >>         > https://github.com/apache/incubator-datasketches-java/tree/RelativeErrorQuantiles/src/test/java/org/apache/datasketches/req
>> >>         >
>> >>         > On Tue, Aug 25, 2020 at 2:10 PM leerho <leerho@gmail.com> wrote:
>> >>         >
>> >>         >     Folks,
>> >>         >
>> >>         >     Here are the links to the Java main code and the Java
>> >>         >     test code for the new REQ sketch.  This code is in a very
>> >>         >     early state and is in the branch
>> >>         >     "RelativeErrorQuantiles".  It has a lot of work left
>> >>         >     before we merge it into master, so recognize it will be
>> >>         >     changing a lot.
>> >>         >
>> >>         >     Pavel and I had a long discussion about what the best (&
>> >>         >     most practical) compaction strategy would be. Whatever
>> >>         >     strategy we choose, we need to be confident that we have
>> >>         >     mathematics that will support the implementation and that
>> >>         >     we will be able to obtain reasonable confidence bounds on
>> >>         >     the estimates produced by the sketch.  All of these
>> >>         >     strategies assume that the low ranks are the most
>> >>         >     accurate. (This of course can be changed.)
>> >>         >
>> >>         >     *Strategy 1.* My understanding of the paper is that it
>> >>         >     proposes this strategy:
>> >>         >     The compactor has a maximum size given by a Capacity = 2
>> >>         >     * numSections * sectionSize.  As soon as the compactor
>> >>         >     reaches capacity, the compaction process starts.
>> >>         >
>> >>         >       * Region1: The bottom half, of size Capacity / 2, is
>> >>         >         never compacted.
>> >>         >       * Region2: The top half consists of /m/ sections of
>> >>         >         size /k/ (these can change as a f(numCompactions),
>> >>         >         but ignore that for this description)
>> >>         >           o The top-most of the m sections is always
>> >>         >             compacted, leaving m-1 sections to be potentially
>> >>         >             compacted.
>> >>         >           o Of the m-1 sections, choose the top m' of these
>> >>         >             based on the distribution created from the number
>> >>         >             of trailing ones in the number of compactions
>> >>         >             that have already occurred in this compactor (see
>> >>         >             the sketch just below).
>> >>         >
>> >>         >     This will create the greatest number of compactions and
>> >>         >     the most levels per value of N.
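>> >>         >
>> >>         >     As a sketch of that selection rule (my naming, and one
>> >>         >     plausible reading, not necessarily the code in the
>> >>         >     branch): the top section always compacts, and each
>> >>         >     trailing one bit in the binary compaction count adds one
>> >>         >     more section, capped at the sections available:
>> >>         >
>> >>         >     static int sectionsToCompact(final long numCompactions,
>> >>         >         final int numSections) {
>> >>         >       // Trailing zeros of the complement == trailing ones of
>> >>         >       // the compaction count.
>> >>         >       final int trailingOnes =
>> >>         >           Long.numberOfTrailingZeros(~numCompactions);
>> >>         >       return Math.min(numSections, 1 + trailingOnes);
>> >>         >     }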
>> >>         >
>> >>         >     *Strategy 2.* This strategy is what Pavel suggested in
>> >>         >     his Python code:
>> >>         >     The compactor has a Nominal Capacity and an actual Length
>> >>         >     that can exceed the nomCap. The nomCap is defined as
>> >>         >     above, where nomCap = 2 * numSections * sectionSize for
>> >>         >     each compactor. The sketch has a maxSize =
>> >>         >     sum(compactor[i].nomCap() ).  When this maxSize is
>> >>         >     exceeded, the compression process starts, which chooses
>> >>         >     the lowest-numbered compactor that exceeds its nomCap to
>> >>         >     compact.  If this turns out to be the top compactor, then
>> >>         >     a new compactor is added at the top. (A rough sketch of
>> >>         >     this trigger appears after the region list below.)
>> >>         >
>> >>         >     By allowing a compactor to exceed its nomCap, we keep
>> >>         >     more values at a lower level, improving accuracy.  But
>> >>         >     the issue is what to do with the "overflow" items during
>> >>         >     compaction.  This is addressed differently by this
>> >>         >     strategy and the next two.
>> >>         >
>> >>         >       * This compactor has 3 regions:
>> >>         >           o Region1: of size nomCap / 2 is never compacted.
>> >>         >           o Region2: of size numSections * sectionSize
>> >>         >           o Region3: of size Length - (Region1 + Region2);
>> >>         >             this is the "overflow" and is always compacted.
>> >>         >       * In this strategy all of Region3 plus the randomly
>> >>         >         chosen, top m' sections of Region 2 will be
>> >>         >         compacted.
>> >>         >
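>> >>         >     Here is a minimal sketch of this lazy trigger (assumed
>> >>         >     names: Compactor, nomCap(), length(), compactInto(); this
>> >>         >     is an illustration, not the branch code):
>> >>         >
>> >>         >     interface Compactor {
>> >>         >       long nomCap();                    // 2 * numSections * sectionSize
>> >>         >       long length();                    // actual count; may exceed nomCap
>> >>         >       void compactInto(Compactor next); // compact; promote survivors up
>> >>         >     }
>> >>         >
>> >>         >     static void compressIfNeeded(
>> >>         >         final java.util.List<Compactor> compactors,
>> >>         >         final java.util.function.Supplier<Compactor> newCompactor) {
>> >>         >       long maxSize = 0, curSize = 0;
>> >>         >       for (final Compactor c : compactors) {
>> >>         >         maxSize += c.nomCap();
>> >>         >         curSize += c.length();
>> >>         >       }
>> >>         >       if (curSize <= maxSize) { return; }  // lazy: still under budget
>> >>         >       for (int h = 0; h < compactors.size(); h++) {
>> >>         >         final Compactor c = compactors.get(h);
>> >>         >         if (c.length() > c.nomCap()) {     // lowest over-full level
>> >>         >           if (h == compactors.size() - 1) {
>> >>         >             compactors.add(newCompactor.get()); // top level: add one above
>> >>         >           }
>> >>         >           c.compactInto(compactors.get(h + 1));
>> >>         >           return;
>> >>         >         }
>> >>         >       }
>> >>         >     }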
>> >>         >
>> >>         >     *Strategy 3.*  This strategy is what I have currently
>> >>         >     implemented in Java (and in my version of Pavel's Python)
>> >>         >     (because I didn't fully understand what Pavel had in mind
>> >>         >     :) ):
>> >>         >     This strategy is similar to Strategy 2, and also allows
>> >>         >     for overflow. The way the regions are treated is
>> >>         >     different:
>> >>         >
>> >>         >       * This compactor also has 3 regions:
>> >>         >           o Region 1 of size nomCap / 2 is never compacted
>> >>         >           o Region 2 is the overflow region of size Length -
>> >>         >             nomCap. It is also not compacted.
>> >>         >           o Region 3 of size numSections * sectionSize.  The
>> >>         >             random m' sections will be compacted based on the
>> >>         >             f(numCompactions) as above.
>> >>         >
>> >>         >     This keeps more values behind and would maintain better
>> >>         >     accuracy longer, but perhaps causes more compactions than
>> >>         >     #2.
>> >>         >
>> >>         >     *Strategy 4.*  This strategy is similar to #2 and #3,
>> >>         >     but again different in how the overflow values are
>> >>         >     treated:
>> >>         >
>> >>         >       * This compactor has 2 regions:
>> >>         >           o Region 1 of size nomCap / 2 is never compacted
>> >>         >           o Region 2 of size numSections * sectionSize',
>> >>         >             where the sectionSize' is dynamically adjusted so
>> >>         >             that numSections * sectionSize' consumes most of
>> >>         >             Length - Region1.  The number m' is chosen as
>> >>         >             before to determine what sections to compact.
>> >>         >
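>> >>         >     To make the region differences concrete, here is an
>> >>         >     illustrative helper (my own names, and my reading of the
>> >>         >     region positions; items are assumed kept sorted ascending
>> >>         >     so the protected low ranks sit at the bottom).  It
>> >>         >     returns the half-open index range one compaction may draw
>> >>         >     from, before the random choice of m' sections narrows it:
>> >>         >
>> >>         >     // Illustration only; nomCap / 2 == numSections * sectionSize.
>> >>         >     static long[] eligibleRange(final int strategy,
>> >>         >         final long length, final long nomCap) {
>> >>         >       switch (strategy) {
>> >>         >         case 1:  // buffer is exactly full when compacting
>> >>         >           return new long[] {nomCap / 2, nomCap};
>> >>         >         case 2:  // top sections plus all overflow eligible
>> >>         >           return new long[] {nomCap / 2, length};
>> >>         >         case 3:  // overflow protected; only top sections eligible
>> >>         >           return new long[] {length - nomCap / 2, length};
>> >>         >         case 4:  // like #2 but with sectionSize' resized to span
>> >>         >                  // the range (Pavel's reply above puts Region 1
>> >>         >                  // at length / 2 instead)
>> >>         >           return new long[] {nomCap / 2, length};
>> >>         >         default:
>> >>         >           throw new IllegalArgumentException("strategy must be 1..4");
>> >>         >       }
>> >>         >     }
>> >>         >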
>> >>         >     All of these strategies are implementable and have
>> >>         >     different tradeoffs with respect to accuracy, size, and
>> >>         >     speed.
>> >>         >
>> >>         >     *Questions:*
>> >>         >
>> >>         >     1. Which of these strategies can provide:
>> >>         >
>> >>         >       * An a priori calculation of the sketch size as a
>> >>         >         f(k, N).
>> >>         >       * An a priori calculation of estimation error with
>> >>         >         confidence intervals.
>> >>         >       * An a posteriori calculation of confidence intervals
>> >>         >         based on the state of the sketch.
>> >>         >       * A defense that the sketch implementation is backed by
>> >>         >         mathematical proofs of correct operation with
>> >>         >         provable error properties.
>> >>         >
>> >>         >     2.  What is your recommendation?
>> >>         >
>> >>         >     Note: I may implement all 4 strategies, so that we can
>> >>         >     fully characterize them, but I would prefer if we could
>> >>         >     limit these options to one (or two) to put into
>> >>         >     production.  :)
>> >>         >
>> >>         >     Your help with these questions would be appreciated.
>> >>         >
>> >>         >     Lee.
>> >>         >
>> >>
>>
>