Posted to dev@datasketches.apache.org by Lee Rhodes <lr...@verizonmedia.com.INVALID> on 2020/05/06 20:37:17 UTC

Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Michael,

Thank you for considering the DataSketches library.   I am adding this
thread to our dev@datasketches.apache.org so that our whole team can
contribute to finding a solution for you.

WRT the error you experienced, please help us help you by sharing with us
what the exact error was.

We are about to release a major upgrade to the DataSketches C++/Python
product in the next few weeks.  We have fixed a number of stability issues
and bugs, which may solve the problem.  Nonetheless, we want to work with
you to get your problem solved.

Updating 1e5 sketches in a system is not a problem in Java or C++.   We
have real-time systems today that generate and process over 1e9 sketches
every day.  Unfortunately our experience tells us that looping in Python
code will be 10 to 100 times slower than Java or C++.  This is because the
code would have to switch from Python to C++ for every vector element.

> By comparison, the streaming-quantiles code could be easily modified to use
> Numpy arrays and operate on vectors.


I would like to understand more about what you have in mind that would be
"easily modified".

NumPy achieves its speed performance by doing all of the matrix operations
in pre-compiled C++ code.  To achieve best performance, we would want to
read and loop through the NumPy data structure on the C++ side leveraging
the C++ DataSketches library directly.  I am not sure what would be
involved to actually accomplish that.

But first we need to get your Python + NumPy code working correctly with
our library so we can find out what its actual performance is.

Cheers,

Lee.





On Wed, May 6, 2020 at 12:10 PM Michael Himes <mh...@knights.ucf.edu>
wrote:

> Hi Edo, Lee,
>
> Thanks for the prompt response.  I looked at the datasketches library, and
> while it seems to have a lot more features, it looks like it'll be a lot
> more difficult to get it to work for my desired use case.
>
> My problem is that I need quantiles for each element of a vector (length
> on the order of 1e4 -- 1e5), for some finite stream of vectors (on the
> order of 1e6 -- 1e8).  I tried using datasketches's KLL with Numpy arrays,
> but it throws an error, so it doesn't seem like datasketches handles this
> situation currently.
>
> To use datasketches, I think I would need to instantiate 1 object per
> vector element, and I suspect this will slow things down considerably due
> to iterating over the objects when each vector is processed.  By
> comparison, the streaming-quantiles code could be easily modified to use
> Numpy arrays and operate on vectors.  I ran a few unit tests on both codes
> and found equivalent behavior, as expected.
>
> Do you have any recommendation(s) for this situation?  Are there known
> limitations of the streaming-quantiles code that would cause issues for my
> use case?  Are the other methods offered in datasketches 'better' than the
> KLL implemented in streaming-quantiles?  I'm quite out of my area of
> expertise, so I appreciate any advice you can offer, and I will of course
> acknowledge it in the publication.
>
> Best,
> Michael
>
> ------------------------------
> *From:* Edo Liberty <ed...@gmail.com>
> *Sent:* Tuesday, May 5, 2020 8:09 PM
> *To:* Lee Rhodes <lr...@verizonmedia.com>; Michael Himes <
> mhimes@knights.ucf.edu>
> *Cc:* edo@edoliberty.com <ed...@edoliberty.com>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> +Lee
>
> Hi Michael, Thanks for reaching out.
> While you can certainly do that, I recommend using the Python-bound
> datasketches library. It will be more robust, faster, and bug-free than my
> code :)
>
> On Tue, May 5, 2020 at 14:11 Michael Himes <mh...@knights.ucf.edu> wrote:
>
> Hi Edo,
>
> I'm currently working on a Python package for machine-learning-accelerated
> exoplanet modeling.  It is free and open source (see here if you're curious
> https://github.com/exosports/HOMER),
> and it's meant purely for reproducible academic research.
>
> I'm adding some new features to the software, and one of them requires
> computing quantiles for a data set that cannot fit into memory.  After
> searching around for different methods to do this, your KLL method seemed
> to be a good option in terms of speed and space requirements.
>
> Rather than reinvent the wheel and code my own implementation of the
> method from scratch, I was wondering if you'd be willing to allow me to use
> your code?  I don't see a license, so I wanted to make sure you're okay
> with this.  I could implement it as a submodule within my repo, or I could
> only include the kll.py file and add some additional comments pointing to
> your repo and such, whichever you prefer.
>
> Best,
> Michael
>
>

Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Posted by Jon Malkin <jo...@gmail.com>.
It'd belong in the sketch description, as the type is defined when the
sketch is instantiated. Please create an issue if you find the
documentation lacking.

  jon

On Mon, May 11, 2020 at 4:58 PM leerho <le...@gmail.com> wrote:

> Then we need clear documentation to explain that in the update method(s).
>
> On Mon, May 11, 2020 at 4:17 PM Jon Malkin <jo...@gmail.com> wrote:
>
>> C++ KLL is templatized so it can accept any user-defined type. 32-bit
>> floats are only a requirement if data portability to Java is essential.
>> There is no requirement that every c++ kll_sketch created be portable. We
>> are unable to enforce that.
>>
>> And I already created an issue for that NaN bug. Included a link to it in
>> the message you just replied to, even :)
>>
>>   jon
>>
>

Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Posted by leerho <le...@gmail.com>.
Then we need clear documentation to explain that in the update method(s).

On Mon, May 11, 2020 at 4:17 PM Jon Malkin <jo...@gmail.com> wrote:

> C++ KLL is templatized so it can accept any user-defined type. 32-bit
> floats are only a requirement if data portability to Java is essential.
> There is no requirement that every c++ kll_sketch created be portable. We
> are unable to enforce that.
>
> And I already created an issue for that NaN bug. Included a link to it in
> the message you just replied to, even :)
>
>   jon
>

Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Posted by Jon Malkin <jo...@gmail.com>.
C++ KLL is templatized so it can accept any user-defined type. 32-bit
floats are only a requirement if data portability to Java is essential.
There is no requirement that every c++ kll_sketch created be portable. We
are unable to enforce that.

And I already created an issue for that NaN bug. Included a link to it in
the message you just replied to, even :)

  jon

Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Posted by leerho <le...@gmail.com>.
   - We need to be clear about input types.  Apparently np.float is
   equivalent to the Python float, which is 64 bits.  Meanwhile, NumPy has
   specific 32 and 64 bit float types: np.float32 and np.float64.   We
   specifically chose 32-bit float types for our implementations of KLL, to
   save space, as 32-bit floats have more than enough precision for nearly all
   quantile operations (see the short example after this list).
   - If our C++ KLL is accepting NaNs it is a bug.
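
To make the distinction concrete, a minimal Python sketch (this assumes the
kll_floats_sketch binding and its update() method behave as in the kll_test.py
file referenced earlier in this thread; k=200 is just an example parameter):

    import numpy as np
    from datasketches import kll_floats_sketch

    data = np.random.randn(1000)        # np.float64 by default, i.e. the Python float
    data32 = data.astype(np.float32)    # explicit 32-bit floats, matching kll_sketch<float>

    sk = kll_floats_sketch(200)
    for x in data32:
        sk.update(float(x))             # values are held internally as 32-bit floats

    print(data.dtype, data32.dtype)     # float64 float32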

Lee.

On Mon, May 11, 2020 at 2:40 PM Jon Malkin <jo...@gmail.com> wrote:

> Hi Michael,
>
> That's not expected! Created an issue for this:
> https://github.com/apache/incubator-datasketches-cpp/issues/133
>
> Thanks for catching that. Will look at it in a bit.
>
>   jon
>
> On Mon, May 11, 2020 at 2:25 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
>> Thanks for taking a look, Jon.
>>
>> I pushed an update that addresses 2 & 4.
>>
>> #3 is actually something I had a question about. I've tested passing
>> numpy.nan into the update function, and it doesn't appear to break anything
>> (min, max, etc all still work correctly).  However, the reported number of
>> items per sketch counts the nan entries.  Is this the expected behavior, or
>> should the get_n() method return a number that does not count the nans it
>> has seen?  I expected the latter, so I'm worried that numpy's nan is being
>> treated differently.
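>>
>> A minimal reproduction of the question, using just the single-sketch binding
>> (a hedged sketch: it assumes kll_floats_sketch exposes update() and get_n()
>> as in the library's kll_test.py; the vectorized class just wraps these calls):
>>
>>     import numpy as np
>>     from datasketches import kll_floats_sketch
>>
>>     sk = kll_floats_sketch(200)
>>     sk.update(1.0)
>>     sk.update(float(np.nan))   # should this count toward n?
>>     print(sk.get_n())          # 1 if NaN is ignored, 2 if it is counted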
>>
>> Michael
>> ------------------------------
>> *From:* Jon Malkin <jo...@gmail.com>
>> *Sent:* Monday, May 11, 2020 4:32 PM
>> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> I didn't look in super close detail, but the code overall looks pretty
>> good. Comments are below.
>>
>> Note that not all of these necessarily need changes or replies. I'm just
>> trying to document things we'll want to think about for keeping the library
>> general-purpose (and we can always make changes after merging, of course).
>>
>> 1. I worry the name kll_sketches is confusingly similar to kll_sketch.
>> Maybe vector_kll_sketches? But if there's a way to extend KLL in the future
>> to operate on an entire vector at a time (vs treating each dimension
>> independently) that'd become confusing. I think an inherently vectorized
>> version would be a very different beast, but I always worry I'm not being
>> imaginative enough. If merging into the Apache codebase, I'd probably wait
>> to see what the file looks like with the renaming before a final decision
>> on moving to its own file.
>>
>> 2. What happens if the input to update() has >2 dimensions? If that'd be
>> invalid, we should explicitly check and complain. If it'll Do The Right
>> Thing by operating on the first 2 dimensions (meaning correct indices)
>> that's fine, but otherwise should probably complain.
>>
>> 3. Can this handle sparse input vectors? Not sure how important that is
>> in general, even if your project doesn't require it. kll_sketch will ignore
>> NaNs, so those appearing would mean the number of items per sketch can
>> already differ.
>>
>> 4. I'd probably eat the very slightly increased space and go with 32 bits
>> for the number of dimensions (aka number of sketches). If trying to look at
>> a distribution of values for some machine learning application, it'd be
>> easy to overflow 65k dimensions for some tasks.
>>
>> 5. I imagine you've realized that it's easiest to do unit tests from
>> python in this case. That's another advantage of having this live in the
>> wrapper.
>>
>> 6. Finally, that assert issue is already obsolete :). Asserts were
>> converted to if/throw exceptions late last week. It'll be flagged as a
>> conflict in merging, so no worries for now.
>>
>> Looking good at this point. And as I said, not all of these need changes
>> or comments from you.
>>
>>   jon
>>
>> On Mon, May 11, 2020 at 7:09 AM Michael Himes <mh...@knights.ucf.edu>
>> wrote:
>>
>> Understood, I went ahead and moved the new class to the kll_wrapper.cpp
>> file -- I'll leave it to you to decide if it's better as its own file.
>>
>> Also, while gcc 7.4.0 compiles the code without issue, using gcc 7.5.0
>> throws errors regarding the assert calls in kll_sketch_impl.hpp.  I added
>> an include of assert.h there and then it compiled without issue.  It's
>> possible that other compilers will also complain about that, so maybe this
>> is a good update to the main branch.
>>
>> Michael
>> ------------------------------
>> *From:* Jon Malkin <jo...@gmail.com>
>> *Sent:* Sunday, May 10, 2020 10:47 PM
>> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> My only comment without having looked at actual code is that the new
>> class would be more appropriate in the python wrapper. Maybe even drop it
>> in as its own file, as that would decrease recompile time a bit when
>> debugging (that's pybind's suggestion, anyway). Probably not a huge
>> difference with how light these wrappers are.
>>
>> If this is something that becomes widely used, to where we look at
>> pushing it into the base library, we'd look at whether we could share any
>> data across sketches. But we're far from that point currently. It'd be nice
>> to someday need to consider that.
>>
>>   jon
>>
>> On Sun, May 10, 2020, 7:33 PM leerho <le...@gmail.com> wrote:
>>
>> Michael,  this has been a great interchange and certainly will allow us
>> to move forward more quickly.
>>
>> Thank you for working on this on a Mother's Day Sunday!
>>
>> I'm sure Alex and Jon may have more questions, when they get a chance to
>> look at it starting tomorrow.
>>
>> Cheers, and be safe and well!
>>
>> Lee.
>>
>> On Sun, May 10, 2020 at 6:25 PM Michael Himes <mh...@knights.ucf.edu>
>> wrote:
>>
>> Re: testing, so far I've just done glorified unit tests for uniform and
>> normal distributions of varying sizes.  I plan to do some timing tests vs
>> the existing single-sketch Python class to see how it compares for 1, 10,
>> and 100 streams.
>>
>> 1. That makes sense.  One option to allow full Numpy compatibility but
>> without requiring a Python user to use Numpy would be to return everything
>> as lists, rather than Numpy arrays.  Numpy users could then convert those
>> lists into arrays, and non-Numpy users would be unaffected (aside from
>> needing the pybind11/numpy.h header).  Alternatively, some flag could be
>> set when instantiating the object that would control whether things are
>> returned as lists or arrays, though this still requires the numpy.h header
>> file.
>>
>> 2. I didn't change the kll_sketch code, I only defined a new (wrapper)
>> class called kll_sketches, which spawns a user-specified number of
>> sketches.  Each of those sketches are kll_sketch objects and uses all of
>> the existing code for that.  For fast execution in Python, the parallel
>> sketches must be spawned in C++, but the existing Python object could only
>> spawn a single sketch since it wraps the kll_sketch class.  Perhaps the
>> kll_sketches class would be better placed in the python/src/kll_wrapper.cpp
>> file?  I suppose you wouldn't need this class if you weren't using Python.
>>
>> 3. Yes, SerDe is very straight-forward here.  I've marked some stuff as
>> TODOs, and that is one of them -- the plan is to do like you described and
>> call the relevant kll_sketch method on each of the sketches and return that
>> to Python in a sensible format (a rough sketch of that iteration follows this
>> list).  For deserialization, it would just iterate through them and load them
>> into the kll_sketches object.  I don't require it for my project, so I didn't
>> bother to wrap that yet -- I'll take a look sometime this week after I finish
>> my work for the day, shouldn't take long to do.
>>
>> 4. That makes sense.  Does using Numpy complicate that at all?  My
>> thought is that since under the hood everything is using the existing
>> kll_sketch class, it would have full compatibility with the rest of the
>> library (once SerDe is added in).
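>>
>> Here is the rough sketch of that per-sketch (de)serialization loop mentioned
>> in point 3 (hedged: it assumes the Python binding exposes serialize() and
>> deserialize() on kll_floats_sketch, mirroring the C++ methods; the helper
>> names are illustrative only):
>>
>>     from datasketches import kll_floats_sketch
>>
>>     def serialize_all(sketches):
>>         # one bytes object per dimension; how to package them is still open
>>         return [sk.serialize() for sk in sketches]
>>
>>     def deserialize_all(blobs):
>>         return [kll_floats_sketch.deserialize(b) for b in blobs]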
>>
>> Michael
>> ------------------------------
>> *From:* leerho <le...@gmail.com>
>> *Sent:* Sunday, May 10, 2020 8:42 PM
>> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> Thanks for the link to your code.  My colleagues, Jon and Alex, will take
>> a closer look this next week.  They wrote this code so they are much closer
>> to it than I.
>>
>> What you have done so far makes sense for you as you want to get this
>> working in the NumPy environment as quickly as possible.  As soon as we
>> start thinking about incorporating this into our library other concerns
>> become important.
>>
>> 1. Adding API calls is the recommended way to add functionality (like
>> NumPy) to a library.  We cannot change API calls in a way that is only
>> useful with NumPy, because it would seriously impact other users of the
>> library that don't need NumPy.  If both sets of calls cannot simultaneously
>> exist in the same sketch API, then we need to consider other alternatives.
>>
>> 2.  Based on our previous discussions, I didn't envision that you would
>> have to change the kll_sketch code itself other than perhaps a "wrapper"
>> class that enables vectorized input to a vector of sketches and a
>> vectorized get result that creates a vector result from a vector of
>> sketches.  This would isolate the changes you need for NumPy from the
>> sketch itself.  This is also much easier to support, maintain and debug.
>>
>> 3. If you don't change the internals of the sketch then SerDe becomes
>> pretty straightforward. I don't know if you need a single serialization
>> that represents a full vector of sketches,  but if you do, then I would
>> just iterate over the individual serdes and figure out how to package it.
>> I really don't think you want to have to rewrite this low-level stuff.
>>
>> 4. Binary compatibility is critically important for us and I think will
>> be important for you as well.  There are two dimensions of binary
>> compatibility: history and language.  This means that a kll sketch
>> serialized from Java can be successfully read by C++ and vice versa.
>> Similarly, a kll sketch serialized today will be able to be read many years
>> from now.     Another aspect of this would mean being able to collect, say,
>> 100 sketches that were not created using the NumPy version, and being able
>> to put them together in a NumPy vector; and vice versa.
>>
>> I hope all of this makes sense to you.
>>
>> Cheers,
>>
>> Lee.
>>
>>
>>
>> On Sun, May 10, 2020 at 4:21 PM leerho <le...@gmail.com> wrote:
>>
>> Michael,
>> This is great!  What testing have you been able to do so far?
>>
>>
>> On Sun, May 10, 2020 at 3:31 PM Michael Himes <mh...@knights.ucf.edu>
>> wrote:
>>
>> Lee,
>>
>> Thanks for all of that information, it's quite helpful to get a better
>> understanding of things.
>>
>> I've put the code on Github if you'd like to take a look:
>> https://github.com/mdhimes/incubator-datasketches-cpp
>>
>> Changes are
>> - new class in kll/include/kll_sketch.hpp, w/ associated constructor in
>> kll/include/kll_sketch_impl.hpp.  This class spawns a specified number of
>> sketches.
>> - new Python interface functions in python/src/kll_wrapper.cpp
>>
>> The only new dependency introduced is the pybind11/numpy.h header file.
>> The new Numpy-compatible Python classes retain identical functionality to
>> the existing classes (with minor changes to method names, e.g.,
>> get_min_value --> get_min_values), except that I have not yet implemented
>> merging or (de)serialization.  These would be straight-forward to
>> implement, if needed.
>>
>> Re: characterization tests, I'll take a look at those tests you linked to
>> and see about running them, time and compute resources permitting.
>>
>> Michael
>> ------------------------------
>> *From:* leerho <le...@gmail.com>
>> *Sent:* Sunday, May 10, 2020 5:32 PM
>> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> Michael,
>>
>> Is there a place on GitHub somewhere where I could look at your code so
>> far?  The reason I ask is that, before you do a PR, we would like to determine
>> where a contribution such as this should be placed.
>>
>> Our library is split up among different repositories, determined by
>> language and dependencies.  This keeps the user downloads smaller and more
>> focused.   We have two library repos for the core sketch algorithms, one
>> for Java and one for C++/Python, where the dependencies are very lean,
>> which simplifies integration into other systems.  We have separate repos
>> for adaptors, which depend on one of the core repos. On the Java side, we
>> have separate repos for adaptors for Apache Hive and Apache Pig, as the
>> dependencies for each of these are quite large.  For C++, we have a
>> dedicated repo for the adaptors for PostgreSQL.
>>
>> Some of our adaptors are hosted with the target system.  For example, our
>> Druid adaptors were contributed directly into Apache Druid.
>>
>> I assume your code has dependencies on Python, NumPy and
>> DataSketches-cpp. It is not clear to me at the moment whether we should
>> create a separate repo for this or have a separate group of directories in
>> our cpp repo.
>>
>> ****
>> We have a separate repo for our characterization code, which is not
>> formally "released" as an Apache release.  It exists because we want others
>> to be able to reproduce (or challenge) our claims of speed performance or
>> accuracy.  It is the one repo where we have all languages and many
>> different dependencies.  The coding style is not as rigorous or as well
>> documented as our repos that do have formal releases.
>>
>> Characterization testing is distinctly different from Unit Tests, which
>> basically checks all the main code paths and makes sure that the program
>> works as it should.  The key metric is code coverage and Unit Tests should
>> be fast as they are run on every check-in of new code.  Characterization is
>> also different from Integration Testing, which is testing how well the code
>> works when integrated into larger systems.
>>
>> Characterization tests are unique to our kind of library. Because our
>> algorithms are probabilistic in nature, in order to verify accuracy or
>> speed performance we need to run many thousands of trials to eliminate
>> statistical noise in the results.  And when the data is large, this can
>> take a long time.  You can peruse our website for many examples as all the
>> plots result from various characterization studies.  What appears on the
>> website is but a small fraction of all the testing we have done.
>>
>> There are no "standard" tests as every sketch is different so we have to
>> decide what is important to measure for a particular sketch, but the basic
>> groups are *speed* and *accuracy*.
>>
>> For speed there are many possible measurements, but the basic ones are
>> update speed, merge speed, Serialization / Deserialization speed, get
>> estimate or get result speeds.
>>
>> For accuracy we want to validate that the sketch is performing within the
>> bounds of the theoretical error distribution.  We want to measure this
>> accuracy in the context of a stand-alone, purely streaming sketch and also
>> in the context of merging many sketches together.
>>
>> We also try to do these same tests comparing the results against other
>> alternatives users might have.  We have performed these same
>> characterizations on other publicly available sketches as well as against
>> traditional, brute-force approaches to solving the same problem.
>>
>> For the solution you have developed, we would depend on you to decide
>> what properties would be most important to characterize for users of this
>> solution.  It should be very similar to what you would write in a paper
>> describing this solution;  you want to convince the reader that this is
>> very useful and why.
>>
>> Since the first sketch you have leveraged is the KLL quantiles sketch, I
>> would think you would want some characterizations similar to what we did
>> for our studies
>> <https://datasketches.apache.org/docs/Quantiles/KLLSketch.html>
>> comparing our older quantiles sketch and the KLL sketch.
>>
>> ****
>> For the Java characterization tests, we have "standardized" on having
>> small configuration files which define the key parameters of the test.
>> These are simple text files
>> <https://github.com/apache/incubator-datasketches-characterization/tree/master/src/main/resources>
>> of key-value pairs.  We don't have any centralized definition of these
>> pairs, just that they are human readable and intelligible.  They are
>> different for each type of sketch.
>>
>> For the C++ tests, we don't have a collection of config files yet (this
>> is one of our TODOs), but the same kind of parameters are set in the code
>> itself.
>>
>> We will likely want to set up a separate directory for your
>> characterization tests.
>>
>> I hope you find this helpful.
>>
>> Cheers,
>>
>> Lee.
>>
>> On Sun, May 10, 2020 at 10:05 AM Michael Himes <mh...@knights.ucf.edu>
>> wrote:
>>
>> The code is in a good state now.  It can take individual values, lists,
>> or Numpy arrays as input, and it returns back Numpy arrays.  There are some
>> additional features, like being able to specify which sketches the user
>> wants to, e.g., get quantiles for.
>>
>> But, I have only done minor testing with uniform and normal
>> distributions.  I'd like to put it through more extensive testing (and some
>> documentation) before releasing it, and it sounds like your
>> characterization tests are the way to go -- it's not science if it's not
>> reproducible!  Is there a standard set of tests for this purpose?  If not,
>> are there standard tests that have been used for the existing codebase?
>>
>> Michael
>> ------------------------------
>> *From:* leerho <le...@gmail.com>
>> *Sent:* Saturday, May 9, 2020 7:21 PM
>> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> This is great.  The first step is to get your project working!  Once you
>> think you are ready, it would be really useful if you could do some
>> characterization testing in the NumPy environment. Characterization tests
>> are what we run to fully understand how a sketch performs over a range of
>> parameters and using thousands to millions of trials.  You can see some of
>> the accuracy and speed performance plots of various sketches on our
>> website.  Sometimes these can take hours to run.  We typically use
>> synthetic data to drive our characterization tests to make them
>> reproducible.
>>
>> Real data can also be used and one comparison test I would recommend is
>> comparing how long it takes to get approximate results using sketches
>> versus how long it would take to get exact results using brute force
>> methods.  The bigger the data set is the better :)
>>
>> We don't have much experience with NumPy so this will be a new
>> environment for us.  But before you get too deep into this please get us
>> involved.  We have been characterizing these streaming algorithms for a
>> number of years, and would like to help you.
>>
>> Cheers,
>>
>> Lee.
>>
>> On Sat, May 9, 2020 at 2:18 PM Michael Himes <mh...@knights.ucf.edu>
>> wrote:
>>
>> I'm not quite sure what being a committer entails, but yeah I'm happy to
>> contribute.  I can't commit a lot of time to working on it, but with how
>> things went for KLL I don't think it will take a lot of time for the other
>> sketches if they are formatted in a similar manner.  Getting this library
>> integrated into numpy/scipy would be awesome, I'm sure I could get some
>> others in my field to begin using it.
>>
>> Michael
>> ------------------------------
>> *From:* Lee Rhodes <lr...@verizonmedia.com>
>> *Sent:* Saturday, May 9, 2020 5:06 PM
>> *To:* Michael Himes <mh...@knights.ucf.edu>; dev@datasketches.apache.org
>> <de...@datasketches.apache.org>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> This is just awesome!   Would you be interested in becoming a committer
>> on our project?  It is not automatic, but we could work with you to bring
>> you up to speed on the other sketches in the library.  If you could help us
>> integrate DataSketches into NumPy and possibly SciPy (not sure if this is
>> necessary) it would be a very significant contribution and we would
>> definitely want you to be part of our community!
>>
>> Thanks,
>>
>> Lee.
>>
>> On Sat, May 9, 2020 at 1:41 PM Michael Himes <mh...@knights.ucf.edu>
>> wrote:
>>
>> Hi Lee,
>>
>> Thanks for the notice, I went ahead and subscribed to the list.
>>
>> As for Jon's email, this is actually what I have currently implemented!
>> Once I finish ironing out a couple improvements, I'm going to move some
>> code around to follow the existing coding style, put it on Github, and
>> submit a pull request.
>>
>> Michael
>> ------------------------------
>> *From:* Lee Rhodes <lr...@verizonmedia.com>
>> *Sent:* Saturday, May 9, 2020 4:22 PM
>> *To:* Michael Himes <mh...@knights.ucf.edu>
>> *Subject:* Fwd: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> Hi Michael,
>> I don't think you saw this email as I doubt you are subscribed to our
>> dev@datasketches.apache.org email list.
>>
>> We would like to have you as part of our larger community, as others
>> might also have suggestions on how to move your project forward.
>> You can subscribe by sending an empty email to
>> dev-subscribe@datasketches.apache.org.
>>
>> Lee.
>>
>> ---------- Forwarded message ---------
>> From: *Jon Malkin* <jo...@gmail.com>
>> Date: Thu, May 7, 2020 at 4:11 PM
>> Subject: Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>> To: <de...@datasketches.apache.org>
>> Cc: Lee Rhodes <lr...@verizonmedia.com>, Edo Liberty <
>> edo.liberty@gmail.com>, edo@edoliberty.com <ed...@edoliberty.com>
>>
>>
>> We're using pybind11 to get a C++ interface with python (vs raw C). The
>> wrappers themselves are quite thin, but they do have examples of calling
>> functions defined in the wrapper as opposed to only the sketch object.
>>
>> I believe the easiest way to do this will be to define a pretty simple
>> C++ object and create a pybind wrapper for it.  That object would contain a
>> std::vector<kll_sketch>.  Then you'd define an update method for your
>> custom object that iterates through a numpy array and calls update() on the
>> appropriate sketch. You'd also want to define something similar for
>> get_quantile() or whatever other methods you need that iterates through
>> that vector of sketches and returns the result in a numpy array.
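>>
>> To make the shape of that object concrete, here is a pure-Python stand-in
>> for the idea above (a hedged sketch only: the real version would be a small
>> C++ class holding a std::vector<kll_sketch> behind a pybind11 wrapper; the
>> class and method names below are made up for illustration, while
>> kll_floats_sketch, update() and get_quantile() are the existing bindings):
>>
>>     import numpy as np
>>     from datasketches import kll_floats_sketch
>>
>>     class vector_of_kll_sketches:
>>         def __init__(self, num_dims, k=200):
>>             self._sketches = [kll_floats_sketch(k) for _ in range(num_dims)]
>>
>>         def update(self, arr):
>>             # arr is a 1-D numpy array with one value per dimension
>>             for sk, x in zip(self._sketches, arr):
>>                 sk.update(float(x))
>>
>>         def get_quantiles(self, rank):
>>             # one quantile per dimension, returned as a numpy array
>>             return np.array([sk.get_quantile(rank) for sk in self._sketches])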
>>
>> That's a pretty lightweight object. And then you'd use a similar thin
>> pybind wrapper around it to make it play nicely with python. Since our C++
>> library is just templates, you'd end up with a free-standing library, with
>> no requirement that the base datasketches library be involved.
>>
>>   jon
>>
>> On Thu, May 7, 2020 at 1:08 PM Michael Himes <mh...@knights.ucf.edu>
>> wrote:
>>
>> I would be happy to share whatever I come up with (if anything).  The
>> lack of a Numpy/Scipy implementation is what led me to the DataSketches
>> library, it would be very useful to myself and others if it were a part of
>> Numpy/Scipy.
>>
>> For what it's worth, passing in a Numpy array and manipulating it from
>> the C++ side is quite easy.  On the other hand, figuring out how to spawn m
>> sketches and pass the values along to that looks like it'll be more
>> challenging; there is a lot of code here and it'll take some time for me to
>> familiarize myself with it.
>>
>> Michael
>> ------------------------------
>> *From:* Lee Rhodes <lr...@verizonmedia.com>
>> *Sent:* Thursday, May 7, 2020 12:00 PM
>> *To:* Michael Himes <mh...@knights.ucf.edu>
>> *Cc:* Edo Liberty <ed...@gmail.com>; dev@datasketches.apache.org <
>> dev@datasketches.apache.org>; edo@edoliberty.com <ed...@edoliberty.com>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> If you do figure out how to do this, it would be great if you could share
>> it with us.  We would like to extend  it to other sketches and submit it as
>> an added functionality to NumPy.  I have been looking at the NumPy and
>> SciPy libraries and have not found anything close to what we have.
>>
>> Lee.
>>
>>
>> On Thu, May 7, 2020 at 7:08 AM Michael Himes <mh...@knights.ucf.edu>
>> wrote:
>>
>> Hi Lee, Jon,
>>
>> Thanks for the information.  I tried to vectorize things this morning and
>> ran into that exact problem -- since the offsets can differ, it leads to
>> slices of different lengths, which wouldn't be possible to store as a
>> single Numpy array.
>>
>> Lee, your understanding of my problem is spot on.  n vectors of size m,
>> where all m elements of each vector are a float (no NaNs or missing
>> values).  I am interested in quantiles at rank r for each of the m
>> streams.  Only 1 sketch will operate simultaneously, saving/loading the
>> sketch is not required (though it would be a nice feature), and sketches
>> would not need to be merged (no serialization/deserialization).
>>
>> Not surprisingly, it looks like your original suggestion of handling this
>> on the C++ side is the way to go.  Once I have time to dive into the code,
>> my plan is to write something that implements what you described in the
>> earlier email.
>>
>> Thanks,
>> Michael
>> ------------------------------
>> *From:* Lee Rhodes <lr...@verizonmedia.com>
>> *Sent:* Wednesday, May 6, 2020 10:43 PM
>> *To:* Michael Himes <mh...@knights.ucf.edu>
>> *Cc:* dev@datasketches.apache.org <de...@datasketches.apache.org>; Edo
>> Liberty <ed...@gmail.com>; edo@edoliberty.com <ed...@edoliberty.com>
>>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> Michael,
>>
>> One of my colleagues, Jon Malkin, pointed out that the vector-KLL will
>> not work for another reason and that is for each dimension, choosing
>> whether to delete the odd or even values in the compactor must be random
>> and independent of the other dimensions.  Otherwise you might get unwanted
>> correlation effects between the dimensions.
>>
>> This is another argument that you should have independent compactors for
>> each dimension.  So you might as well stick with individual sketches for
>> each dimension.
>>
>> Lee.
>>
>> On Wed, May 6, 2020 at 4:39 PM Lee Rhodes <lr...@verizonmedia.com>
>> wrote:
>>
>> Michael,
>>
>> Allow me to back up for a moment to make sure I understand your problem.
>>
>> You have a large number of large vectors of the form *V_n = {x_i}:*  *n*
>> vectors of size *m*, where *x* is a *number* and *x_i* is the *i*th
>> element, or equivalently, the *i*th dimension.
>>
>> Assumptions:
>>
>>    - All vectors, *V*, are of the same size *m.*
>>    - All elements, *x_i*, are valid numbers of the same type. No missing
>>    values, and if you are using *floats*, this means no *NaN*s.
>>
>> In aggregate, the *n* vectors represent *m* *independent* distributions
>> of values.
>>
>> Your task is to be able to obtain *m* quantiles at rank *r* in a single
>> query.
>>
>> ****
>> To do this, using your idea, would require vectorization of the entire
>> sketch and not just the compactors.  The inputs are vectors, and the results
>> of operations such as getQuantile(r), getQuantileUpperBound(r), and
>> getQuantileLowerBound(r) are also vectors.
>>
>> This sketch will be a large data structure, which leads to more questions
>> ...
>>
>>    - Do you anticipate having many of these vectorized sketches
>>    operating simultaneously?
>>    - Is there any requirement to store and later retrieve this sketch?
>>    - Or, the nearly equivalent question: Do you require merging of these
>>    sketches (across clusters, for example)?  Which also means serialization
>>    and deserialization.
>>
>> I am concerned that this vector-quantiles sketch would be limited in the
>> sense that it may not be as widely applicable as it could be.
>>
>> Our experience with real data is that it is ugly with missing values,
>> NaN, nulls, etc.  Which means we would not be able to vectorize the
>> compactor.  Each dimension *i* would need a separate independent
>> compactor because the compaction times will vary depending on missing
>> values or NaNs in the data.
>>
>> Spacewise, I don't think having separate independent sketches for each
>> dimension would be much smaller than vectorizing the entire sketch, because
>> the internals of the existing sketch are already quite space efficient
>> leveraging compact arrays, etc.
>>
>> As a first step I would favor figuring out how to access the NumPy data
>> structure on the C++ side, having individual sketches for each
>> dimension, and doing the iterations updating the sketches in C++.   It also
>> has the advantage of leveraging code that exists and it would automatically
>> be able to leverage any improvements to the sketch code over time.  In
>> addition, it could be a prototype of how to integrate other sketches into
>> the NumPy ecosystem.
>>
>> A fully vectorized sketch would be a separate implementation and would
>> not be able to take advantage of these points.
>>
>> Lee.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On Wed, May 6, 2020 at 2:47 PM Michael Himes <mh...@knights.ucf.edu>
>> wrote:
>>
>> Hi Lee,
>>
>> I don't think there is a problem with the DataSketches library, just that
>> it doesn't support what I am trying to do -- looking in the documentation,
>> it only supports streams of ints or floats, and those situations work fine
>> for me.  Here's what I did:
>> - began with the KLL test .py file:
>> https://github.com/apache/incubator-datasketches-cpp/blob/master/python/tests/kll_test.py
>> - replaced line 30 with kll.update(np.ones(10) * randn())  to have a
>> Numpy array of 10 identical values.
>> - ran the code
>>
>> This leads to the following error, as expected:
>> TypeError: update(): incompatible function arguments. The following
>> argument types are supported:
>>     1. (self: datasketches.kll_floats_sketch, item: float) -> None
>>
>> Invoked with: <datasketches.kll_floats_sketch object at 0x7f1e128989d0>,
>> array([-1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424,
>>        -1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424])
>>
>> It's not coded to support Numpy arrays, therefore it complains.  What I
>> would ideally like to have happen in this scenario is it would treat each
>> element in the array as a separate stream.  Then, later when getting a
>> given quantile, it would give 10 values, one for each stream.  I don't see
>> an easy approach to implementing this on the Python side besides a very
>> slow iterative approach, and admittedly my C++ is quite rusty so I haven't
>> looked into the codebase to see how I might modify things there to support
>> this functionality.
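>>
>> For reference, the slow iterative approach mentioned above would look roughly
>> like this (purely illustrative; one kll_floats_sketch per vector element,
>> with the per-element loop happening in Python):
>>
>>     import numpy as np
>>     from datasketches import kll_floats_sketch
>>
>>     m = 10                                  # vector length
>>     sketches = [kll_floats_sketch(200) for _ in range(m)]
>>
>>     for _ in range(1000):                   # stand-in for the stream of vectors
>>         vec = np.random.randn(m)
>>         for i in range(m):                  # this Python-level loop is the slow part
>>             sketches[i].update(float(vec[i]))
>>
>>     q50 = [sk.get_quantile(0.5) for sk in sketches]   # one median per stream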
>>
>> Re: the streaming-quantiles code being easily modified, I believe the
>> only necessary changes would be changing the Compactor class to be a
>> subclass of numpy.ndarray, rather than list, and implementing the
>> list-specific methods that are used, like .append().  Then, it isn't
>> necessary to loop over the streams since we can make use of Numpy's
>> broadcasting, which will handle the looping in its C++ code, as you
>> mentioned.  I'll work on this and see if it really is as straight-forward
>> as it seems.
>>
>> If you have any advice on how to use DataSketches for my problem, I'm
>> certainly open to that.
>>
>> Thanks,
>> Michael
>> ------------------------------
>> *From:* Lee Rhodes <lr...@verizonmedia.com>
>> *Sent:* Wednesday, May 6, 2020 4:37 PM
>> *To:* Michael Himes <mh...@knights.ucf.edu>; dev@datasketches.apache.org
>> <de...@datasketches.apache.org>
>> *Cc:* Edo Liberty <ed...@gmail.com>; edo@edoliberty.com <
>> edo@edoliberty.com>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> Michael,
>>
>> Thank you for considering the DataSketches library.   I am adding this
>> thread to our dev@datasketches.apache.org so that our whole team can
>> contribute to finding a solution for you.
>>
>> WRT the error you experienced, please help us help you by sharing with us
>> what the exact error was.
>>
>> We are about to release a major upgrade to the DataSketches C++/Python
>> product in the next few weeks.  We have fixed a number of stability issues
>> and bugs, which may solve the problem.  Nonetheless, we want to work with
>> you to get your problem solved.
>>
>> Updating 1e5 sketches in a system is not a problem in Java or C++.   We
>> have real-time systems today that generate and process over 1e9 sketches
>> every day.  Unfortunately our experience tells us that looping in Python
>> code will be 10 to 100 times slower than Java or C++.  This is because the
>> code would have to switch from Python to C++ for every vector element.
>>
>> By comparison, the streaming-quantiles code could be easily modified to
>> use Numpy arrays and operate on vectors.
>>
>>
>> I would like to understand more about what you have in mind that would be
>> "easily modified".
>>
>> NumPy achieves its speed performance by doing all of the matrix
>> operations in pre-compiled C++ code.  To achieve best performance, we would
>> want to read and loop through the NumPy data structure on the C++ side
>> leveraging the C++ DataSketches library directly.  I am not sure what would
>> be involved to actually accomplish that.
>>
>> But first we need to get your Python + NumPy code working correctly with
>> our library so we can find out what its actual performance is.
>>
>> Cheers,
>>
>> Lee.
>>
>>
>>
>>
>>
>> On Wed, May 6, 2020 at 12:10 PM Michael Himes <mh...@knights.ucf.edu>
>> wrote:
>>
>> Hi Edo, Lee,
>>
>> Thanks for the prompt response.  I looked at the datasketches library,
>> and while it seems to have a lot more features, it looks like it'll be a
>> lot more difficult to get it to work for my desired use case.
>>
>> My problem is that I need quantiles for each element of a vector (length
>> on the order of 1e4 -- 1e5), for some finite stream of vectors (on the
>> order of 1e6 -- 1e8).  I tried using datasketches's KLL with Numpy arrays,
>> but it throws an error, so it doesn't seem like datasketches handles this
>> situation currently.
>>
>> To use datasketches, I think I would need to instantiate 1 object per
>> vector element, and I suspect this will slow things down considerably due
>> to iterating over the objects when each vector is processed.  By
>> comparison, the streaming-quantiles code could be easily modified to use
>> Numpy arrays and operate on vectors.  I ran a few unit tests on both codes
>> and found equivalent behavior, as expected.
>>
>> Do you have any recommendation(s) for this situation?  Are there known
>> limitations of the streaming-quantiles code that would cause issues for my
>> use case?  Are the other methods offered in datasketches 'better' than the
>> KLL implemented in streaming-quantiles?  I'm quite out of my area of
>> expertise, so I appreciate any advice you can offer, and I will of course
>> acknowledge it in the publication.
>>
>> Best,
>> Michael
>>
>> ------------------------------
>> *From:* Edo Liberty <ed...@gmail.com>
>> *Sent:* Tuesday, May 5, 2020 8:09 PM
>> *To:* Lee Rhodes <lr...@verizonmedia.com>; Michael Himes <
>> mhimes@knights.ucf.edu>
>> *Cc:* edo@edoliberty.com <ed...@edoliberty.com>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> +Lee
>>
>> Hi Michael, Thanks for reaching out.
>> While you can certainly do that, I recommend using the Python-bound
>> datasketches library. It will be more robust, faster, and bug-free than my
>> code :)
>>
>> On Tue, May 5, 2020 at 14:11 Michael Himes <mh...@knights.ucf.edu>
>> wrote:
>>
>> Hi Edo,
>>
>> I'm currently working on a Python package for
>> machine-learning-accelerated exoplanet modeling.  It is free and open
>> source (see here if you're curious https://github.com/exosports/HOMER),
>> and it's meant purely for reproducible academic research.
>>
>> I'm adding some new features to the software, and one of them requires
>> computing quantiles for a data set that cannot fit into memory.  After
>> searching around for different methods to do this, your KLL method seemed
>> to be a good option in terms of speed and space requirements.
>>
>> Rather than reinvent the wheel and code my own implementation of the
>> method from scratch, I was wondering if you'd be willing to allow me to use
>> your code?  I don't see a license, so I wanted to make sure you're okay
>> with this.  I could implement it as a submodule within my repo, or I could
>> only include the kll.py file and add some additional comments pointing to
>> your repo and such, whichever you prefer.
>>
>> Best,
>> Michael
>>
>> --
>> From my cell phone.
>>
>>

Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Posted by Jon Malkin <jo...@gmail.com>.
Hi Michael,

That's not expected! Created an issue for this:
https://github.com/apache/incubator-datasketches-cpp/issues/133

Thanks for catching that. Will look at it in a bit.

  jon

On Mon, May 11, 2020 at 2:25 PM Michael Himes <mh...@knights.ucf.edu>
wrote:

> Thanks for taking a look, Jon.
>
> I pushed an update that addresses 2 & 4.
>
> #3 is actually something I had a question about. I've tested passing
> numpy.nan into the update function, and it doesn't appear to break anything
> (min, max, etc all still work correctly).  However, the reported number of
> items per sketch counts the nan entries.  Is this the expected behavior, or
> should the get_n() method return a number that does not count the nans it
> has seen?  I expected the latter, so I'm worried that numpy's nan is being
> treated differently.
>
> Michael
> ------------------------------
> *From:* Jon Malkin <jo...@gmail.com>
> *Sent:* Monday, May 11, 2020 4:32 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> I didn't look in super close detail, but the code overall looks pretty
> good. Comments are below.
>
> Note that not all of these necessarily need changes or replies. I'm just
> trying to document things we'll want to think about for keeping the library
> general-purpose (and we can always make changes after merging, of course).
>
> 1. I worry the name kll_sketches is confusingly similar to kll_sketch.
> Maybe vector_kll_sketches? But if there's a way to extend KLL in the future
> to operate on an entire vector at a time (vs treating each dimension
> independently) that'd become confusing. I think an inherently vectorized
> version would be a very different beast, but I always worry I'm not being
> imaginative enough. If merging into the Apache codebase, I'd probably wait
> to see what the file looks like with the renaming before a final decision
> on moving to its own file.
>
> 2. What happens if the input to update() has >2 dimensions? If that'd be
> invalid, we should explicitly check and complain. If it'll Do The Right
> Thing by operating on the first 2 dimensions (meaning correct indices)
> that's fine, but otherwise should probably complain.
>
> 3. Can this handle sparse input vectors? Not sure how important that is in
> general, even if your project doesn't require it. kll_sketch will ignore
> NaNs, so those appearing would mean the number of items per sketch can
> already differ.
>
> 4. I'd probably eat the very slightly increased space and go with 32 bits
> for the number of dimensions (aka number of sketches). If trying to look at
> a distribution of values for some machine learning application, it'd be
> easy to overflow 65k dimensions for some tasks.
>
> 5. I imagine you've realized that it's easiest to do unit tests from
> python in this case. That's another advantage of having this live in the
> wrapper.
>
> 6. Finally, that assert issue is already obsolete :). Asserts were
> converted to if/throw exceptions late last week. It'll be flagged as a
> conflict in merging, so no worries for now.
>
> Looking good at this point. And as I said, not all of these need changes
> or comments from you.
>
>   jon
>
> On Mon, May 11, 2020 at 7:09 AM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Understood, I went ahead and moved the new class to the kll_wrapper.cpp
> file -- I'll leave it to you to decide if it's better as its own file.
>
> Also, while gcc 7.4.0 compiles the code without issue, using gcc 7.5.0
> throws errors regarding the assert calls in kll_sketch_impl.hpp.  I added
> an include of assert.h there and then it compiled without issue.  It's
> possible that other compilers will also complain about that, so maybe this
> is a good update to the main branch.
>
> Michael
> ------------------------------
> *From:* Jon Malkin <jo...@gmail.com>
> *Sent:* Sunday, May 10, 2020 10:47 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> My only comment without having looked at actual code is that the new class
> would be more appropriate in the python wrapper. Maybe even drop it in as
> its own file, as that would decrease recompile time a bit when debugging
> (that's pybind's suggestion, anyway). Probably not a huge difference with
> how light these wrappers are.
>
> If this is something that becomes widely used, to where we look at pushing
> it into the base library, we'd look at whether we could share any data
> across sketches. But we're far from that point currently. It'd be nice to
> someday need to consider that.
>
>   jon
>
> On Sun, May 10, 2020, 7:33 PM leerho <le...@gmail.com> wrote:
>
> Michael,  this has been a great interchange and certainly will allow us to
> move forward more quickly.
>
> Thank you for working on this on a Mother's Day Sunday!
>
> I'm sure Alex and Jon may have more questions, when they get a chance to
> look at it starting tomorrow.
>
> Cheers, and be safe and well!
>
> Lee.
>
> On Sun, May 10, 2020 at 6:25 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Re: testing, so far I've just done glorified unit tests for uniform and
> normal distributions of varying sizes.  I plan to do some timing tests vs
> the existing single-sketch Python class to see how it compares for 1, 10,
> and 100 streams.
>
> 1. That makes sense.  One option to allow full Numpy compatibility but
> without requiring a Python user to use Numpy would be to return everything
> as lists, rather than Numpy arrays.  Numpy users could then convert those
> lists into arrays, and non-Numpy users would be unaffected (aside from
> needing the pybind11/numpy.h header).  Alternatively, some flag could be
> set when instantiating the object that would control whether things are
> returned as lists or arrays, though this still requires the numpy.h header
> file.
>
> 2. I didn't change the kll_sketch code, I only defined a new (wrapper)
> class called kll_sketches, which spawns a user-specified number of
> sketches.  Each of those sketches is a kll_sketch object and uses all of
> the existing code for that.  For fast execution in Python, the parallel
> sketches must be spawned in C++, but the existing Python object could only
> spawn a single sketch since it wraps the kll_sketch class.  Perhaps the
> kll_sketches class would be better placed in the python/src/kll_wrapper.cpp
> file?  I suppose you wouldn't need this class if you weren't using Python.
>
> 3. Yes, SerDe is very straight-forward here.  I've marked some stuff as
> todo's, and that is one of them -- the plan is to do like you described and
> call the relevant kll_sketch method on each of the sketches and return that
> to Python in a sensible format.  For deserialization, it would just iterate
> through them and load them into the kll_sketches object.  I don't require
> it for my project, so I didn't bother to wrap that yet -- I'll take a look
> sometime this week after I finish my work for the day, shouldn't take long
> to do.
>
> 4. That makes sense.  Does using Numpy complicate that at all?  My thought
> is that since under the hood everything is using the existing kll_sketch
> class, it would have full compatibility with the rest of the library (once
> SerDe is added in).
>
> Michael
> ------------------------------
> *From:* leerho <le...@gmail.com>
> *Sent:* Sunday, May 10, 2020 8:42 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Thanks for the link to your code.  My colleagues, Jon and Alex, will take
> a closer look this next week.  They wrote this code so they are much closer
> to it than I.
>
> What you have done so far makes sense for you as you want to get this
> working in the NumPy environment as quickly as possible.  As soon as we
> start thinking about incorporating this into our library other concerns
> become important.
>
> 1. Adding API calls is the recommended way to add functionality (like
> NumPy) to a library.  We cannot change API calls in a way that is only
> useful with NumPy, because it would seriously impact other users of the
> library that don't need NumPy.  If both sets of calls cannot simultaneously
> exist in the same sketch API, then we need to consider other alternatives.
>
> 2.  Based on our previous discussions, I didn't envision that you would
> have to change the kll_sketch code itself other than perhaps a "wrapper"
> class that enables vectorized input to a vector of sketches and a
> vectorized get result that creates a vector result from a vector of
> sketches.  This would isolate the changes you need for NumPy from the
> sketch itself.  This is also much easier to support, maintain and debug.
>
> 3. If you don't change the internals of the sketch then SerDe becomes
> pretty straightforward. I don't know if you need a single serialization
> that represents a full vector of sketches,  but if you do, then I would
> just iterate over the individual serdes and figure out how to package it.
> I really don't think you want to have to rewrite this low-level stuff.
>
> 4. Binary compatibility is critically important for us and I think will be
> important for you as well.  There are two dimensions of binary
> compatibility: history and language.  This means that a kll sketch
> serialized from Java can be successfully read by C++ and vice versa.
> Similarly, a kll sketch serialized today will be able to be read many years
> from now.     Another aspect of this would mean being able to collect, say,
> 100 sketches that were not created using the NumPy version, and being able
> to put them together in a NumPy vector; and vice versa.
>
> I hope all of this makes sense to you.
>
> Cheers,
>
> Lee.
>
>
>
> On Sun, May 10, 2020 at 4:21 PM leerho <le...@gmail.com> wrote:
>
> Michael,
> This is great!  What testing have you been able to do so far?
>
>
> On Sun, May 10, 2020 at 3:31 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Lee,
>
> Thanks for all of that information, it's quite helpful to get a better
> understanding of things.
>
> I've put the code on Github if you'd like to take a look:
> https://github.com/mdhimes/incubator-datasketches-cpp
>
> Changes are
> - new class in kll/include/kll_sketch.hpp, w/ associated constructor in
> kll/include/kll_sketch_impl.hpp.  This class spawns a specified number of
> sketches.
> - new Python interface functions in python/src/kll_wrapper.cpp
>
> The only new dependency introduced is the pybind11/numpy.h header file.
> The new Numpy-compatible Python classes retain identical functionality to
> the existing classes (with minor changes to method names, e.g.,
> get_min_value --> get_min_values), except that I have not yet implemented
> merging or (de)serialization.  These would be straight-forward to
> implement, if needed.
>
> Re: characterization tests, I'll take a look at those tests you linked to
> and see about running them, time and compute resources permitting.
>
> Michael
> ------------------------------
> *From:* leerho <le...@gmail.com>
> *Sent:* Sunday, May 10, 2020 5:32 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Michael,
>
> Is there a place on GitHub somewhere where I could look at your code so
> far?  The reason I ask, is before you do a PR, we would like to determine
> where a contribution such as this should be placed.
>
> Our library is split up among different repositories, determined by
> language and dependencies.  This keeps the user downloads smaller and more
> focused.   We have two library repos for the core sketch algorithms, one
> for Java and one for C++/Python, where the dependencies are very lean,
> which simplifies integration into other systems.  We have separate repos
> for adaptors, which depend on one of the core repos. On the Java side, we
> have separate repos for adaptors for Apache Hive and Apache Pig, as the
> dependencies for each of these are quite large.  For C++, we have a
> dedicated repo for the adaptors for PostgreSQL.
>
> Some of our adaptors are hosted with the target system.  For example, our
> Druid adaptors were contributed directly into Apache Druid.
>
> I assume your code has dependencies on Python, NumPy and DataSketches-cpp.
> It is not clear to me at the moment whether we should create a separate
> repo for this or have a separate group of directories in our cpp repo.
>
> ****
> We have a separate repo for our characterization code, which is not
> formally "released" as an Apache release.  It exists because we want others
> to be able to reproduce (or challenge) our claims of speed performance or
> accuracy.  It is the one repo where we have all languages and many
> different dependencies.  The coding style is not as rigorous or as well
> documented as our repos that do have formal releases.
>
> Characterization testing is distinctly different from Unit Tests, which
> basically check all the main code paths and make sure that the program
> works as it should.  The key metric is code coverage, and Unit Tests should
> be fast as they are run on every check-in of new code.  Characterization is
> also different from Integration Testing, which is testing how well the code
> works when integrated into larger systems.
>
> Characterization tests are unique to our kind of library. Because our
> algorithms are probabilistic in nature, in order to verify accuracy or
> speed performance we need to run many thousands of trials to eliminate
> statistical noise in the results.  And when the data is large, this can
> take a long time.  You can peruse our website for many examples as all the
> plots result from various characterization studies.  What appears on the
> website is but a small fraction of all the testing we have done.
>
> There are no "standard" tests as every sketch is different so we have to
> decide what is important to measure for a particular sketch, but the basic
> groups are *speed* and *accuracy*.
>
> For speed there are many possible measurements, but the basic ones are
> update speed, merge speed, Serialization / Deserialization speed, get
> estimate or get result speeds.
>
> For accuracy we want to validate that the sketch is performing within the
> bounds of the theoretical error distribution.  We want to measure this
> accuracy in the context of a stand-alone, purely streaming sketch and also
> in the context of merging many sketches together.
>
> We also try to do these same tests comparing the results against other
> alternatives users might have.  We have performed these same
> characterizations on other publicly available sketches as well as against
> traditional, brute-force approaches to solving the same problem.
>
> For the solution you have developed, we would depend on you to decide what
> properties would be most important to characterize for users of this
> solution.  It should be very similar to what you would write in a paper
> describing this solution;  you want to convince the reader that this is
> very useful and why.
>
> Since the first sketch you have leveraged is the KLL quantiles sketch, I
> would think you would want some characterizations similar to what we did
> for our studies
> <https://datasketches.apache.org/docs/Quantiles/KLLSketch.html>
> comparing our older quantiles sketch and the KLL sketch.
>
> ****
> For the Java characterization tests, we have "standardized" on having
> small configuration files which define the key parameters of the test.
> These are simple text files
> <https://github.com/apache/incubator-datasketches-characterization/tree/master/src/main/resources>
> of key-value pairs.  We don't have any centralized definition of these
> pairs, just that they are human readable and intelligible.  They are
> different for each type of sketch.
>
> For the C++ tests, we don't have a collection of config files yet (this is
> one of our TODOs), but the same kind of parameters are set in the code
> itself.
>
> We will likely want to set up a separate directory for your
> characterization tests.
>
> I hope you find this helpful.
>
> Cheers,
>
> Lee.
>
> On Sun, May 10, 2020 at 10:05 AM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> The code is in a good state now.  It can take individual values, lists, or
> Numpy arrays as input, and it returns Numpy arrays.  There are some
> additional features, like being able to specify which sketches the user
> wants to, e.g., get quantiles for.
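>
> A rough usage sketch (class and method names here may not exactly match the
> current draft and may still change):
>
> import numpy as np
> from datasketches import kll_sketches    # draft wrapper class
>
> kll = kll_sketches(160, 3)               # k=160, three independent streams
> for _ in range(1000):
>     kll.update(np.random.randn(3))       # one value per stream
> kll.update([0.0, 1.0, 2.0])              # plain lists are also accepted
>
> medians = kll.get_quantiles(0.5)         # Numpy array of per-stream medians
> first_two = kll.get_quantiles(0.5, isk=[0, 1])  # subset of sketches; argument name illustrative only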
>
> But, I have only done minor testing with uniform and normal
> distributions.  I'd like to put it through more extensive testing (and some
> documentation) before releasing it, and it sounds like your
> characterization tests are the way to go -- it's not science if it's not
> reproducible!  Is there a standard set of tests for this purpose?  If not,
> are there standard tests that have been used for the existing codebase?
>
> Michael
> ------------------------------
> *From:* leerho <le...@gmail.com>
> *Sent:* Saturday, May 9, 2020 7:21 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> This is great.  The first step is to get your project working!  Once you
> think you are ready, it would be really useful if you could do some
> characterization testing in the NumPy environment. Characterization tests
> are what we run to fully understand how a sketch performs over a range of
> parameters and using thousands to millions of trials.  You can see some of
> the accuracy and speed performance plots of various sketches on our
> website.  Sometimes these can take hours to run.  We typically use
> synthetic data to drive our characterization tests to make them
> reproducible.
>
> Real data can also be used and one comparison test I would recommend is
> comparing how long it takes to get approximate results using sketches
> versus how long it would take to get exact results using brute force
> methods.  The bigger the data set is the better :)
>
> We don't have much experience with NumPy so this will be a new environment
> for us.  But before you get too deep into this please get us involved.  We
> have been characterizing these streaming algorithms for a number of years,
> and would like to help you.
>
> Cheers,
>
> Lee.
>
> On Sat, May 9, 2020 at 2:18 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> I'm not quite sure what being a committer entails, but yeah I'm happy to
> contribute.  I can't commit a lot of time to working on it, but with how
> things went for KLL I don't think it will take a lot of time for the other
> sketches if they are formatted in a similar manner.  Getting this library
> integrated into numpy/scipy would be awesome, I'm sure I could get some
> others in my field to begin using it.
>
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Saturday, May 9, 2020 5:06 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>; dev@datasketches.apache.org
> <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> This is just awesome!   Would you be interested in becoming a committer on
> our project?  It is not automatic, but we could work with you to bring you
> up to speed on the other sketches in the library.  If you could help us
> integrate DataSketches into NumPy and possibly SciPy (not sure if this is
> necessary) it would be a very significant contribution and we would
> definitely want you to be part of our community!
>
> Thanks,
>
> Lee.
>
> On Sat, May 9, 2020 at 1:41 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Lee,
>
> Thanks for the notice, I went ahead and subscribed to the list.
>
> As for Jon's email, this is actually what I have currently implemented!
> Once I finish ironing out a couple improvements, I'm going to move some
> code around to follow the existing coding style, put it on Github, and
> submit a pull request.
>
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Saturday, May 9, 2020 4:22 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>
> *Subject:* Fwd: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Hi Michael,
> I don't think you saw this email as I doubt you are subscribed to our
> dev@datasketches.apache.org email list.
>
> We would like to have you as part of our larger community, as others might
> also have suggestions on how to move your project forward.
> You can subscribe by sending an empty email to
> dev-subscribe@datasketches.apache.org.
>
> Lee.
>
> ---------- Forwarded message ---------
> From: *Jon Malkin* <jo...@gmail.com>
> Date: Thu, May 7, 2020 at 4:11 PM
> Subject: Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
> To: <de...@datasketches.apache.org>
> Cc: Lee Rhodes <lr...@verizonmedia.com>, Edo Liberty <
> edo.liberty@gmail.com>, edo@edoliberty.com <ed...@edoliberty.com>
>
>
> We're using pybind11 to get a C++ interface with python (vs raw C). The
> wrappers themselves are quite thin, but they do have examples of calling
> functions defined in the wrapper as opposed to only the sketch object.
>
> I believe the easiest way to do this will be to define a pretty simple C++
> object and create a pybind wrapper for it.  That object would contain a
> std::vector<kll_sketch>.  Then you'd define an update method for your
> custom object that iterates through a numpy array and calls update() on the
> appropriate sketch. You'd also want to define something similar for
> get_quantile() or whatever other methods you need that iterates through
> that vector of sketches and returns the result in a numpy array.
>
> That's a pretty lightweight object. And then you'd use a similar thin
> pybind wrapper around it to make it play nicely with python. Since our C++
> library is just templates, you'd end up with a free-standing library, with
> no requirement that the base datasketches library be involved.
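>
> A minimal sketch of that idea (the vector_kll name and module name are
> placeholders, and details would differ in a real implementation):
>
> #include <cstdint>
> #include <vector>
> #include <pybind11/pybind11.h>
> #include <pybind11/numpy.h>
> #include "kll_sketch.hpp"
>
> namespace py = pybind11;
> using datasketches::kll_sketch;
>
> struct vector_kll {
>   vector_kll(uint16_t k, uint32_t d) : sketches(d, kll_sketch<float>(k)) {}
>
>   // items is a 1-d numpy array of length d: one value per sketch
>   void update(py::array_t<float> items) {
>     auto in = items.unchecked<1>();
>     for (size_t i = 0; i < sketches.size(); ++i) sketches[i].update(in(i));
>   }
>
>   // returns one quantile per dimension as a numpy array
>   py::array_t<float> get_quantile(double rank) const {
>     py::array_t<float> result(static_cast<py::ssize_t>(sketches.size()));
>     auto out = result.mutable_unchecked<1>();
>     for (size_t i = 0; i < sketches.size(); ++i) out(i) = sketches[i].get_quantile(rank);
>     return result;
>   }
>
>   std::vector<kll_sketch<float>> sketches;
> };
>
> PYBIND11_MODULE(vector_kll_example, m) {
>   py::class_<vector_kll>(m, "vector_kll")
>     .def(py::init<uint16_t, uint32_t>())
>     .def("update", &vector_kll::update)
>     .def("get_quantile", &vector_kll::get_quantile);
> }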
>
>   jon
>
> On Thu, May 7, 2020 at 1:08 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> I would be happy to share whatever I come up with (if anything).  The lack
> of a Numpy/Scipy implementation is what led me to the DataSketches library;
> it would be very useful to me and others if it were a part of
> Numpy/Scipy.
>
> For what it's worth, passing in a Numpy array and manipulating it from the
> C++ side is quite easy.  On the other hand, figuring out how to spawn m
> sketches and pass the values along to that looks like it'll be more
> challenging; there is a lot of code here and it'll take some time for me to
> familiarize myself with it.
>
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Thursday, May 7, 2020 12:00 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>
> *Cc:* Edo Liberty <ed...@gmail.com>; dev@datasketches.apache.org <
> dev@datasketches.apache.org>; edo@edoliberty.com <ed...@edoliberty.com>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> If you do figure out how to do this, it would be great if you could share
> it with us.  We would like to extend it to other sketches and submit it as
> added functionality to NumPy.  I have been looking at the NumPy and
> SciPy libraries and have not found anything close to what we have.
>
> Lee.
>
>
> On Thu, May 7, 2020 at 7:08 AM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Lee, Jon,
>
> Thanks for the information.  I tried to vectorize things this morning and
> ran into that exact problem -- since the offsets can differ, it leads to
> slices of different lengths, which wouldn't be possible to store as a
> single Numpy array.
>
> Lee, your understanding of my problem is spot on.  n vectors of size m,
> where all m elements of each vector are a float (no NaNs or missing
> values).  I am interested in quantiles at rank r for each of the m
> streams.  Only 1 sketch will operate simultaneously, saving/loading the
> sketch is not required (though it would be a nice feature), and sketches
> would not need to be merged (no serialization/deserialization).
>
> Not surprisingly, it looks like your original suggestion of handling this
> on the C++ side is the way to go.  Once I have time to dive into the code,
> my plan is to write something that implements what you described in the
> earlier email.
>
> Thanks,
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Wednesday, May 6, 2020 10:43 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>
> *Cc:* dev@datasketches.apache.org <de...@datasketches.apache.org>; Edo
> Liberty <ed...@gmail.com>; edo@edoliberty.com <ed...@edoliberty.com>
>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Michael,
>
> One of my colleagues, Jon Malkin, pointed out that the vector-KLL will not
> work for another reason: for each dimension, choosing whether to
> delete the odd or even values in the compactor must be random and
> independent of the other dimensions.  Otherwise you might get unwanted
> correlation effects between the dimensions.
>
> This is another argument that you should have independent compactors for
> each dimension.  So you might as well stick with individual sketches for
> each dimension.
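>
> As a toy illustration of that requirement (not anything in either codebase),
> a vectorized compaction step would need an independent odd/even choice per
> column, e.g. in NumPy:
>
> import numpy as np
>
> rng = np.random.default_rng()
>
> def compact_columns(buf):
>     # buf: (n x d) buffer of values, n assumed even, one column per dimension
>     buf = np.sort(buf, axis=0)                       # compaction works on sorted values
>     offsets = rng.integers(0, 2, size=buf.shape[1])  # independent coin flip per column
>     rows = np.arange(0, buf.shape[0], 2)             # positions 0, 2, 4, ...
>     return buf[rows[:, None] + offsets, np.arange(buf.shape[1])]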
>
> Lee.
>
> On Wed, May 6, 2020 at 4:39 PM Lee Rhodes <lr...@verizonmedia.com>
> wrote:
>
> Michael,
>
> Allow me to back up for a moment to make sure I understand your problem.
>
> You have a large number of large vectors of the form *V_n = {x_i}*: *n*
> vectors of size *m*, where *x* is a *number* and *x_i* is the *i*th
> element, or equivalently, the *i*th dimension.
>
> Assumptions:
>
>    - All vectors, *V*, are of the same size *m.*
>    - All elements, *x_i*, are valid numbers of the same type. No missing
>    values, and if you are using *floats*, this means no *NaN*s.
>
> In aggregate, the *n* vectors represent *m* *independent* distributions
> of values.
>
> Your task is to be able to obtain *m* quantiles at rank *r* in a single
> query.
>
> ****
> To do this, using your idea, would require vectorization of the entire
> sketch and not just the compactors.  The inputs are vectors, the result of
> operations such as getQuantile(r), getQuantileUpperBound(r),
> getQuantileLowerBound(r), are also vectors.
>
> This sketch will be a large data structure, which leads to more questions
> ...
>
>    - Do you anticipate having many of these vectorized sketches operating
>    simultaneously?
>    - Is there any requirement to store and later retrieve this sketch?
>    - Or, the nearly equivalent question: Do you require merging of these
>    sketches (across clusters, for example)?  Which also means serialization
>    and deserialization.
>
> I am concerned that this vector-quantiles sketch would be limited in the
> sense that it may not be as widely applicable as it could be.
>
> Our experience with real data is that it is ugly with missing values, NaN,
> nulls, etc.  Which means we would not be able to vectorize the compactor.
> Each dimension *i* would need a separate independent compactor because
> the compaction times will vary depending on missing values or NaNs in the
> data.
>
> Spacewise, I don't think having separate independent sketches for each
> dimension would be much larger than vectorizing the entire sketch, because
> the internals of the existing sketch are already quite space efficient
> leveraging compact arrays, etc.
>
> As a first step I would favor figuring out how to access the NumPy data
> structure on the C++ side, having individual sketches for each
> dimension, and doing the iterations updating the sketches in C++.   It also
> has the advantage of leveraging code that exists and it would automatically
> be able to leverage any improvements to the sketch code over time.  In
> addition, it could be a prototype of how to integrate other sketches into
> the NumPy ecosystem.
>
> A fully vectorized sketch would be a separate implementation and would not
> be able to take advantage of these points.
>
> Lee.
>
> On Wed, May 6, 2020 at 2:47 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Lee,
>
> I don't think there is a problem with the DataSketches library, just that
> it doesn't support what I am trying to do -- looking in the documentation,
> it only supports streams of ints or floats, and those situations work fine
> for me.  Here's what I did:
> - began with the KLL test .py file:
> https://github.com/apache/incubator-datasketches-cpp/blob/master/python/tests/kll_test.py
> - replaced line 30 with kll.update(np.ones(10) * randn())  to have a Numpy
> array of 10 identical values.
> - ran the code
>
> This leads to the following error, as expected:
> TypeError: update(): incompatible function arguments. The following
> argument types are supported:
>     1. (self: datasketches.kll_floats_sketch, item: float) -> None
>
> Invoked with: <datasketches.kll_floats_sketch object at 0x7f1e128989d0>,
> array([-1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424,
>        -1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424])
>
> It's not coded to support Numpy arrays, therefore it complains.  What I
> would ideally like to have happen in this scenario is it would treat each
> element in the array as a separate stream.  Then, later when getting a
> given quantile, it would give 10 values, one for each stream.  I don't see
> an easy approach to implementing this on the Python side besides a very
> slow iterative approach, and admittedly my C++ is quite rusty so I haven't
> looked into the codebase to see how I might modify things there to support
> this functionality.
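>
> For reference, the slow iterative workaround with the existing classes would
> look something like this (a plain Python loop over one kll_floats_sketch per
> vector element):
>
> import numpy as np
> from datasketches import kll_floats_sketch
>
> d = 10
> sketches = [kll_floats_sketch(160) for _ in range(d)]   # one sketch per element
>
> for _ in range(100000):                     # the stream of vectors
>     vec = np.random.randn(d)
>     for sk, x in zip(sketches, vec):        # Python-level loop: the slow part
>         sk.update(float(x))
>
> medians = np.array([sk.get_quantile(0.5) for sk in sketches])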
>
> Re: the streaming-quantiles code being easily modified, I believe the only
> necessary changes would be changing the Compactor class to be a subclass of
> numpy.ndarray, rather than list, and implementing the list-specific
> methods that are used, like .append().  Then, it isn't
> necessary to loop over the streams since we can make use of Numpy's
> broadcasting, which will handle the looping in its C++ code, as you
> mentioned.  I'll work on this and see if it really is as straight-forward
> as it seems.
>
> If you have any advice on how to use DataSketches for my problem, I'm
> certainly open to that.
>
> Thanks,
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Wednesday, May 6, 2020 4:37 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>; dev@datasketches.apache.org
> <de...@datasketches.apache.org>
> *Cc:* Edo Liberty <ed...@gmail.com>; edo@edoliberty.com <
> edo@edoliberty.com>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Michael,
>
> Thank you for considering the DataSketches library.   I am adding this
> thread to our dev@datasketches.apache.org so that our whole team can
> contribute to finding a solution for you.
>
> WRT the error you experienced, please help us help you by sharing with us
> what the exact error was.
>
> We are about to release a major upgrade to the DataSketches C++/Python
> product in the next few weeks.  We have fixed a number of stability issues
> and bugs, which may solve the problem.  Nonetheless, we want to work with
> you to get your problem solved.
>
> Updating 1e5 sketches in a system is not a problem in Java or C++.   We
> have real-time systems today that generate and process over 1e9 sketches
> every day.  Unfortunately our experience tells us that looping in Python
> code will be 10 to 100 times slower than Java or C++.  This is because the
> code would have to switch from Python to C++ for every vector element.
>
> By comparison, the streaming-quantiles code could be easily modified to
> use Numpy arrays and operate on vectors.
>
>
> I would like to understand more about what you have in mind that would be
> "easily modified".
>
> NumPy achieves its speed performance by doing all of the matrix operations
> in pre-compiled C++ code.  To achieve best performance, we would want to
> read and loop through the NumPy data structure on the C++ side leveraging
> the C++ DataSketches library directly.  I am not sure what would be
> involved to actually accomplish that.
>
> But first we need to get your Python + NumPy code working correctly with
> our library so we can find out what its actual performance is.
>
> Cheers,
>
> Lee.
>
>
>
>
>
> On Wed, May 6, 2020 at 12:10 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Edo, Lee,
>
> Thanks for the prompt response.  I looked at the datasketches library, and
> while it seems to have a lot more features, it looks like it'll be a lot
> more difficult to get it to work for my desired use case.
>
> My problem is that I need quantiles for each element of a vector (length
> on the order of 1e4 -- 1e5), for some finite stream of vectors (on the
> order of 1e6 -- 1e8).  I tried using datasketches's KLL with Numpy arrays,
> but it throws an error, so it doesn't seem like datasketches handles this
> situation currently.
>
> To use datasketches, I think I would need to instantiate 1 object per
> vector element, and I suspect this will slow things down considerably due
> to iterating over the objects when each vector is processed.  By
> comparison, the streaming-quantiles code could be easily modified to use
> Numpy arrays and operate on vectors.  I ran a few unit tests on both codes
> and found equivalent behavior, as expected.
>
> Do you have any recommendation(s) for this situation?  Are there known
> limitations of the streaming-quantiles code that would cause issues for my
> use case?  Are the other methods offered in datasketches 'better' than the
> KLL implemented in streaming-quantiles?  I'm quite out of my area of
> expertise, so I appreciate any advice you can offer, and I will of course
> acknowledge it in the publication.
>
> Best,
> Michael
>
> ------------------------------
> *From:* Edo Liberty <ed...@gmail.com>
> *Sent:* Tuesday, May 5, 2020 8:09 PM
> *To:* Lee Rhodes <lr...@verizonmedia.com>; Michael Himes <
> mhimes@knights.ucf.edu>
> *Cc:* edo@edoliberty.com <ed...@edoliberty.com>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> +Lee
>
> Hi Michael, Thanks for reaching out.
> While you can certainly do that, I recommend using the Python-bound
> datasketches library. It will be more robust, faster, and bug free than my
> code :)
>
> On Tue, May 5, 2020 at 14:11 Michael Himes <mh...@knights.ucf.edu> wrote:
>
> Hi Edo,
>
> I'm currently working on a Python package for machine-learning-accelerated
> exoplanet modeling.  It is free and open source (see here if you're curious
> https://github.com/exosports/HOMER),
> and it's meant purely for reproducible academic research.
>
> I'm adding some new features to the software, and one of them requires
> computing quantiles for a data set that cannot fit into memory.  After
> searching around for different methods to do this, your KLL method seemed
> to be a good option in terms of speed and space requirements.
>
> Rather than reinvent the wheel and code my own implementation of the
> method from scratch, I was wondering if you'd be willing to allow me to use
> your code?  I don't see a license, so I wanted to make sure you're okay
> with this.  I could implement it as a submodule within my repo, or I could
> only include the kll.py file and add some additional comments pointing to
> your repo and such, whichever you prefer.
>
> Best,
> Michael
>
> --
> From my cell phone.
>
>

Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Posted by Jon Malkin <jo...@gmail.com>.
Interesting. The interface wasn't supposed to have changed. I'll take a
look.

  jon

On Mon, May 25, 2020, 1:36 PM Michael Himes <mh...@knights.ucf.edu> wrote:

> Hi Jon,
>
> Just got around to testing it out.  Maybe I am doing something wrong here,
> but I can't get the code to work correctly.  Here's the code:
>
> import numpy as np
> from datasketches import kll_floatarray_sketches
> k = 160
> d = 3
> kll = kll_floatarray_sketches(k, d)
> for i in range(1e6):
>   kll.update(np.random.randn(d))
>
> And here's the error:
>
> TypeError: 'float' object cannot be interpreted as an integer
>
> Seems like the inputs have changed, but the inputs in the code look pretty
> similar.  Can you point out what I'm doing wrong here?
>
> Michael
> ------------------------------
> *From:* Jon Malkin <jo...@gmail.com>
> *Sent:* Friday, May 22, 2020 6:21 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Hi Michael,
>
> My default is to treat an input vector x as a column vector
> -- the generic quadratic form x' A x assumes that, for instance. But that might
> be an engineering thing. I'm following your approach for now, and eventually we
> can debate whether to transpose the matrix if one dimension matches the
> number of sketches in the object but not the expected one.
>
> Anyway, I looked more at the docs and see them using unchecked references
> (after doing a bounds check) so I switched to that, and then I added in a
> check for c-style vs fortran-style indexing so that I believe it'll have
> the inner loop over the native dimension. In theory it'll walk linearly
> through the matrix. That or I got it exactly backwards and am thrashing
> some cache level, one of the two :D
>
> If you have some time, please check out the branch and play with it for a
> bit to ensure it's still behaving as you expect. Then we can figure out
> some relevant unit tests.
>
>   jon
>
>
>
> On Fri, May 22, 2020 at 7:06 AM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Jon,
>
> Those changes sound great, as long as the data is being accessed
> correctly. The pybind docs warn about accessing data through the array_t
> object since it's not guaranteed to be contiguous in memory.  Typically,
> they demonstrate accessing it through the buffer, which I followed.  But if
> this is an unnecessary step, then great.
>
> As for the 2D case, here is my line of thinking.  For 1D, we have a single
> row with d values.  So for 2D, we'd have n rows with d values, (n x d).  I
> believe that is how I coded it, but it's possible I flipped the dimensions.
>
> Michael
>
> ------------------------------
> *From:* Jon Malkin <jo...@gmail.com>
> *Sent:* Thursday, May 21, 2020 7:17 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Michael,
>
> I've restructured the object to be an actual C++ object with proper
> methods. And then I've gotten rid of all the casts to buffer in favor of
> just using the py::array_t<> that's passed in. That removes casting
> everything to double, and allows for range checks. Now an attempt to access
> sketch 7 in a 5-d array doesn't just segfault :)
>
> Looking at pybind docs a bit more, it seems there are no hard guarantees
> on data layout in memory with numpy arrays -- if you transpose one, walking
> through with a pointer will return items in the wrong order. So update()
> ends up using items.at()
> instead (more on that in a moment). The whole thing is probably also
> copying values around more than necessary. Anyway, we can look at ways to
> optimize such things eventually, but for now I'm working on ensuring
> correctness and at least somewhat graceful failure.
>
> Anyway, item input order. If we have 1-d input, we implicitly assume we
> want d updates, one for each dimension in the object. It seems like the
> default for numpy is row-major order, which makes sense given C beneath the
> hood. But for inputting n points at a time, do you expect the matrix to be
> (d x n) or (n x d)?
>
>   jon
>
> On Tue, May 19, 2020, 5:20 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Re: the template type A, I set that for the Python array data type.  A
> Python float is 64 bits, so that is a C++ double.  I thought it was
> necessary to set the py::array_t data type since I think it's a template,
> but I could be mistaken.
>
> Michael
>
> ------------------------------
> *From:* leerho <le...@gmail.com>
> *Sent:* Tuesday, May 19, 2020 7:46 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Excellent work!
>
> On Tue, May 19, 2020 at 4:04 PM Jon Malkin <jo...@gmail.com> wrote:
>
> I also used k=160, so in this case we matched nicely. And the bunches of
> 2^5 or 2^7 you were testing are exactly what I meant when referring to
> batched inputs. So that's good news.
>
> I'll take a more careful look through the code -- there was something with
> update using arrays of templated type A which was always cast to double,
> for instance. But this is certainly promising.
>
>   jon
>
> On Tue, May 19, 2020 at 3:32 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Great tests (especially with the ordering), Jon!
>
> I did some scaling tests for dimensionality (1, 10, and 100), and this is
> where I think the Numpy version shows its benefits.  I performed a test
> similar to your setup:
> - each sketch has k = 160 (unsure what you used for this value, if it
> matters)
> - 2^25 draws from a normalized Gaussian distribution (numpy.random.normal)
> - get_quantiles(0.5)
>
> d=1    -- 84 s (this is the 123 s case you ran)
> d=10   -- 88 s
> d=100  --  294 s
> d=1000 -- 2298 s (did this one for fun, but there is a lot of variability
> in runtime)
>
> Note that I did not use a single-value method, just the Numpy version.
> Also, I checked the compute cost of the Python loop, and it's about 1
> second, so most of that ~80 seconds is the communication between Python and
> C++.  The scaling relation looks to be better than linear, but there need
> to be a few more tests here to really determine that.
>
> But, as Lee pointed out, there is non-negligible overhead from crossing
> the bridge between Python and C++.  It's small, but when doing it 2^25
> times it adds up.  The Numpy implementation allows you to cross that bridge
> much less often, albeit at the cost of some extra time programming that
> part.  If I set up a queue that holds 2^5 values and then updates it, it's
> quite a bit better.  Here are the results for the same dimensions as before:
>
> d=1   -- 8 s
> d=10  -- 31 s
> d=100 -- 257 s
>
> So, even with a small queue of 32 values, we see that a single sketch
> using kll_sketches is faster than a kll_sketch by a factor of 2-3.  And
> with the batch set to 2^7 values (this is how I use it in my project):
> d=1   -- 4.2 s
> d=10  -- 27 s
> d=100 -- 251 s
>
> The speed gain doesn't seem to scale with dimensionality, but I think that
> has more to do with the compute overhead of generating the data since Numpy
> tends to be faster when working in 1D vs multiple dimensions.  But we can
> see that it's possible to get runtimes much closer to C++ runtimes than
> would be expected.
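>
> For reference, the batching looks roughly like this (names and signatures
> here may not exactly match the branch):
>
> import numpy as np
> from datasketches import kll_sketches     # experimental vectorized wrapper
>
> k, d, batch, n = 160, 100, 2**7, 2**25
> kll = kll_sketches(k, d)
>
> queue = []
> for _ in range(n):
>     queue.append(np.random.randn(d))      # one incoming vector at a time
>     if len(queue) == batch:               # cross the Python/C++ boundary once per batch
>         kll.update(np.asarray(queue))
>         queue = []
> if queue:                                 # flush the remainder
>     kll.update(np.asarray(queue))
>
> medians = kll.get_quantiles(0.5)          # per-stream medians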
>
> Michael
> ------------------------------
> *From:* Jon Malkin <jo...@gmail.com>
> *Sent:* Tuesday, May 19, 2020 4:58 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Well, one thought was maybe we could always use the vectorized kll in
> python and make it (relatively) easy to have it work with only 1 dimension.
> It looks like there's still a non-trivial performance hit from that. But
> wow.. I realized I could try something simple like reversing the
> declaration order of single-update vs vector-update in the wrapper class.
> And that dropped it to 35s!
>
> With that, it may be worth exploring a unified wrapper that handles single
> items or vectors.
>
>   jon
>
> On Tue, May 19, 2020 at 1:52 PM leerho <le...@gmail.com> wrote:
>
> We had a similar issue in Java trying to use JNI to access C code.  Every
> transition across the "boundary" between Java and C took from 10 to 100
> microseconds.  This made the JNI option pretty useless from our
> standpoint.
>
> I don't know python that well, but I could well imagine that there may be
> a similar issue here in moving data between Python and C++.
>
> That being said, compared to brute-force computation of these types of
> queries in Python, using even these (what we consider slow-performing)
> sketches in Python may still be a huge win.
>
> Lee.
>
>
>
> On Tue, May 19, 2020 at 1:28 PM Jon Malkin <jo...@gmail.com> wrote:
>
> I tried comparing the performance of the existing floats sketch vs the new
> thing with a single dimension. And then I made a second update method that
> handles a single item rather than creating an array of length 1 each time.
> Otherwise, the scripts were as identical as possible. I fed in 2^25
> gaussian-distributed values and queried for the median to force some
> computation on the sketch. I think get_quantile(0.5) vs
> get_quantiles(0.5)[0][0] was the only difference.
>
> Existing kll_floats_sketch: 31s
> kll_floatarray_sketches: 123s
> with single-item update: 80s
>
> Same test in c++: 1.7s  (I can get it to 1.4s but that's using a worse RNG
> so this seemed more fair)
>
> I didn't try batching updates, even though in theory
> the new object can support that. This was more a test to see the
> performance impact of using it for all kll sketches.
>
> At some level, if you're already ok taking the speed hit for python vs C++
> then maybe it doesn't matter. But >2x still seems significant to me.
>
>   jon
>
> On Thu, May 14, 2020 at 6:54 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Great, I'll be submitting the pull request shortly.  The codebase I'm
> working with doesn't have any of the changes made in the past week or so;
> hopefully that isn't too much of a hassle to merge.
>
> As an aside, my employer encourages us to contribute code to libraries
> like this, so I'm happy to work on additional features for the Python
> interface as needed.
>
> Michael
> ------------------------------
> *From:* Jon Malkin <jo...@gmail.com>
> *Sent:* Thursday, May 14, 2020 6:56 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> We've been polishing things up for a release, so that was one of several
> things that we fixed over the last several days. Thank you for finding it!
>
> Anyway, if you're generally happy with the state of things (and are
> allowed to under any employment terms), I'd encourage you to create pull
> request to merge your changes into the main repo. It doesn't need to be
> perfect as we can always make changes as part of the PR review or
> post-merge.
>
> Thanks,
>   jon
>
>
> On Mon, May 11, 2020 at 2:25 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Thanks for taking a look, Jon.
>
> I pushed an update that address 2 & 4.
>
> #3 is actually something I had a question about. I've tested passing
> numpy.nan into the update function, and it doesn't appear to break anything
> (min, max, etc all still work correctly).  However, the reported number of
> items per sketch counts the nan entries.  Is this the expected behavior, or
> should the get_n() method return a number that does not count the nans it
> has seen?  I expected the latter, so I'm worried that numpy's nan is being
> treated differently.
>
> Michael
> ------------------------------
> *From:* Jon Malkin <jo...@gmail.com>
> *Sent:* Monday, May 11, 2020 4:32 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> I didn't look in super close detail, but the code overall looks pretty
> good. Comments are below.
>
> Note that not all of these necessarily need changes or replies. I'm just
> trying to document things we'll want to think about for keeping the library
> general-purpose (and we can always make changes after merging, of course).
>
> 1. I worry the name kll_sketches is confusingly similar to kll_sketch.
> Maybe vector_kll_sketches? But if there's a way to extend KLL in the future
> to operate on an entire vector at a time (vs treating each dimension
> independently) that'd become confusing. I think an inherently vectorized
> version would be a very different beast, but I always worry I'm not being
> imaginative enough. If merging into the Apache codebase, I'd probably wait
> to see what the file looks like with the renaming before a final decision
> on moving to its own file.
>
> 2. What happens if the input to update() has >2 dimensions? If that'd be
> invalid, we should explicitly check and complain. If it'll Do The Right
> Thing by operating on the first 2 dimensions (meaning correct indices)
> that's fine, but otherwise should probably complain.
>
> 3. Can this handle sparse input vectors? Not sure how important that is in
> general, even if your project doesn't require it. kll_sketch will ignore
> NaNs, so those appearing would mean the number of items per sketch can
> already differ.
>
> 4. I'd probably eat the very slightly increased space and go with 32 bits
> for the number of dimensions (aka number of sketches). If trying to look at
> a distribution of values for some machine learning application, it'd be
> easy to overflow 65k dimensions for some tasks.
>
> 5. I imagine you've realized that it's easiest to do unit tests from
> python in this case. That's another advantage of having this live in the
> wrapper.
>
> 6. Finally, that assert issue is already obsolete :). Asserts were
> converted to if/throw exceptions late last week. It'll be flagged as a
> conflict in merging, so no worries for now.
>
> Looking good at this point. And as I said, not all of these need changes
> or comments from you.
>
>   jon
>
> On Mon, May 11, 2020 at 7:09 AM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Understood, I went ahead and moved the new class to the kll_wrapper.cpp
> file -- I'll leave it to you to decide if it's better as its own file.
>
> Also, while gcc 7.4.0 compiles the code without issue, using gcc 7.5.0
> throws errors regarding the assert calls in kll_sketch_impl.hpp.  I added
> an include of assert.h there and then it compiled without issue.  It's
> possible that other compilers will also complain about that, so maybe this
> is a good update to the main branch.
>
> Michael
> ------------------------------
> *From:* Jon Malkin <jo...@gmail.com>
> *Sent:* Sunday, May 10, 2020 10:47 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> My only comment without having looked at actual code is that the new class
> would be more appropriate in the python wrapper. Maybe even drop it in as
> its own file, as that would decrease recompile time a bit when debugging
> (that's pybind's suggestion, anyway). Probably not a huge difference with
> how light these wrappers are.
>
> If this is something that becomes widely used, to where we look at pushing
> it into the base library, we'd look at whether we could share any data
> across sketches. But we're far from that point currently. It'd be nice to
> get to the point of needing to consider that.
>
>   jon
>
> On Sun, May 10, 2020, 7:33 PM leerho <le...@gmail.com> wrote:
>
> Michael,  this has been a great interchange and certainly will allow us to
> move forward more quickly.
>
> Thank you for working on this on a Mother's Day Sunday!
>
> I'm sure Alex and Jon may have more questions, when they get a chance to
> look at it starting tomorrow.
>
> Cheers, and be safe and well!
>
> Lee.
>
> On Sun, May 10, 2020 at 6:25 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Re: testing, so far I've just done glorified unit tests for uniform and
> normal distributions of varying sizes.  I plan to do some timing tests vs
> the existing single-sketch Python class to see how it compares for 1, 10,
> and 100 streams.
>
> 1. That makes sense.  One option to allow full Numpy compatibility but
> without requiring a Python user to use Numpy would be to return everything
> as lists, rather than Numpy arrays.  Numpy users could then convert those
> lists into arrays, and non-Numpy users would be unaffected (aside from
> needing the pybind11/numpy.h header).  Alternatively, some flag could be
> set when instantiating the object that would control whether things are
> returned as lists or arrays, though this still requires the numpy.h header
> file.
>
> 2. I didn't change the kll_sketch code, I only defined a new (wrapper)
> class called kll_sketches, which spawns a user-specified number of
> sketches.  Each of those sketches is a kll_sketch object and uses all of
> the existing code for that.  For fast execution in Python, the parallel
> sketches must be spawned in C++, but the existing Python object could only
> spawn a single sketch since it wraps the kll_sketch class.  Perhaps the
> kll_sketches class would be better placed in the python/src/kll_wrapper.cpp
> file?  I suppose you wouldn't need this class if you weren't using Python.
>
> 3. Yes, SerDe is very straight-forward here.  I've marked some stuff as
> todo's, and that is one of them -- the plan is to do like you described and
> call the relevant kll_sketch method on each of the sketches and return that
> to Python in a sensible format.  For deserialization, it would just iterate
> through them and load them into the kll_sketches object.  I don't require
> it for my project, so I didn't bother to wrap that yet -- I'll take a look
> sometime this week after I finish my work for the day, shouldn't take long
> to do.
>
> 4. That makes sense.  Does using Numpy complicate that at all?  My thought
> is that since under the hood everything is using the existing kll_sketch
> class, it would have full compatibility with the rest of the library (once
> SerDe is added in).
>
> Michael
>
> We would like to have you as part of our larger community, as others might
> also have suggestions on how to move your project forward.
> You can subscribe by sending an empty email to
> dev-subscribe@datasketches.apache.org.
>
> Lee.
>
> ---------- Forwarded message ---------
> From: *Jon Malkin* <jo...@gmail.com>
> Date: Thu, May 7, 2020 at 4:11 PM
> Subject: Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
> To: <de...@datasketches.apache.org>
> Cc: Lee Rhodes <lr...@verizonmedia.com>, Edo Liberty <
> edo.liberty@gmail.com>, edo@edoliberty.com <ed...@edoliberty.com>
>
>
> We're using pybind11 to get a C++ interface with python (vs raw C). The
> wrappers themselves are quite thin, but they do have examples of calling
> functions defined in the wrapper as opposed to only the sketch object.
>
> I believe the easiest way to do this will be to define a pretty simple C++
> object and create a pybind wrapper for it.  That object would contain a
> std::vector<kll_sketch>.  Then you'd define an update method for your
> custom object that iterates through a numpy array and calls update() on the
> appropriate sketch. You'd also want to define something similar for
> get_quantile() or whatever other methods you need that iterates through
> that vector of sketches and returns the result in a numpy array.
>
> That's a pretty lightweight object. And then you'd use a similar thin
> pybind wrapper around it to make it play nicely with python. Since our C++
> library is just templates, you'd end up with a free-standing library, with
> no requirement that the base datasketches library be involved.
>
>   jon
>
> On Thu, May 7, 2020 at 1:08 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> I would be happy to share whatever I come up with (if anything).  The lack
> of a Numpy/Scipy implementation is what led me to the DataSketches library,
> it would be very useful to myself and others if it were a part of
> Numpy/Scipy.
>
> For what it's worth, passing in a Numpy array and manipulating it from the
> C++ side is quite easy.  On the other hand, figuring out how to spawn m
> sketches and pass the values along to that looks like it'll be more
> challenging; there is a lot of code here and it'll take some time for me to
> familiarize myself with it.
>
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Thursday, May 7, 2020 12:00 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>
> *Cc:* Edo Liberty <ed...@gmail.com>; dev@datasketches.apache.org <
> dev@datasketches.apache.org>; edo@edoliberty.com <ed...@edoliberty.com>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> If you do figure out how to do this, it would be great if you could share
> it with us.  We would like to extend  it to other sketches and submit it as
> an added functionality to NumPy.  I have been looking at the NumPy and
> SciPy libraries and have not found anything close to what we have.
>
> Lee.
>
>
> On Thu, May 7, 2020 at 7:08 AM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Lee, Jon,
>
> Thanks for the information.  I tried to vectorize things this morning and
> ran into that exact problem -- since the offsets can differ, it leads to
> slices of different lengths, which wouldn't be possible to store as a
> single Numpy array.
>
> Lee, your understanding of my problem is spot on.  n vectors of size m,
> where all m elements of each vector are a float (no NaNs or missing
> values).  I am interested in quantiles at rank r for each of the m
> streams.  Only 1 sketch will operate simultaneously, saving/loading the
> sketch is not required (though it would be a nice feature), and sketches
> would not need to be merged (no serialization/deserialization).
>
> Not surprisingly, it looks like your original suggestion of handling this
> on the C++ side is the way to go.  Once I have time to dive into the code,
> my plan is to write something that implements what you described in the
> earlier email.
>
> Thanks,
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Wednesday, May 6, 2020 10:43 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>
> *Cc:* dev@datasketches.apache.org <de...@datasketches.apache.org>; Edo
> Liberty <ed...@gmail.com>; edo@edoliberty.com <ed...@edoliberty.com>
>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Michael,
>
> One of my colleagues, Jon Malkin, pointed out that the vector-KLL will not
> work for another reason and that is for each dimension, choosing whether to
> delete the odd or even values in the compactor must be random and
> independent of the other dimensions.  Otherwise you might get unwanted
> correlation effects between the dimensions.
>
> This is another argument that you should have independent compactors for
> each dimension.  So you might as well stick with individual sketches for
> each dimension.
>
> Lee.
>
> On Wed, May 6, 2020 at 4:39 PM Lee Rhodes <lr...@verizonmedia.com>
> wrote:
>
> Michael,
>
> Allow me to back up for a moment to make sure I understand your problem.
>
> You have a large number of large vectors of the form *V_n = {x_i}:*  *n*
> vectors of size *m*, where *x* is a *number* and *x_i* is the *i*th
> element, or equivalently, the *i*th dimension.
>
> Assumptions:
>
>    - All vectors, *V*, are of the same size *m.*
>    - All elements, *x_i*, are valid numbers of the same type. No missing
>    values, and if you are using *floats*, this means no *NaN*s.
>
> In aggregate, the *n* vectors represent *m* *independent* distributions
> of values.
>
> Your task is to be able to obtain *m* quantiles at rank *r* in a single
> query.
>
> ****
> To do this, using your idea, would require vectorization of the entire
> sketch and not just the compactors.  The inputs are vectors, the result of
> operations such as getQuantile(r), getQuantileUpperBound(r),
> getQuantileLowerBound(r), are also vectors.
>
> This sketch will be a large data structure, which leads to more questions
> ...
>
>    - Do you anticipate having many of these vectorized sketches operating
>    simultaneously?
>    - Is there any requirement to store and later retrieve this sketch?
>    - Or, the nearly equivalent question: Do you require merging of these
>    sketches (across clusters, for example)?  Which also means serialization
>    and deserialization.
>
> I am concerned that this vector-quantiles sketch would be limited in the
> sense that it may not be as widely applicable as it could be.
>
> Our experience with real data is that it is ugly with missing values, NaN,
> nulls, etc.  Which means we would not be able to vectorize the compactor.
> Each dimension *i* would need a separate independent compactor because
> the compaction times will vary depending on missing values or NaNs in the
> data.
>
> Spacewise, I don't think having separate independent sketches for each
> dimension would be much smaller than vectorizing the entire sketch, because
> the internals of the existing sketch are already quite space efficient
> leveraging compact arrays, etc.
>
> As a first step I would favor figuring out how to access the NumPy data
> structure on the C++ side, having individual sketches for each
> dimension, and doing the iterations updating the sketches in C++.   It also
> has the advantage of leveraging code that exists and it would automatically
> be able to leverage any improvements to the sketch code over time.  In
> addition, it could be a prototype of how to integrate other sketches into
> the NumPy ecosystem.
>
> A fully vectorized sketch would be a separate implementation and would not
> be able to take advantage of these points.
>
> Lee.
>
>
> On Wed, May 6, 2020 at 2:47 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Lee,
>
> I don't think there is a problem with the DataSketches library, just that
> it doesn't support what I am trying to do -- looking in the documentation,
> it only supports streams of ints or floats, and those situations work fine
> for me.  Here's what I did:
> - began with the KLL test .py file:
> https://github.com/apache/incubator-datasketches-cpp/blob/master/python/tests/kll_test.py
> - replaced line 30 with kll.update(np.ones(10) * randn())  to have a Numpy
> array of 10 identical values.
> - ran the code
>
> This leads to the following error, as expected:
> TypeError: update(): incompatible function arguments. The following
> argument types are supported:
>     1. (self: datasketches.kll_floats_sketch, item: float) -> None
>
> Invoked with: <datasketches.kll_floats_sketch object at 0x7f1e128989d0>,
> array([-1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424,
>        -1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424])
>
> It's not coded to support Numpy arrays, therefore it complains.  What I
> would ideally like to have happen in this scenario is it would treat each
> element in the array as a separate stream.  Then, later when getting a
> given quantile, it would give 10 values, one for each stream.  I don't see
> an easy approach to implementing this on the Python side besides a very
> slow iterative approach, and admittedly my C++ is quite rusty so I haven't
> looked into the codebase to see how I might modify things there to support
> this functionality.
>
> Re: the streaming-quantiles code being easily modified, I believe the only
> necessary changes would be changing the Compactor class to be a subclass of
> numpy.ndarray, rather than list, and implementing methods for the
> list-specific methods that are used, like .append().  Then, it isn't
> necessary to loop over the streams since we can make use of Numpy's
> broadcasting, which will handle the looping in its C++ code, as you
> mentioned.  I'll work on this and see if it really is as straight-forward
> as it seems.
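>
> As a rough illustration of that idea (the class and method names here are
> hypothetical, not taken from the streaming-quantiles code):
>
> import numpy as np
>
> class VectorCompactor(np.ndarray):
>     # one row per item, so each column is an independent stream and the
>     # compaction arithmetic can broadcast across columns
>     def __new__(cls, dim):
>         return np.empty((0, dim), dtype=float).view(cls)
>
>     # ndarrays are fixed-size, so unlike list.append this returns a grown
>     # copy instead of mutating in place
>     def append(self, row):
>         return np.vstack([self, np.atleast_2d(row)]).view(type(self))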
>
> If you have any advice on how to use DataSketches for my problem, I'm
> certainly open to that.
>
> Thanks,
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Wednesday, May 6, 2020 4:37 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>; dev@datasketches.apache.org
> <de...@datasketches.apache.org>
> *Cc:* Edo Liberty <ed...@gmail.com>; edo@edoliberty.com <
> edo@edoliberty.com>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Michael,
>
> Thank you for considering the DataSketches library.   I am adding this
> thread to our dev@datasketches.apache.org so that our whole team can
> contribute to finding a solution for you.
>
> WRT the error you experienced, please help us help you by sharing with us
> what the exact error was.
>
> We are about to release a major upgrade to the DataSketches C++/Python
> product in the next few weeks.  We have fixed a number of stability issues
> and bugs, which may solve the problem.  Nonetheless, we want to work with
> you to get your problem solved.
>
> Updating 1e5 sketches in a system is not a problem in Java or C++.   We
> have real-time systems today that generate and process over 1e9 sketches
> every day.  Unfortunately our experience tells us that looping in Python
> code will be 10 to 100 times slower than Java or C++.  This is because the
> code would have to switch from Python to C++ for every vector element.
>
> By comparison, the streaming-quantiles code could be easily modified to
> use Numpy arrays and operate on vectors.
>
>
> I would like to understand more about what you have in mind that would be
> "easily modified".
>
> NumPy achieves its speed performance by doing all of the matrix operations
> in pre-compiled C++ code.  To achieve best performance, we would want to
> read and loop through the NumPy data structure on the C++ side leveraging
> the C++ DataSketches library directly.  I am not sure what would be
> involved to actually accomplish that.
>
> But first we need to get your Python + NumPy code working correctly with
> our library so we can find out what its actual performance is.
>
> Cheers,
>
> Lee.
>
>
>
>
>
> On Wed, May 6, 2020 at 12:10 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Edo, Lee,
>
> Thanks for the prompt response.  I looked at the datasketches library, and
> while it seems to have a lot more features, it looks like it'll be a lot
> more difficult to get it to work for my desired use case.
>
> My problem is that I need quantiles for each element of a vector (length
> on the order of 1e4 -- 1e5), for some finite stream of vectors (on the
> order of 1e6 -- 1e8).  I tried using datasketches's KLL with Numpy arrays,
> but it throws an error, so it doesn't seem like datasketches handles this
> situation currently.
>
> To use datasketches, I think I would need to instantiate 1 object per
> vector element, and I suspect this will slow things down considerably due
> to iterating over the objects when each vector is processed.  By
> comparison, the streaming-quantiles code could be easily modified to use
> Numpy arrays and operate on vectors.  I ran a few unit tests on both codes
> and found equivalent behavior, as expected.
>
> Do you have any recommendation(s) for this situation?  Are there known
> limitations of the streaming-quantiles code that would cause issues for my
> use case?  Are the other methods offered in datasketches 'better' than the
> KLL implemented in streaming-quantiles?  I'm quite out of my area of
> expertise, so I appreciate any advice you can offer, and I will of course
> acknowledge it in the publication.
>
> Best,
> Michael
>
> ------------------------------
> *From:* Edo Liberty <ed...@gmail.com>
> *Sent:* Tuesday, May 5, 2020 8:09 PM
> *To:* Lee Rhodes <lr...@verizonmedia.com>; Michael Himes <
> mhimes@knights.ucf.edu>
> *Cc:* edo@edoliberty.com <ed...@edoliberty.com>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> +Lee
>
> Hi Michael, Thanks for reaching out.
> While you can certainly do that, I recommend using the python-Binded
> datasketches library. It will be more robust, faster, and bug free than my
> code :)
>
> On Tue, May 5, 2020 at 14:11 Michael Himes <mh...@knights.ucf.edu> wrote:
>
> Hi Edo,
>
> I'm currently working on a Python package for machine-learning-accelerated
> exoplanet modeling.  It is free and open source (see here if you're curious
> https://github.com/exosports/HOMER),
> and it's meant purely for reproducible academic research.
>
> I'm adding some new features to the software, and one of them requires
> computing quantiles for a data set that cannot fit into memory.  After
> searching around for different methods to do this, your KLL method seemed
> to be a good option in terms of speed and space requirements.
>
> Rather than reinvent the wheel and code my own implementation of the
> method from scratch, I was wondering if you'd be willing to allow me to use
> your code?  I don't see a license, so I wanted to make sure you're okay
> with this.  I could implement it as a submodule within my repo, or I could
> only include the kll.py file and add some additional comments pointing to
> your repo and such, whichever you prefer.
>
> Best,
> Michael
>
> --
> From my cell phone.
>
>

Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Posted by Michael Himes <mh...@knights.ucf.edu>.
Disregard, I figured it out.  The code was accessing some of the wrong indices, hence the zeros.  Fixed it and pushed.

Michael
________________________________
From: Michael Himes <mh...@knights.ucf.edu>
Sent: Tuesday, June 30, 2020 12:59 PM
To: dev@datasketches.apache.org <de...@datasketches.apache.org>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

I wrote up some unit tests for 2D and 3D cases, and I'm noticing some strange behavior.  I think it has to do with the py array object not being densely packed, because when I print out some of the values they are 0 in C++ but nonzero in Python.  Related, the recorded min/max values do not always match the true min/max values found in Python.

As best as I can tell, this behavior is due to the use of array.template rather than accessing the underlying buffer, as I had initially done in the early versions of the vectorized code.  The buffer is guaranteed to be densely packed, but it seems that the array.template is not.  I think this can be resolved by figuring out the stride size of the array.template, but I'm also not sure if the stride is guaranteed to be constant throughout the array.template.  I'll keep poking at it as I have time.
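
For reference, a quick way to see that packing issue from the Python side (this is plain NumPy behavior, nothing specific to the wrapper code):

import numpy as np
a = np.arange(12, dtype=np.float32).reshape(3, 4)
print(a.flags['C_CONTIGUOUS'])           # True  -- densely packed, row-major
print(a.T.flags['C_CONTIGUOUS'])         # False -- a transpose is just a view
print(a[:, ::2].flags['C_CONTIGUOUS'])   # False -- sliced view, non-unit stride
print(a.strides, a.T.strides)            # byte strides the C++ side must respect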

Michael
________________________________
From: Michael Himes <mh...@knights.ucf.edu>
Sent: Friday, June 26, 2020 9:43 AM
To: dev@datasketches.apache.org <de...@datasketches.apache.org>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Alright, I submitted the pull request.
https://github.com/apache/incubator-datasketches-cpp/pull/161

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>
Sent: Friday, June 26, 2020 3:26 AM
To: dev@datasketches.apache.org <de...@datasketches.apache.org>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Yeah, worth fixing. I guess that's fine since it'll check the dimensionality right after in case you attempt to feed it 3+ dimensions.

  jon

On Thu, Jun 25, 2020 at 1:15 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Looked into this a little bit more.  I tried flipping the axes of my Python array, but then the code crashes because it tries to access an index that is out of bounds.

Assuming that I am not just totally botching the usage, I made the following change
https://github.com/mdhimes/incubator-datasketches-cpp/blob/master/python/src/vector_of_kll.cpp#L191
and it works as expected now.  I can create a pull request if needed.

Michael
________________________________
From: Michael Himes <mh...@knights.ucf.edu>>
Sent: Thursday, June 25, 2020 2:22 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Hi Jon,

Just got around to updating my project to point to the current 2.0 version of Datasketches, and I'm hitting an error.

Whenever I pass in multi-dimensional arrays to update(), it complains about the dimensions.  The Numpy array has, e.g., 4 rows of 600 elements (array.shape gives (4, 600)), but I get a ValueError saying that the "input data must have rows with 600 elements. Found: 4".

Looking in the code here https://github.com/apache/incubator-datasketches-cpp/blob/master/python/src/vector_of_kll.cpp#L188
it seems to be expecting an array of shape (N_elements, N_rows), rather than (N_rows, N_elements).  Shouldn't the comparison be with items.shape(1) rather than items.shape(0)?  Or am I thinking about this backwards?

Thanks,
Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Monday, June 8, 2020 3:14 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Got distracted from this by a series of bugs both in our c++ release candidate as well as in another repo. Anyway, finally finished things and created a PR to push this into master. Feel free to comment: https://github.com/apache/incubator-datasketches-cpp/pull/156

  jon

On Wed, May 27, 2020 at 6:25 AM Michael Himes <mh...@knights.ucf.edu>> wrote:
Sounds good to me.

I've been thinking more about merging, and I think selectively merging individual dimensions would probably be unnecessary (if you have 1000 streams, why selectively merge 2 of those?).  But, one thing that I think might be a good idea to implement is to be able to merge all of the sketches into 1 sketch.  I don't need this for my work, but I can imagine an application where there are N streams of the same type of data and this would be useful.  Something to think about in the future.

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Tuesday, May 26, 2020 8:13 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Oh, and for merging, I'll make sure that both objects have the same number of dimensions and then merge things in. Should be fairly straightforward. Not going to support selectively merging individual dimensions, at least for now.


  jon

On Tue, May 26, 2020 at 4:52 PM Jon Malkin <jo...@gmail.com>> wrote:
Thanks for that!

We discussed things a bit on the ASF slack dev channel (datasketches-dev) and we'll go with vector_of_kll_sketches as the c++ object name. Probably something similar in python. So gotta do that, and then clean up unit test names. But it's in pretty good shape so far.

  jon

On Tue, May 26, 2020 at 7:57 AM Michael Himes <mh...@knights.ucf.edu>> wrote:
That's a great motto to code by!

I adapted the existing unit tests for kll_sketch to work for the new kll_sketches class, and everything seems to be working as intended.  Some things are not implemented -- merging, and the normalized_rank_error method (note that there is the get_normalized_rank_error static method) -- and are therefore not tested.  Once they are implemented into the class, then those tests can be added.

I've submitted a pull request, let me know if there are any other tests you'd like before it's considered tested & working.

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Tuesday, May 26, 2020 1:53 AM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

I think it now works for quantiles, rank, pmf, and cdf.

This exercise is a good example of why my colleague operates by the motto that if it isn't tested, it's broken. In very much related news, we need unit tests for this thing, in either C++ or python (probably the latter unless we move it into the core C++ part of the repo).

  jon

On Mon, May 25, 2020 at 2:06 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Ah gosh, that was silly on my part.

So, I ran the previous code without that silly mistake, then called kll.get_quantiles(0.5) and it threw this error:

TypeError: get_quantiles(): incompatible function arguments. The following argument types are supported:
    1. (self: datasketches.kll_floatarray_sketches, fractions: List[float], isk: numpy.ndarray[int32] = -1) -> array

Invoked with: <datasketches.kll_floatarray_sketches object at 0x7f610ce7de30>, 0.5

I also tried kll.get_quantiles([0.5]) and the Numpy array equivalent, and it throws this error:

ValueError: array has incorrect number of dimensions: 0; expected 1

This error happens even when I do kll.get_quantiles([0.5, 0.7]) or the Numpy array equivalent, even though it has 1 dimension, not 0.

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Monday, May 25, 2020 4:53 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

That's the range() command complaining -- 1e6 is a float, but range wants an int. It worked if I instead changed the line to
for i in range(int(1e6)):

  jon

On Mon, May 25, 2020 at 1:36 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Jon,

Just got around to testing it out.  Maybe I am doing something wrong here, but I can't get the code to work correctly.  Here's the code:

import numpy as np
from datasketches import kll_floatarray_sketches
k = 160
d = 3
kll = kll_floatarray_sketches(k, d)
for i in range(1e6):
  kll.update(np.random.randn(d))

And here's the error:

TypeError: 'float' object cannot be interpreted as an integer

Seems like the inputs have changed, but the inputs in the code look pretty similar.  Can you point out what I'm doing wrong here?

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Friday, May 22, 2020 6:21 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Hi Michael,

My default is to treat an input vector x as a column vector -- the generic quadratic form x'Ax assumes that, for instance. But that might be an engineering thing. Following your approach for now, and eventually we can debate whether to transpose the matrix if one dimension matches the number of sketches in the object but not the expected one.

Anyway, I looked more at the docs and see them using unchecked references (after doing a bounds check) so I switched to that, and then I added in a check for c-style vs fortran-style indexing so that I believe it'll have the inner loop over the native dimension. In theory it'll walk linearly through the matrix. That or I got it exactly backwards and am thrashing some cache level, one of the two :D

If you have some time, please check out the branch and play with it for a bit to ensure it's still behaving as you expect. Then we can figure out some relevant unit tests.

  jon



On Fri, May 22, 2020 at 7:06 AM Michael Himes <mh...@knights.ucf.edu>> wrote:
Jon,

Those changes sound great, as long as the data is being accessed correctly. The pybind docs warn about accessing data through the array_t object since it's not guaranteed to be contiguous in memory.  Typically, they demonstrate accessing it through the buffer, which I followed.  But if this is an unnecessary step, then great.

As for the 2D case, here is my line of thinking.  For 1D, we have a single row with d values.  So for 2D, we'd have n rows with d values, (n x d).  I believe that is how I coded it, but it's possible I flipped the dimensions.

Michael

________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Thursday, May 21, 2020 7:17 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Michael,

I've restructured the object to be an actual C++ object with proper methods. And then I've gotten rid of all the casts to buffer in favor of just using the py::array_t<> that's passed in. That removes casting everything to double, and allows for range checks. Now an attempt to access sketch 7 in a 5-d array doesn't just segfault :)

Looking at pybind docs a bit more, it seems there are no hard guarantees on data layout in memory with numpy arrays -- if you transpose one, walking through with a pointer will return items in the wrong order. So update() ends up using items.at() instead (more on that in a moment). The whole thing is probably also copying values around more than necessary. Anyway, we can look at ways to optimize such things eventually, but for now I'm working on ensuring correctness and at least somewhat graceful failure.

Anyway, item input order. If we have 1-d input, we implicitly assume we want d updates, one for each dimension in the object. It seems like the default for numpy is row-major order, which makes sense given C beneath the hood. But for inputting n points at a time, do you expect the matrix to be (d x n) or (n x d)?

  jon

On Tue, May 19, 2020, 5:20 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Re: the template type A, I set that for the Python array data type.  A Python float is 64 bits, so that is a C++ double.  I thought it was necessary to set the py::array_t data type since I think it's a template, but I could be mistaken.

Michael

________________________________
From: leerho <le...@gmail.com>>
Sent: Tuesday, May 19, 2020 7:46 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Excellent work!

On Tue, May 19, 2020 at 4:04 PM Jon Malkin <jo...@gmail.com>> wrote:
I also used k=160, so in this case we matched nicely. And the bunches of 2^5 or 2^7 you were testing are exactly what I meant when referring to batched inputs. So that's good news.

I'll take a more careful look through the code -- there was something with update using arrays of templated type A which was always cast to double, for instance. But this is certainly promising.

  jon

On Tue, May 19, 2020 at 3:32 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Great tests (especially with the ordering), Jon!

I did some scaling tests for dimensionality (1, 10, and 100), and this is where I think the Numpy version shows its benefits.  I performed a test similar to your setup:
- each sketch has k = 160 (unsure what you used for this value, if it matters)
- 2^25 draws from a normalized Gaussian distribution (numpy.random.normal)
- get_quantiles(0.5)

d=1    -- 84 s (this is the 123 s case you ran)
d=10   -- 88 s
d=100  --  294 s
d=1000 -- 2298 s (did this one for fun, but there is a lot of variability in runtime)

Note that I did not use a single-value method, just the Numpy version.  Also, I checked the compute cost of the Python loop, and it's about 1 second, so most of that ~80 seconds is the communication between Python and C++.  The scaling relation looks to be better than linear, but there needs to be a few more tests here to really determine that.

But, as Lee pointed out, there is non-negligible overhead from crossing the bridge between Python and C++.  It's small, but when doing it 2^25 times it adds up.  The Numpy implementation allows you to cross that bridge much less often, albeit at the cost of some extra time programming that part.  If I set up a queue that holds 2^5 values and then updates the sketch with the whole batch, it's quite a bit better.  Here are the results for the same dimensions as before:

d=1   -- 8 s
d=10  -- 31 s
d=100 -- 257 s

So, even with a small queue of 32 values, we see that a single sketch using kll_sketches is faster than a kll_sketch by a factor of 2-3.  And with the batch set to 2^7 values (this is how I use it in my project):
d=1   -- 4.2 s
d=10  -- 27 s
d=100 -- 251 s

The speed gain doesn't seem to scale with dimensionality, but I think that has more to do with the compute overhead of generating the data since Numpy tends to be faster when working in 1D vs multiple dimensions.  But we can see that it's possible to get runtimes much closer to C++ runtimes than would be expected.
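
For concreteness, the batching looks roughly like this (a sketch only -- the class name is the one used at this point in the thread, and the exact constructor/update signatures may differ):

import numpy as np
from datasketches import kll_floatarray_sketches

k, d, batch = 160, 10, 2**7
kll = kll_floatarray_sketches(k, d)
queue = np.empty((batch, d))
for i in range(2**20):
    queue[i % batch] = np.random.normal(size=d)
    if (i + 1) % batch == 0:
        kll.update(queue)              # one Python->C++ crossing per 2**7 rows
medians = kll.get_quantiles([0.5])     # one median per stream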

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Tuesday, May 19, 2020 4:58 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Well, one thought was maybe we could always use the vectorized kll in python and make it (relatively) easy to have it work with only 1 dimension. It looks like there's still a non-trivial performance hit from that. But wow.. I realized I could try something simple like reversing the declaration order of single-update vs vector-update in the wrapper class. And that dropped it to 35s!

With that, it may be worth exploring a unified wrapper that handles single items or vectors.

  jon

On Tue, May 19, 2020 at 1:52 PM leerho <le...@gmail.com>> wrote:
We had a similar issue in Java trying to use JNI to access C code.  Every transition across the "boundary" between Java and C took from 10 to 100 microseconds.  This made the JNI option pretty useless from our standpoint.

I don't know python that well, but I could well imagine that there may be a similar issue here in moving data between Python and C++.

That being said, compared to brute-force computation of these types of queries in Python vs using even these (what we consider slow performing) sketches in Python still may be a huge win.

Lee.



On Tue, May 19, 2020 at 1:28 PM Jon Malkin <jo...@gmail.com>> wrote:
I tried comparing the performance of the existing floats sketch vs the new thing with a single dimension. And then I made a second update method that handles a single item rather than creating an array of length 1 each time. Otherwise, the scripts were as identical as possible. I fed in 2^25 gaussian-distributed values and queried for the median to force some computation on the sketch. I think get_quantile(0.5) vs get_quantiles(0.5)[0][0] was the only difference.

Existing kll_floats_sketch: 31s
kll_floatarray_sketches: 123s
with single-item update: 80s

Same test in c++: 1.7s  (I can get it to 1.4s but that's using a worse RNG so this seemed more fair)

I didn't try anything with trying to batch updates, even though in theory the new object can support that. This was more a test to see the performance impact of using it for all kll sketches.

At some level, if you're already ok taking the speed hit for python vs C++ then maybe it doesn't matter. But >2x still seems significant to me.

  jon

On Thu, May 14, 2020 at 6:54 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Great, I'll be submitting the pull request shortly.  The codebase I'm working with doesn't have any of the changes made in the past week or so, hopefully that isn't too much of a hassle to merge.

As an aside, my employer encourages us to contribute code to libraries like this, so I'm happy to work on additional features for the Python interface as needed.

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Thursday, May 14, 2020 6:56 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

We've been polishing things up for a release, so that was one of several things that we fixed over the last several days. Thank you for finding it!

Anyway, if you're generally happy with the state of things (and are allowed to under any employment terms), I'd encourage you to create pull request to merge your changes into the main repo. It doesn't need to be perfect as we can always make changes as part of the PR review or post-merge.

Thanks,
  jon


On Mon, May 11, 2020 at 2:25 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Thanks for taking a look, Jon.

I pushed an update that address 2 & 4.

#3 is actually something I had a question about. I've tested passing numpy.nan into the update function, and it doesn't appear to break anything (min, max, etc all still work correctly).  However, the reported number of items per sketch counts the nan entries.  Is this the expected behavior, or should the get_n() method return a number that does not count the nans it has seen?  I expected the latter, so I'm worried that numpy's nan is being treated differently.

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Monday, May 11, 2020 4:32 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

I didn't look in super close detail, but the code overall looks pretty good. Comments are below.

Note that not all of these necessarily need changes or replies. I'm just trying to document things we'll want to think about for keeping the library general-purpose (and we can always make changes after merging, of course).

1. I worry the name kll_sketches is confusingly similar to kll_sketch. Maybe vector_kll_sketches? But if there's a way to extend KLL in the future to operate on an entire vector at a time (vs treating each dimension independently) that'd become confusing. I think an inherently vectorized version would be a very different beast, but I always worry I'm not being imaginative enough. If merging into the Apache codebase, I'd probably wait to see what the file looks like with the renaming before a final decision on moving to its own file.

2. What happens if the input to update() has >2 dimensions? If that'd be invalid, we should explicitly check and complain. If it'll Do The Right Thing by operating on the first 2 dimensions (meaning correct indices) that's fine, but otherwise should probably complain.

3. Can this handle sparse input vectors? Not sure how important that is in general, even if your project doesn't require it. kll_sketch will ignore NaNs, so those appearing would mean the number of items per sketch can already differ.

4. I'd probably eat the very slightly increased space and go with 32 bits for the number of dimensions (aka number of sketches). If trying to look at a distribution of values for some machine learning application, it'd be easy to overflow 65k dimensions for some tasks.

5. I imagine you've realized that it's easiest to do unit tests from python in this case. That's another advantage of having this live in the wrapper.

6. Finally, that assert issue is already obsolete :). Asserts were converted if/throw exceptions late last week. It'll be flagged as a conflict in merging, so no worries for now.

Looking good at this point. And as I said, not all of these need changes or comments from you.

  jon

On Mon, May 11, 2020 at 7:09 AM Michael Himes <mh...@knights.ucf.edu>> wrote:
Understood, I went ahead and moved the new class to the kll_wrapper.cpp file -- I'll leave it to you to decide if it's better as its own file.

Also, while gcc 7.4.0 compiles the code without issue, using gcc 7.5.0 throws errors regarding the assert calls in kll_sketch_impl.hpp.  I added an include of assert.h there and then it compiled without issue.  It's possible that other compilers will also complain about that, so maybe this is a good update to the main branch.

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Sunday, May 10, 2020 10:47 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

My only comment without having looked at actual code is that the new class would be more appropriate in the python wrapper. Maybe even drop it in as it's own file, as that would decrease recompile time a bit when debugging (that's pybind's suggestion, anyway). Probably not a huge difference with how light these wrappers are.

If this is something that becomes widely used, to where we look at pushing it into the base library, we'd look at whether we could share any data across sketches. But we're far from that point currently. It'd be nice to get to the point where we need to consider that.

  jon

On Sun, May 10, 2020, 7:33 PM leerho <le...@gmail.com>> wrote:
Michael,  this has been a great interchange and certainly will allow us to move forward more quickly.

Thank you for working on this on a Mother's Day Sunday!

I'm sure Alex and Jon may have more questions, when they get a chance to look at it starting tomorrow.

Cheers, and be safe and well!

Lee.

On Sun, May 10, 2020 at 6:25 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Re: testing, so far I've just done glorified unit tests for uniform and normal distributions of varying sizes.  I plan to do some timing tests vs the existing single-sketch Python class to see how it compares for 1, 10, and 100 streams.

1. That makes sense.  One option to allow full Numpy compatibility but without requiring a Python user to use Numpy would be to return everything as lists, rather than Numpy arrays.  Numpy users could then convert those lists into arrays, and non-Numpy users would be unaffected (aside from needing the pybind11/numpy.h header).  Alternatively, some flag could be set when instantiating the object that would control whether things are returned as lists or arrays, though this still requires the numpy.h header file.

2. I didn't change the kll_sketch code; I only defined a new (wrapper) class called kll_sketches, which spawns a user-specified number of sketches.  Each of those sketches is a kll_sketch object and uses all of the existing code for that.  For fast execution in Python, the parallel sketches must be spawned in C++, but the existing Python object could only spawn a single sketch since it wraps the kll_sketch class.  Perhaps the kll_sketches class would be better placed in the python/src/kll_wrapper.cpp file?  I suppose you wouldn't need this class if you weren't using Python.

3. Yes, SerDe is very straight-forward here.  I've marked some stuff as todo's, and that is one of them -- the plan is to do like you described and call the relevant kll_sketch method on each of the sketches and return that to Python in a sensible format.  For deserialization, it would just iterate through them and load them into the kll_sketches object.  I don't require it for my project, so I didn't bother to wrap that yet -- I'll take a look sometime this week after I finish my work for the day, shouldn't take long to do.

4. That makes sense.  Does using Numpy complicate that at all?  My thought is that since under the hood everything is using the existing kll_sketch class, it would have full compatibility with the rest of the library (once SerDe is added in).

Michael
________________________________
From: leerho <le...@gmail.com>>
Sent: Sunday, May 10, 2020 8:42 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Thanks for the link to your code.  My colleagues, Jon and Alex, will take a closer look this next week.  They wrote this code so they are much closer to it than I.

What you have done so far makes sense for you as you want to get this working in the NumPy environment as quickly as possible.  As soon as we start thinking about incorporating this into our library other concerns become important.

1. Adding API calls is the recommended way to add functionality (like NumPy) to a library.  We cannot change API calls in a way that is only useful with NumPy, because it would seriously impact other users of the library that don't need NumPy.  If both sets of calls cannot simultaneously exist in the same sketch API, then we need to consider other alternatives.

2.  Based on our previous discussions, I didn't envision that you would have to change the kll_sketch code itself other than perhaps a "wrapper" class that enables vectorized input to a vector of sketches and a vectorized get result that creates a vector result from a vector of sketches.  This would isolate the changes you need for NumPy from the sketch itself.  This is also much easier to support, maintain and debug.

3. If you don't change the internals of the sketch then SerDe becomes pretty straightforward. I don't know if you need a single serialization that represents a full vector of sketches,  but if you do, then I would just iterate over the individual serdes and figure out how to package it.  I really don't think you want to have to rewrite this low-level stuff.

4. Binary compatibility is critically important for us and I think will be important for you as well.  There are two dimensions of binary compatibility: history and language.  This means that a kll sketch serialized from Java can be successfully read by C++ and vice versa.  Similarly, a kll sketch serialized today will be able to be read many years from now.  Another aspect of this would mean being able to collect, say, 100 sketches that were not created using the NumPy version, and being able to put them together in a NumPy vector; and vice versa.

I hope all of this make sense to you.

Cheers,

Lee.



On Sun, May 10, 2020 at 4:21 PM leerho <le...@gmail.com>> wrote:
Michael,
This is great!  What testing have you been able to do so far?


On Sun, May 10, 2020 at 3:31 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Lee,

Thanks for all of that information, it's quite helpful to get a better understanding of things.

I've put the code on Github if you'd like to take a look: https://github.com/mdhimes/incubator-datasketches-cpp

Changes are
- new class in kll/include/kll_sketch.hpp, w/ associated constructor in kll/include/kll_sketch_impl.hpp.  This class spawns a specified number of sketches.
- new Python interface functions in python/src/kll_wrapper.cpp

The only new dependency introduced is the pybind11/numpy.h header file.  The new Numpy-compatible Python classes retain identical functionality to the existing classes (with minor changes to method names, e.g., get_min_value --> get_min_values), except that I have not yet implemented merging or (de)serialization.  These would be straight-forward to implement, if needed.
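
Roughly, usage looks like this (method names as above; the constructor arguments shown are assumptions and may change as the code evolves):

import numpy as np
from datasketches import kll_sketches

d = 100                     # number of independent streams
kll = kll_sketches(160, d)  # assumed arguments: (k, number_of_sketches)
for _ in range(10000):
    kll.update(np.random.rand(d))   # one value per stream
print(kll.get_min_values())         # per-stream minima, returned as a Numpy array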

Re: characterization tests, I'll take a look at those tests you linked to and see about running them, time and compute resources permitting.

Michael
________________________________
From: leerho <le...@gmail.com>>
Sent: Sunday, May 10, 2020 5:32 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Michael,

Is there a place on GitHub somewhere where I could look at your code so far?  The reason I ask, is before you do a PR, we would like to determine where a contribution such as this should be placed.

Our library is split up among different repositories, determined by language and dependencies.  This keeps the user downloads smaller and more focused.   We have two library repos for the core sketch algorithms, one for Java and one for C++/Python, where the dependencies are very lean, which simplifies integration into other systems.  We have separate repos for adaptors, which depend on one of the core repos. On the Java side, we have separate repos for adaptors for Apache Hive and Apache Pig, as the dependencies for each of these are quite large.  For C++, we have a dedicated repo for the adaptors for PostgreSQL.

Some of our adaptors are hosted with the target system.  For example, our Druid adaptors were contributed directly into Apache Druid.

I assume your code has dependencies on Python, NumPy and DataSketches-cpp. It is not clear to me at the moment whether we should create a separate repo for this or have a separate group of directories in our cpp repo.

****
We have a separate repo for our characterization code, which is not formally "released" as an Apache release.  It exists because we want others to be able to reproduce (or challenge) our claims of speed performance or accuracy.  It is the one repo where we have all languages and many different dependencies.  The coding style is not as rigorous or as well documented as our repos that do have formal releases.

Characterization testing is distinctly different from Unit Tests, which basically checks all the main code paths and makes sure that the program works as it should.  The key metric is code coverage and Unit Tests should be fast as it is run on every check-in of new code.  Characterization is also different from Integration Testing, which is testing how well the code works when integrated into larger systems.

Characterization tests are unique to our kind of library. Because our algorithms are probabilistic in nature, in order to verify accuracy or speed performance we need to run many thousands of trials to eliminate statistical noise in the results.  And when the data is large, this can take a long time.  You can peruse our website for many examples as all the plots result from various characterization studies.  What appears on the website is but a small fraction of all the testing we have done.

There are no "standard" tests as every sketch is different so we have to decide what is important to measure for a particular sketch, but the basic groups are speed and accuracy.

For speed there are many possible measurements, but the basic ones are update speed, merge speed, Serialization / Deserialization speed, get estimate or get result speeds.

For accuracy we want to validate that the sketch is performing within the bounds of the theoretical error distribution.  We want to measure this accuracy in the context of a stand-alone, purely streaming sketch and also in the context of merging many sketches together.

We also try to do these same tests comparing the results against other alternatives users might have.  We have performed these same characterizations on other publicly available sketches as well as against traditional, brute-force approaches to solving the same problem.

For the solution you have developed, we would depend on you to decide what properties would be most important to characterize for users of this solution.  It should be very similar to what you would write in a paper describing this solution;  you want to convince the reader that this is very useful and why.

Since the first sketch you have leveraged is the KLL quantiles sketch, I would think you would want some characterizations similar to what we did for our studies<https://datasketches.apache.org/docs/Quantiles/KLLSketch.html> comparing our older quantiles sketch and the KLL sketch.

****
For the Java characterization tests, we have "standardized" on having small configuration files which define the key parameters of the test.  These are simple text files<https://github.com/apache/incubator-datasketches-characterization/tree/master/src/main/resources> of key-value pairs.  We don't have any centralized definition of these pairs, just that they are human readable and intelligible.  They are different for each type of sketch.

For the C++ tests, we don't have a collection of config files yet (this is one of our TODOs), but the same kind of parameters are set in the code itself.

We will likely want to set up a separate directory for your characterization tests.

I hope you find this helpful.

Cheers,

Lee.

On Sun, May 10, 2020 at 10:05 AM Michael Himes <mh...@knights.ucf.edu>> wrote:
The code is in a good state now.  It can take individual values, lists, or Numpy arrays as input, and it returns back Numpy arrays.  There are some additional features, like being able to specify which sketches the user wants to, e.g., get quantiles for.

But, I have only done minor testing with uniform and normal distributions.  I'd like to put it through more extensive testing (and some documentation) before releasing it, and it sounds like your characterization tests are the way to go -- it's not science if it's not reproducible!  Is there a standard set of tests for this purpose?  If not, are there standard tests that have been used for the existing codebase?

Michael
________________________________
From: leerho <le...@gmail.com>>
Sent: Saturday, May 9, 2020 7:21 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

This is great.  The first step is to get your project working!  Once you think you are ready, it would be really useful if you could do some characterization testing in the NumPy environment. Characterization tests are what we run to fully understand how a sketch performs over a range of parameters and using thousands to millions of trials.  You can see some of the accuracy and speed performance plots of various sketches on our website.  Sometimes these can take hours to run.  We typically use synthetic data to drive our characterization tests to make them reproducible.

Real data can also be used and one comparison test I would recommend is comparing how long it takes to get approximate results using sketches versus how long it would take to get exact results using brute force methods.  The bigger the data set is the better :)
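
As a trivial example of that kind of comparison (placeholder sizes, and note the pure-Python update loop will dominate the sketch time here; the win shows up when the loop lives in C++ or the data no longer fits in memory):

import time
import numpy as np
from datasketches import kll_floats_sketch

data = np.random.randn(10000000).astype(np.float32)

t0 = time.time()
exact = np.quantile(data, 0.99)      # brute force: needs all the data in memory
t1 = time.time()

sk = kll_floats_sketch(200)
for x in data:                       # streaming: bounded memory regardless of n
    sk.update(float(x))
approx = sk.get_quantile(0.99)
t2 = time.time()

print("exact %.4f in %.1fs, sketch %.4f in %.1fs" % (exact, t1 - t0, approx, t2 - t1))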

We don't have much experience with NumPy so this will be a new environment for us.  But before you get too deep into this please get us involved.  We have been characterizing these streaming algorithms for a number of years, and would like to help you.

Cheers,

Lee.

On Sat, May 9, 2020 at 2:18 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
I'm not quite sure what being a committer entails, but yeah I'm happy to contribute.  I can't commit a lot of time to working on it, but with how things went for KLL I don't think it will take a lot of time for the other sketches if they are formatted in a similar manner.  Getting this library integrated into numpy/scipy would be awesome, I'm sure I could get some others in my field to begin using it.

Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Saturday, May 9, 2020 5:06 PM
To: Michael Himes <mh...@knights.ucf.edu>>; dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

This is just awesome!   Would you be interested in becoming a committer on our project?  It is not automatic, but we could work with you to bring you up to speed on the other sketches in the library.  If you could help us integrate DataSketches into NumPy and possibly SciPy (not sure if this is necessary) it would be a very significant contribution and we would definitely want you to be part of our community!

Thanks,

Lee.

On Sat, May 9, 2020 at 1:41 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Lee,

Thanks for the notice, I went ahead and subscribed to the list.

As for Jon's email, this is actually what I have currently implemented!  Once I finish ironing out a couple improvements, I'm going to move some code around to follow the existing coding style, put it on Github, and submit a pull request.

Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Saturday, May 9, 2020 4:22 PM
To: Michael Himes <mh...@knights.ucf.edu>>
Subject: Fwd: Permission to use KLL streaming-quantiles code in free open-source academic software

Hi Michael,
I don't think you saw this email as I doubt you are subscribed to our dev@datasketches.apache.org<ma...@datasketches.apache.org> email list.

We would like to have you as part of our larger community, as others might also have suggestions on how to move your project forward.
You can subscribe by sending an empty email to dev-subscribe@datasketches.apache.org<ma...@datasketches.apache.org>.

Lee.

---------- Forwarded message ---------
From: Jon Malkin <jo...@gmail.com>>
Date: Thu, May 7, 2020 at 4:11 PM
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software
To: <de...@datasketches.apache.org>>
Cc: Lee Rhodes <lr...@verizonmedia.com>>, Edo Liberty <ed...@gmail.com>>, edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>


We're using pybind11 to get a C++ interface with python (vs raw C). The wrappers themselves are quite thin, but they do have examples of calling functions defined in the wrapper as opposed to only the sketch object.

I believe the easiest way to do this will be to define a pretty simple C++ object and create a pybind wrapper for it.  That object would contain a std::vector<kll_sketch>.  Then you'd define an update method for your custom object that iterates through a numpy array and calls update() on the appropriate sketch. You'd also want to define something similar for get_quantile() or whatever other methods you need that iterates through that vector of sketches and returns the result in a numpy array.

That's a pretty lightweight object. And then you'd use a similar thin pybind wrapper around it to make it play nicely with python. Since our C++ library is just templates, you'd end up with a free-standing library, with no requirement that the base datasketches library be involved.
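
From the Python side, the goal would be something roughly like the following; the class name and signatures are placeholders for whatever the wrapper ends up exposing, not an existing API:

import numpy as np
# placeholder name for the pybind-wrapped C++ object described above
from datasketches import vector_of_kll_sketches

d, k = 100, 200
sketches = vector_of_kll_sketches(k, d)   # internally holds a std::vector<kll_sketch>
for _ in range(1000):
    sketches.update(np.random.randn(d))   # one C++ call per input vector
medians = sketches.get_quantiles(0.5)     # d values back as a numpy array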

  jon

On Thu, May 7, 2020 at 1:08 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
I would be happy to share whatever I come up with (if anything).  The lack of a Numpy/Scipy implementation is what led me to the DataSketches library; it would be very useful to me and others if it were a part of Numpy/Scipy.

For what it's worth, passing in a Numpy array and manipulating it from the C++ side is quite easy.  On the other hand, figuring out how to spawn m sketches and pass the values along to them looks like it'll be more challenging; there is a lot of code here, and it'll take some time for me to familiarize myself with it.

Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Thursday, May 7, 2020 12:00 PM
To: Michael Himes <mh...@knights.ucf.edu>>
Cc: Edo Liberty <ed...@gmail.com>>; dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>; edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

If you do figure out how to do this, it would be great if you could share it with us.  We would like to extend it to other sketches and submit it as an added functionality to NumPy.  I have been looking at the NumPy and SciPy libraries and have not found anything close to what we have.

Lee.


On Thu, May 7, 2020 at 7:08 AM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Lee, Jon,

Thanks for the information.  I tried to vectorize things this morning and ran into that exact problem -- since the offsets can differ, it leads to slices of different lengths, which wouldn't be possible to store as a single Numpy array.

Lee, your understanding of my problem is spot on.  n vectors of size m, where all m elements of each vector are a float (no NaNs or missing values).  I am interested in quantiles at rank r for each of the m streams.  Only 1 sketch will operate simultaneously, saving/loading the sketch is not required (though it would be a nice feature), and sketches would not need to be merged (no serialization/deserialization).

Not surprisingly, it looks like your original suggestion of handling this on the C++ side is the way to go.  Once I have time to dive into the code, my plan is to write something that implements what you described in the earlier email.

Thanks,
Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Wednesday, May 6, 2020 10:43 PM
To: Michael Himes <mh...@knights.ucf.edu>>
Cc: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>; Edo Liberty <ed...@gmail.com>>; edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>

Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Michael,

One of my colleagues, Jon Malkin, pointed out that the vector-KLL will not work for another reason: for each dimension, choosing whether to delete the odd or even values in the compactor must be random and independent of the other dimensions.  Otherwise you might get unwanted correlation effects between the dimensions.

This is another argument that you should have independent compactors for each dimension.  So you might as well stick with individual sketches for each dimension.

Lee.

On Wed, May 6, 2020 at 4:39 PM Lee Rhodes <lr...@verizonmedia.com>> wrote:
Michael,

Allow me to back up for a moment to make sure I understand your problem.

You have a large number of large vectors of the form V_n = {x_i}:  n vectors of size m, where x is a number and x_i is the ith element, or equivalently, the ith dimension.

Assumptions:

  *   All vectors, V, are of the same size m.
  *   All elements, x_i, are valid numbers of the same type. No missing values, and if you are using floats, this means no NaNs.

In aggregate, the n vectors represent m independent distributions of values.

Your task is to be able to obtain m quantiles at rank r in a single query.

****
To do this using your idea would require vectorizing the entire sketch and not just the compactors.  The inputs are vectors, and the results of operations such as getQuantile(r), getQuantileUpperBound(r), and getQuantileLowerBound(r) are also vectors.

This sketch will be a large data structure, which leads to more questions ...

  *   Do you anticipate having many of these vectorized sketches operating simultaneously?
  *   Is there any requirement to store and later retrieve this sketch?
  *   Or, the nearly equivalent question: Do you require merging of these sketches (across clusters, for example)?  Which also means serialization and deserialization.

I am concerned that this vector-quantiles sketch would be limited in the sense that it may not be as widely applicable as it could be.

Our experience with real data is that it is ugly with missing values, NaN, nulls, etc., which means we would not be able to vectorize the compactor.  Each dimension i would need a separate independent compactor because the compaction times will vary depending on missing values or NaNs in the data.

Spacewise, I don't think having separate independent sketches for each dimension would be much smaller than vectorizing the entire sketch, because the internals of the existing sketch are already quite space efficient leveraging compact arrays, etc.

As a first step I would favor figuring out how to access the NumPy data structure on the C++ side, having individual sketches for each dimension, and doing the iterations updating the sketches in C++.  It also has the advantage of using code that already exists, and it would automatically benefit from any improvements to the sketch code over time.  In addition, it could be a prototype of how to integrate other sketches into the NumPy ecosystem.

A fully vectorized sketch would be a separate implementation and would not be able to take advantage of these points.
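
For reference, the semantics of that first step (one existing sketch per dimension, with the update loop that would eventually move down into C++) can be written in a few lines of plain Python against the current binding:

import numpy as np
from datasketches import kll_floats_sketch

d, k = 100, 200
sketches = [kll_floats_sketch(k) for _ in range(d)]   # one sketch per dimension

def update(vector):
    # this Python-level loop is exactly what we would push into C++
    for sk, x in zip(sketches, vector):
        sk.update(float(x))

def get_quantiles(rank):
    return np.array([sk.get_quantile(rank) for sk in sketches])

for _ in range(10000):
    update(np.random.randn(d))
medians = get_quantiles(0.5)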

Lee.

On Wed, May 6, 2020 at 2:47 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Lee,

I don't think there is a problem with the DataSketches library, just that it doesn't support what I am trying to do -- looking in the documentation, it only supports streams of ints or floats, and those situations work fine for me.  Here's what I did:
- began with the KLL test .py file: https://github.com/apache/incubator-datasketches-cpp/blob/master/python/tests/kll_test.py
- replaced line 30 with kll.update(np.ones(10) * randn())  to have a Numpy array of 10 identical values.
- ran the code

This leads to the following error, as expected:
TypeError: update(): incompatible function arguments. The following argument types are supported:
    1. (self: datasketches.kll_floats_sketch, item: float) -> None

Invoked with: <datasketches.kll_floats_sketch object at 0x7f1e128989d0>, array([-1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424,
       -1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424])

It's not coded to support Numpy arrays, so it complains.  What I would ideally like to have happen in this scenario is for it to treat each element in the array as a separate stream.  Then, later when getting a given quantile, it would give 10 values, one for each stream.  I don't see an easy approach to implementing this on the Python side besides a very slow iterative approach, and admittedly my C++ is quite rusty, so I haven't looked into the codebase to see how I might modify things there to support this functionality.

Re: the streaming-quantiles code being easily modified, I believe the only necessary changes would be changing the Compactor class to be a subclass of numpy.ndarray, rather than list, and implementing methods for the list-specific methods that are used, like .append().  Then, it isn't necessary to loop over the streams since we can make use of Numpy's broadcasting, which will handle the looping in its C++ code, as you mentioned.  I'll work on this and see if it really is as straight-forward as it seems.

If you have any advice on how to use DataSketches for my problem, I'm certainly open to that.

Thanks,
Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Wednesday, May 6, 2020 4:37 PM
To: Michael Himes <mh...@knights.ucf.edu>>; dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Cc: Edo Liberty <ed...@gmail.com>>; edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Michael,

Thank you for considering the DataSketches library.   I am adding this thread to our dev@datasketches.apache.org<ma...@datasketches.apache.org> so that our whole team can contribute to finding a solution for you.

WRT the error you experienced, please help us help you by sharing with us what the exact error was.

We are about to release a major upgrade to the DataSketches C++/Python product in the next few weeks.  We have fixed a number of stability issues and bugs, which may solve the problem.  Nonetheless, we want to work with you to get your problem solved.

Updating 1e5 sketches in a system is not a problem in Java or C++.   We have real-time systems today that generate and process over 1e9 sketches every day.  Unfortunately our experience tells us that looping in Python code will be 10 to 100 times slower than Java or C++.  This is because the code would have to switch from Python to C++ for every vector element.

By comparison, the streaming-quantiles code could be easily modified to use Numpy arrays and operate on vectors.

I would like to understand more about what you have in mind that would be "easily modified".

NumPy achieves its speed performance by doing all of the matrix operations in pre-compiled C++ code.  To achieve best performance, we would want to read and loop through the NumPy data structure on the C++ side leveraging the C++ DataSketches library directly.  I am not sure what would be involved to actually accomplish that.

But first we need to get your Python + NumPy code working correctly with our library so we can find out what its actual performance is.

Cheers,

Lee.





On Wed, May 6, 2020 at 12:10 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Edo, Lee,

Thanks for the prompt response.  I looked at the datasketches library, and while it seems to have a lot more features, it looks like it'll be a lot more difficult to get it to work for my desired use case.

My problem is that I need quantiles for each element of a vector (length on the order of 1e4 -- 1e5), for some finite stream of vectors (on the order of 1e6 -- 1e8).  I tried using datasketches's KLL with Numpy arrays, but it throws an error, so it doesn't seem like datasketches handles this situation currently.

To use datasketches, I think I would need to instantiate 1 object per vector element, and I suspect this will slow things down considerably due to iterating over the objects when each vector is processed.  By comparison, the streaming-quantiles code could be easily modified to use Numpy arrays and operate on vectors.  I ran a few unit tests on both codes and found equivalent behavior, as expected.

Do you have any recommendation(s) for this situation?  Are there known limitations of the streaming-quantiles code that would cause issues for my use case?  Are the other methods offered in datasketches 'better' than the KLL implemented in streaming-quantiles?  I'm quite out of my area of expertise, so I appreciate any advice you can offer, and I will of course acknowledge it in the publication.

Best,
Michael

________________________________
From: Edo Liberty <ed...@gmail.com>>
Sent: Tuesday, May 5, 2020 8:09 PM
To: Lee Rhodes <lr...@verizonmedia.com>>; Michael Himes <mh...@knights.ucf.edu>>
Cc: edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

+Lee

Hi Michael, Thanks for reaching out.
While you can certainly do that, I recommend using the Python bindings of the datasketches library. It will be more robust, faster, and bug free than my code :)

On Tue, May 5, 2020 at 14:11 Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Edo,

I'm currently working on a Python package for machine-learning-accelerated exoplanet modeling.  It is free and open source (see here if you're curious https://github.com/exosports/HOMER), and it's meant purely for reproducible academic research.

I'm adding some new features to the software, and one of them requires computing quantiles for a data set that cannot fit into memory.  After searching around for different methods to do this, your KLL method seemed to be a good option in terms of speed and space requirements.

Rather than reinvent the wheel and code my own implementation of the method from scratch, I was wondering if you'd be willing to allow me to use your code?  I don't see a license, so I wanted to make sure you're okay with this.  I could implement it as a submodule within my repo, or I could only include the kll.py file and add some additional comments pointing to your repo and such, whichever you prefer.

Best,
Michael
--
From my cell phone.

Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Posted by Michael Himes <mh...@knights.ucf.edu>.
I wrote up some unit tests for 2D and 3D cases, and I'm noticing some strange behavior.  I think it has to do with the py array object not being densely packed, because when I print out some of the values they are 0 in C++ but nonzero in Python.  Related, the recorded min/max values do not always match the true min/max values found in Python.

As best as I can tell, this behavior is due to the use of array.template rather than accessing the underlying buffer, as I had initially done in the early versions of the vectorized code.  The buffer is guaranteed to be densely packed, but it seems that array.template is not.  I think this can be resolved by figuring out the stride size of array.template, but I'm also not sure if the stride is guaranteed to be constant throughout the array.  I'll keep poking at it as I have time.
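
The packing issue can be seen from Python alone, independent of the sketch code:

import numpy as np

a = np.arange(12, dtype=np.float32).reshape(3, 4)
b = a.T                            # a view over the same buffer, different strides
print(a.flags['C_CONTIGUOUS'])     # True
print(b.flags['C_CONTIGUOUS'])     # False: walking the raw buffer gives the wrong order
print(a.strides, b.strides)        # (16, 4) vs (4, 16)
c = np.ascontiguousarray(b)        # densely packed, row-major copy

So a transposed or sliced view handed to code that assumes a dense row-major buffer will read the wrong elements, which would explain the zeros and mismatched min/max values.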

Michael
________________________________
From: Michael Himes <mh...@knights.ucf.edu>
Sent: Friday, June 26, 2020 9:43 AM
To: dev@datasketches.apache.org <de...@datasketches.apache.org>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Alright, I submitted the pull request.
https://github.com/apache/incubator-datasketches-cpp/pull/161

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>
Sent: Friday, June 26, 2020 3:26 AM
To: dev@datasketches.apache.org <de...@datasketches.apache.org>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Yeah, worth fixing. I guess that's fine since it'll check the dimensionality right after in case you attempt to feed it 3+ dimensions.

  jon

On Thu, Jun 25, 2020 at 1:15 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Looked into this a little bit more.  I tried flipping the axes of my Python array, but then the code crashes because it tries to access an index that is out of memory.

Assuming that I am not just totally botching the usage, I made the following change
https://github.com/mdhimes/incubator-datasketches-cpp/blob/master/python/src/vector_of_kll.cpp#L191
and it works as expected now.  I can create a pull request if needed.

Michael
________________________________
From: Michael Himes <mh...@knights.ucf.edu>>
Sent: Thursday, June 25, 2020 2:22 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Hi Jon,

Just got around to updating my project to point to the current 2.0 version of Datasketches, and I'm hitting an error.

Whenever I pass in multi-dimensional arrays to update(), it complains about the dimensions.  The Numpy array has, e.g., 4 rows of 600 elements (array.shape gives (4, 600)), but I get a ValueError saying that the "input data must have rows with 600 elements. Found: 4".

Looking in the code here https://github.com/apache/incubator-datasketches-cpp/blob/master/python/src/vector_of_kll.cpp#L188
it seems to be expecting an array of shape (N_elements, N_rows), rather than (N_rows, N_elements).  Shouldn't the comparison be with items.shape(1) rather than items.shape(0)?  Or am I thinking about this backwards?
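
For reference, the shapes involved (values arbitrary):

import numpy as np

data = np.random.randn(4, 600)       # 4 rows of 600 elements each, as described above
print(data.shape[0], data.shape[1])  # 4, 600
# per the reading above, the length check compares against items.shape(0) (here 4)
# where it should compare against items.shape(1) (here 600), hence the ValueError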

Thanks,
Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Monday, June 8, 2020 3:14 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Got distracted from this by a series of bugs, both in our C++ release candidate and in another repo. Anyway, I finally finished things and created a PR to push this into master. Feel free to comment: https://github.com/apache/incubator-datasketches-cpp/pull/156

  jon

On Wed, May 27, 2020 at 6:25 AM Michael Himes <mh...@knights.ucf.edu>> wrote:
Sounds good to me.

I've been thinking more about merging, and I think selectively merging individual dimensions would probably be unnecessary (if you have 1000 streams, why selectively merge 2 of those?).  But, one thing that I think might be a good idea to implement is to be able to merge all of the sketches into 1 sketch.  I don't need this for my work, but I can imagine an application where there are N streams of the same type of data and this would be useful.  Something to think about in the future.

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Tuesday, May 26, 2020 8:13 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Oh, and for merging, I'll make sure that both objects have the same number of dimensions and then merge things in. Should be fairly straightforward. Not going to support selectively merging individual dimensions, at least for now.


  jon

On Tue, May 26, 2020 at 4:52 PM Jon Malkin <jo...@gmail.com>> wrote:
Thanks for that!

We discussed things a bit on the ASF slack dev channel (datasketches-dev) and we'll go with vector_of_kll_sketches as the c++ object name. Probably something similar in python. So gotta do that, and then clean up unit test names. But it's in pretty good shape so far.

  jon

On Tue, May 26, 2020 at 7:57 AM Michael Himes <mh...@knights.ucf.edu>> wrote:
That's a great motto to code by!

I adapted the existing unit tests for kll_sketch to work for the new kll_sketches class, and everything seems to be working as intended.  Some things are not implemented -- merging, and the normalized_rank_error method (note that there is the get_normalized_rank_error static method) -- and are therefore not tested.  Once they are implemented into the class, then those tests can be added.

I've submitted a pull request, let me know if there are any other tests you'd like before it's considered tested & working.

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Tuesday, May 26, 2020 1:53 AM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

I think it now works for quantiles, rank, pmf, and cdf.

This exercise is a good example of why my colleague operates by the motto that if it isn't tested, it's broken. In very much related news, we need unit tests for this thing, in either C++ or python (probably the latter unless we move it into the core C++ part of the repo).

  jon

On Mon, May 25, 2020 at 2:06 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Ah gosh, that was silly on my part.

So, I ran the previous code without that silly mistake, then called kll.get_quantiles(0.5) and it threw this error:

TypeError: get_quantiles(): incompatible function arguments. The following argument types are supported:
    1. (self: datasketches.kll_floatarray_sketches, fractions: List[float], isk: numpy.ndarray[int32] = -1) -> array

Invoked with: <datasketches.kll_floatarray_sketches object at 0x7f610ce7de30>, 0.5

I also tried kll.get_quantiles([0.5]) and the Numpy array equivalent, and it throws this error:

ValueError: array has incorrect number of dimensions: 0; expected 1

This error happens even when I do kll.get_quantiles([0.5, 0.7]) or the Numpy array equivalent, even though it has 1 dimension, not 0.

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Monday, May 25, 2020 4:53 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

That's the range() command complaining -- 1e6 is a float, but range wants an int. It worked if I instead changed the line to
for i in range(int(1e6)):

  jon

On Mon, May 25, 2020 at 1:36 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Jon,

Just got around to testing it out.  Maybe I am doing something wrong here, but I can't get the code to work correctly.  Here's the code:

import numpy as np
from datasketches import kll_floatarray_sketches
k = 160
d = 3
kll = kll_floatarray_sketches(k, d)
for i in range(1e6):
  kll.update(np.random.randn(d))

And here's the error:

TypeError: 'float' object cannot be interpreted as an integer

Seems like the inputs have changed, but the inputs in the code look pretty similar.  Can you point out what I'm doing wrong here?

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Friday, May 22, 2020 6:21 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Hi Michael,

My default is to treat an input vector x as a column vector -- the generic quadratic form x^T A x assumes that, for instance. But that might be an engineering thing. I'm following your approach for now, and eventually we can debate whether to transpose the matrix if one dimension matches the number of sketches in the object but not the expected one.

Anyway, I looked more at the docs and see them using unchecked references (after doing a bounds check) so I switched to that, and then I added in a check for c-style vs fortran-style indexing so that I believe it'll have the inner loop over the native dimension. In theory it'll walk linearly through the matrix. That or I got it exactly backwards and am thrashing some cache level, one of the two :D

If you have some time, please check out the branch and play with it for a bit to ensure it's still behaving as you expect. Then we can figure out some relevant unit tests.

  jon



On Fri, May 22, 2020 at 7:06 AM Michael Himes <mh...@knights.ucf.edu>> wrote:
Jon,

Those changes sound great, as long as the data is being accessed correctly. The pybind docs warn about accessing data through the array_t object since it's not guaranteed to be contiguous in memory.  Typically, they demonstrate accessing it through the buffer, which I followed.  But if this is an unnecessary step, then great.

As for the 2D case, here is my line of thinking.  For 1D, we have a single row with d values.  So for 2D, we'd have n rows with d values, (n x d).  I believe that is how I coded it, but it's possible I flipped the dimensions.

Michael

________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Thursday, May 21, 2020 7:17 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Michael,

I've restructured the object to be an actual C++ object with proper methods. And then I've gotten rid of all the casts to buffer in favor of just using the py::array_t<> that's passed in. That removes casting everything to double, and allows for range checks. Now an attempt to access sketch 7 in a 5-d array doesn't just segfault :)

Looking at pybind docs a bit more, it seems there are no hard guarantees on data layout in memory with numpy arrays -- if you transpose one, walking through with a pointer will return items in the wrong order. So update() ends up using items.at() instead (more on that in a moment). The whole thing is probably also copying values around more than necessary. Anyway, we can look at ways to optimize such things eventually, but for now I'm working on ensuring correctness and at least somewhat graceful failure.

Anyway, item input order. If we have 1-d input, we implicitly assume we want d updates, one for each dimension in the object. It seems like the default for numpy is row-major order, which makes sense given C beneath the hood. But for inputting n points at a time, do you expect the matrix to be (d x n) or (n x d)?

  jon

On Tue, May 19, 2020, 5:20 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Re: the template type A, I set that for the Python array data type.  A Python float is 64 bits, so that is a C++ double.  I thought it was necessary to set the py::array_t data type since I think it's a template, but I could be mistaken.

Michael

________________________________
From: leerho <le...@gmail.com>>
Sent: Tuesday, May 19, 2020 7:46 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Excellent work!

On Tue, May 19, 2020 at 4:04 PM Jon Malkin <jo...@gmail.com>> wrote:
I also used k=160, so in this case we matched nicely. And the bunches of 2^5 or 2^7 you were testing is exactly what I meant when referring to batched inputs. So that's good news.

I'll take a more careful look through the code -- there was something with update using arrays of templated type A which was always cast to double, for instance. But this is certainly promising.

  jon

On Tue, May 19, 2020 at 3:32 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Great tests (especially with the ordering), Jon!

I did some scaling tests for dimensionality (1, 10, and 100), and this is where I think the Numpy version shows its benefits.  I performed a test similar to your setup:
- each sketch has k = 160 (unsure what you used for this value, if it matters)
- 2^25 draws from a normalized Gaussian distribution (numpy.random.normal)
- get_quantiles(0.5)

d=1    -- 84 s (this is the 123 s case you ran)
d=10   -- 88 s
d=100  --  294 s
d=1000 -- 2298 s (did this one for fun, but there is a lot of variability in runtime)

Note that I did not use a single-value method, just the Numpy version.  Also, I checked the compute cost of the Python loop, and it's about 1 second, so most of that ~80 seconds is the communication between Python and C++.  The scaling relation looks to be better than linear, but there need to be a few more tests here to really determine that.

But, as Lee pointed out, there is non-negligible overhead from crossing the bridge between Python and C++.  It's small, but when doing it 2^25 times it adds up.  The Numpy implementation allows you to cross that bridge much less often, albeit at the cost of some extra time programming that part.  If I set up a queue that holds 2^5 values and then updates it, it's quite a bit better.  Here are the results for the same dimensions as before:

d=1   -- 8 s
d=10  -- 31 s
d=100 -- 257 s

So, even with a small queue of 32 values, we see that a single sketch using kll_sketches is faster than a kll_sketch by a factor of 2-3.  And with the batch set to 2^7 values (this is how I use it in my project):
d=1   -- 4.2 s
d=10  -- 27 s
d=100 -- 251 s

The speed gain doesn't seem to scale with dimensionality, but I think that has more to do with the compute overhead of generating the data since Numpy tends to be faster when working in 1D vs multiple dimensions.  But we can see that it's possible to get runtimes much closer to C++ runtimes than would be expected.
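
For concreteness, the batching described above amounts to something like this (class name and get_quantiles call as used in these tests; parameters taken from the d=10 row above):

import numpy as np
from datasketches import kll_floatarray_sketches   # name of the vectorized class at this point

k, d, batch = 160, 10, 2**7
kll = kll_floatarray_sketches(k, d)

buf = np.empty((batch, d))
filled = 0
for i in range(2**25):
    buf[filled] = np.random.randn(d)
    filled += 1
    if filled == batch:          # cross the Python/C++ boundary once per batch
        kll.update(buf)
        filled = 0
if filled:
    kll.update(buf[:filled])     # flush the remainder

medians = kll.get_quantiles(0.5)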

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Tuesday, May 19, 2020 4:58 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Well, one thought was maybe we could always use the vectorized kll in python and make it (relatively) easy to have it work with only 1 dimension. It looks like there's still a non-trivial performance hit from that. But wow.. I realized I could try something simple like reversing the declaration order of single-update vs vector-update in the wrapper class. And that dropped it to 35s!

With that, it may be worth exploring a unified wrapper that handles single items or vectors.

  jon

On Tue, May 19, 2020 at 1:52 PM leerho <le...@gmail.com>> wrote:
We had a similar issue in Java trying to use JNI to access C code.  Every transition across the "boundary" between Java and C took from 10 to 100 microseconds.  This made the JNI option pretty useless from our standpoint.

I don't know python that well, but I could well imagine that there may be a similar issue here in moving data between Python and C++.

That being said, compared to brute-force computation of these types of queries in Python, using even these (what we consider slow-performing) sketches from Python may still be a huge win.

Lee.



On Tue, May 19, 2020 at 1:28 PM Jon Malkin <jo...@gmail.com>> wrote:
I tried comparing the performance of the existing floats sketch vs the new thing with a single dimension. And then I made a second update method that handles a single item rather than creating an array of length 1 each time. Otherwise, the scripts were as identical as possible. I fed in 2^25 gaussian-distributed values and queried for the median to force some computation on the sketch. I think get_quantile(0.5) vs get_quantiles(0.5)[0][0] was the only difference.

Existing kll_floats_sketch: 31s
kll_floatarray_sketches: 123s
with single-item update: 80s

Same test in c++: 1.7s  (I can get it to 1.4s but that's using a worse RNG so this seemed more fair)
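
For anyone who wants to reproduce this, a stripped-down version of the comparison looks like the following (the vectorized class name is the one on the branch at this point; timings will of course vary by machine):

import time
import numpy as np
from datasketches import kll_floats_sketch, kll_floatarray_sketches

n, k = 2**25, 160
values = np.random.randn(n)

t0 = time.time()
sk = kll_floats_sketch(k)
for v in values:
    sk.update(float(v))
m1 = sk.get_quantile(0.5)
t1 = time.time()

vk = kll_floatarray_sketches(k, 1)     # single dimension
for v in values:
    vk.update(np.array([v]))           # length-1 array per update
m2 = vk.get_quantiles(0.5)[0][0]
t2 = time.time()

print("kll_floats_sketch: %.1fs, vectorized d=1: %.1fs" % (t1 - t0, t2 - t1))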

I didn't try anything with trying to batch updates, even though in theory the new object can support that. This was more a test to see the performance impact of using it for all kll sketches.

At some level, if you're already ok taking the speed hit for python vs C++ then maybe it doesn't matter. But >2x still seems significant to me.

  jon

On Thu, May 14, 2020 at 6:54 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Great, I'll be submitting the pull request shortly.  The codebase I'm working with doesn't have any of the changes made in the past week or so; hopefully that isn't too much of a hassle to merge.

As an aside, my employer encourages us to contribute code to libraries like this, so I'm happy to work on additional features for the Python interface as needed.

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Thursday, May 14, 2020 6:56 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

We've been polishing things up for a release, so that was one of several things that we fixed over the last several days. Thank you for finding it!

Anyway, if you're generally happy with the state of things (and are allowed to under any employment terms), I'd encourage you to create pull request to merge your changes into the main repo. It doesn't need to be perfect as we can always make changes as part of the PR review or post-merge.

Thanks,
  jon


On Mon, May 11, 2020 at 2:25 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Thanks for taking a look, Jon.

I pushed an update that address 2 & 4.

#3 is actually something I had a question about. I've tested passing numpy.nan into the update function, and it doesn't appear to break anything (min, max, etc all still work correctly).  However, the reported number of items per sketch counts the nan entries.  Is this the expected behavior, or should the get_n() method return a number that does not count the nans it has seen?  I expected the latter, so I'm worried that numpy's nan is being treated differently.
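
A minimal check for that question might look like this (class and method names as they stand in the pull request at this point; it only prints the observed behavior rather than asserting what it should be):

import numpy as np
from datasketches import kll_sketches   # wrapper class name at this point in the thread

kll = kll_sketches(160, 3)
kll.update(np.array([1.0, 2.0, 3.0]))
kll.update(np.array([np.nan, 5.0, np.nan]))
print(kll.get_n())           # does n count the nan entries per sketch?
print(kll.get_min_values())  # per-sketch minima; these appear unaffected by nan
print(kll.get_max_values())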

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Monday, May 11, 2020 4:32 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

I didn't look in super close detail, but the code overall looks pretty good. Comments are below.

Note that not all of these necessarily need changes or replies. I'm just trying to document things we'll want to think about for keeping the library general-purpose (and we can always make changes after merging, of course).

1. I worry the name kll_sketches is confusingly similar to kll_sketch. Maybe vector_kll_sketches? But if there's a way to extend KLL in the future to operate on an entire vector at a time (vs treating each dimension independently) that'd become confusing. I think an inherently vectorized version would be a very different beast, but I always worry I'm not being imaginative enough. If merging into the Apache codebase, I'd probably wait to see what the file looks like with the renaming before a final decision on moving to its own file.

2. What happens if the input to update() has >2 dimensions? If that'd be invalid, we should explicitly check and complain. If it'll Do The Right Thing by operating on the first 2 dimensions (meaning correct indices) that's fine, but otherwise should probably complain.

3. Can this handle sparse input vectors? Not sure how important that is in general, even if your project doesn't require it. kll_sketch will ignore NaNs, so those appearing would mean the number of items per sketch can already differ.

4. I'd probably eat the very slightly increased space and go with 32 bits for the number of dimensions (aka number of sketches). If trying to look at a distribution of values for some machine learning application, it'd be easy to overflow 65k dimensions for some tasks.

5. I imagine you've realized that it's easiest to do unit tests from python in this case. That's another advantage of having this live in the wrapper.

6. Finally, that assert issue is already obsolete :). Asserts were converted if/throw exceptions late last week. It'll be flagged as a conflict in merging, so no worries for now.

Looking good at this point. And as I said, not all of these need changes or comments from you.

  jon

On Mon, May 11, 2020 at 7:09 AM Michael Himes <mh...@knights.ucf.edu>> wrote:
Understood, I went ahead and moved the new class to the kll_wrapper.cpp file -- I'll leave it to you to decide if it's better as its own file.

Also, while gcc 7.4.0 compiles the code without issue, using gcc 7.5.0 throws errors regarding the assert calls in kll_sketch_impl.hpp.  I added an include of assert.h there and then it compiled without issue.  It's possible that other compilers will also complain about that, so maybe this is a good update to the main branch.

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Sunday, May 10, 2020 10:47 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

My only comment without having looked at actual code is that the new class would be more appropriate in the python wrapper. Maybe even drop it in as its own file, as that would decrease recompile time a bit when debugging (that's pybind's suggestion, anyway). Probably not a huge difference with how light these wrappers are.

If this is something that becomes widely used, to the point where we look at pushing it into the base library, we'd look at whether we could share any data across sketches. But we're far from that point currently. It'd be a nice problem to have to consider, though.

  jon

On Sun, May 10, 2020, 7:33 PM leerho <le...@gmail.com>> wrote:
Michael,  this has been a great interchange and certainly will allow us to move forward more quickly.

Thank you for working on this on a Mother's Day Sunday!

I'm sure Alex and Jon may have more questions, when they get a chance to look at it starting tomorrow.

Cheers, and be safe and well!

Lee.

On Sun, May 10, 2020 at 6:25 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Re: testing, so far I've just done glorified unit tests for uniform and normal distributions of varying sizes.  I plan to do some timing tests vs the existing single-sketch Python class to see how it compares for 1, 10, and 100 streams.

1. That makes sense.  One option to allow full Numpy compatibility but without requiring a Python user to use Numpy would be to return everything as lists, rather than Numpy arrays.  Numpy users could then convert those lists into arrays, and non-Numpy users would be unaffected (aside from needing the pybind11/numpy.h header).  Alternatively, some flag could be set when instantiating the object that would control whether things are returned as lists or arrays, though this still requires the numpy.h header file.

2. I didn't change the kll_sketch code, I only defined a new (wrapper) class called kll_sketches, which spawns a user-specified number of sketches.  Each of those sketches is a kll_sketch object and uses all of the existing code for that.  For fast execution in Python, the parallel sketches must be spawned in C++, but the existing Python object can only spawn a single sketch since it wraps the kll_sketch class.  Perhaps the kll_sketches class would be better placed in the python/src/kll_wrapper.cpp file?  I suppose you wouldn't need this class if you weren't using Python.

3. Yes, SerDe is very straightforward here.  I've marked some stuff as todos, and that is one of them -- the plan is to do as you described and call the relevant kll_sketch method on each of the sketches and return that to Python in a sensible format (roughly as sketched after this list).  For deserialization, it would just iterate through them and load them into the kll_sketches object.  I don't require it for my project, so I didn't bother to wrap that yet -- I'll take a look sometime this week after I finish my work for the day; it shouldn't take long to do.

4. That makes sense.  Does using Numpy complicate that at all?  My thought is that since under the hood everything is using the existing kll_sketch class, it would have full compatibility with the rest of the library (once SerDe is added in).
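
One simple way to package a whole vector of sketches, using only the per-sketch serialize()/deserialize() calls plus explicit length prefixes (a sketch of the idea, not a format anyone has standardized on):

import struct
from datasketches import kll_floats_sketch

def serialize_all(sketches):
    # length-prefix each sketch's bytes so they can be split apart again
    parts = [struct.pack('<I', len(sketches))]
    for sk in sketches:
        raw = sk.serialize()
        parts.append(struct.pack('<I', len(raw)))
        parts.append(raw)
    return b''.join(parts)

def deserialize_all(blob):
    count = struct.unpack_from('<I', blob, 0)[0]
    pos = 4
    sketches = []
    for _ in range(count):
        size = struct.unpack_from('<I', blob, pos)[0]
        pos += 4
        sketches.append(kll_floats_sketch.deserialize(blob[pos:pos + size]))
        pos += size
    return sketches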

Michael
________________________________
From: leerho <le...@gmail.com>>
Sent: Sunday, May 10, 2020 8:42 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Thanks for the link to your code.  My colleagues, Jon and Alex, will take a closer look this next week.  They wrote this code so they are much closer to it than I.

What you have done so far makes sense for you as you want to get this working in the NumPy environment as quickly as possible.  As soon as we start thinking about incorporating this into our library other concerns become important.

1. Adding API calls is the recommended way to add functionality (like NumPy) to a library.  We cannot change API calls in a way that is only useful with NumPy, because it would seriously impact other users of the library that don't need NumPy.  If both sets of calls cannot simultaneously exist in the same sketch API, then we need to consider other alternatives.

2.  Based on our previous discussions, I didn't envision that you would have to change the kll_sketch code itself other than perhaps a "wrapper" class that enables vectorized input to a vector of sketches and a vectorized get result that creates a vector result from a vector of sketches.  This would isolate the changes you need for NumPy from the sketch itself.  This is also much easier to support, maintain and debug.

3. If you don't change the internals of the sketch then SerDe becomes pretty straightforward. I don't know if you need a single serialization that represents a full vector of sketches,  but if you do, then I would just iterate over the individual serdes and figure out how to package it.  I really don't think you want to have to rewrite this low-level stuff.

4. Binary compatibility is critically important for us and I think will be important for you as well.  There are two dimensions of binary compatibility: history and language.  This means that a kll sketch serialized from Java can be successfully read by C++ and vice versa.  Similarly, a kll sketch serialized today will be able to be read many years from now.  Another aspect of this would be being able to collect, say, 100 sketches that were not created using the NumPy version and put them together in a NumPy vector, and vice versa.

I hope all of this make sense to you.

Cheers,

Lee.



On Sun, May 10, 2020 at 4:21 PM leerho <le...@gmail.com>> wrote:
Michael,
This is great!  What testing have you been able to do so far?


On Sun, May 10, 2020 at 3:31 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Lee,

Thanks for all of that information, it's quite helpful to get a better understanding of things.

I've put the code on Github if you'd like to take a look: https://github.com/mdhimes/incubator-datasketches-cpp

Changes are
- new class in kll/include/kll_sketch.hpp, w/ associated constructor in kll/include/kll_sketch_impl.hpp.  This class spawns a specified number of sketches.
- new Python interface functions in python/src/kll_wrapper.cpp

The only new dependency introduced is the pybind11/numpy.h header file.  The new Numpy-compatible Python classes retain identical functionality to the existing classes (with minor changes to method names, e.g., get_min_value --> get_min_values), except that I have not yet implemented merging or (de)serialization.  These would be straight-forward to implement, if needed.

Re: characterization tests, I'll take a look at those tests you linked to and see about running them, time and compute resources permitting.

Michael
________________________________
From: leerho <le...@gmail.com>>
Sent: Sunday, May 10, 2020 5:32 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Michael,

Is there a place on GitHub somewhere where I could look at your code so far?  The reason I ask, is before you do a PR, we would like to determine where a contribution such as this should be placed.

Our library is split up among different repositories, determined by language and dependencies.  This keeps the user downloads smaller and more focused.   We have two library repos for the core sketch algorithms, one for Java and one for C++/Python, where the dependencies are very lean, which simplifies integration into other systems.  We have separate repos for adaptors, which depend on one of the core repos. On the Java side, we have separate repos for adaptors for Apache Hive and Apache Pig, as the dependencies for each of these are quite large.  For C++, we have a dedicated repo for the adaptors for PostgreSQL.

Some of our adaptors are hosted with the target system.  For example, our Druid adaptors were contributed directly into Apache Druid.

I assume your code has dependencies on Python, NumPy and DataSketches-cpp. It is not clear to me at the moment whether we should create a separate repo for this or have a separate group of directories in our cpp repo.

****
We have a separate repo for our characterization code, which is not formally "released" as an Apache release.  It exists because we want others to be able to reproduce (or challenge) our claims of speed performance or accuracy.  It is the one repo where we have all languages and many different dependencies.  The coding style is not as rigorous or as well documented as our repos that do have formal releases.

Characterization testing is distinctly different from Unit Tests, which basically check all the main code paths and make sure that the program works as it should.  The key metric there is code coverage, and Unit Tests should be fast since they are run on every check-in of new code.  Characterization is also different from Integration Testing, which tests how well the code works when integrated into larger systems.

Characterization tests are unique to our kind of library. Because our algorithms are probabilistic in nature, in order to verify accuracy or speed performance we need to run many thousands of trials to eliminate statistical noise in the results.  And when the data is large, this can take a long time.  You can peruse our website for many examples as all the plots result from various characterization studies.  What appears on the website is but a small fraction of all the testing we have done.

There are no "standard" tests as every sketch is different so we have to decide what is important to measure for a particular sketch, but the basic groups are speed and accuracy.

For speed there are many possible measurements, but the basic ones are update speed, merge speed, Serialization / Deserialization speed, get estimate or get result speeds.

For accuracy we want to validate that the sketch is performing within the bounds of the theoretical error distribution.  We want to measure this accuracy in the context of a stand-alone, purely streaming sketch and also in the context of merging many sketches together.

We also try to do these same tests comparing the results against other alternatives users might have.  We have performed these same characterizations on other publicly available sketches as well as against traditional, brute-force approaches to solving the same problem.

For the solution you have developed, we would depend on you to decide what properties would be most important to characterize for users of this solution.  It should be very similar to what you would write in a paper describing this solution;  you want to convince the reader that this is very useful and why.

Since the first sketch you have leveraged is the KLL quantiles sketch, I would think you would want some characterizations similar to what we did for our studies <https://datasketches.apache.org/docs/Quantiles/KLLSketch.html> comparing our older quantiles sketch and the KLL sketch.

****
For the Java characterization tests, we have "standardized" on having small configuration files which define the key parameters of the test.  These are simple text files <https://github.com/apache/incubator-datasketches-characterization/tree/master/src/main/resources> of key-value pairs.  We don't have any centralized definition of these pairs, just that they are human readable and intelligible.  They are different for each type of sketch.

For the C++ tests, we don't have a collection of config files yet (this is one of our TODOs), but the same kind of parameters are set in the code itself.

We will likely want to set up a separate directory for your characterization tests.

I hope you find this helpful.

Cheers,

Lee.

On Sun, May 10, 2020 at 10:05 AM Michael Himes <mh...@knights.ucf.edu>> wrote:
The code is in a good state now.  It can take individual values, lists, or Numpy arrays as input, and it returns back Numpy arrays.  There are some additional features, like being able to specify which sketches the user wants to, e.g., get quantiles for.

But, I have only done minor testing with uniform and normal distributions.  I'd like to put it through more extensive testing (and some documentation) before releasing it, and it sounds like your characterization tests are the way to go -- it's not science if it's not reproducible!  Is there a standard set of tests for this purpose?  If not, are there standard tests that have been used for the existing codebase?

Michael
________________________________
From: leerho <le...@gmail.com>>
Sent: Saturday, May 9, 2020 7:21 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

This is great.  The first step is to get your project working!  Once you think you are ready, it would be really useful if you could do some characterization testing in the NumPy environment. Characterization tests are what we run to fully understand how a sketch performs over a range of parameters and using thousands to millions of trials.  You can see some of the accuracy and speed performance plots of various sketches on our website.  Sometimes these can take hours to run.  We typically use synthetic data to drive our characterization tests to make them reproducible.

Real data can also be used and one comparison test I would recommend is comparing how long it takes to get approximate results using sketches versus how long it would take to get exact results using brute force methods.  The bigger the data set is the better :)
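
(Purely as a shape for such a comparison, with arbitrary sizes: at sizes that fit in memory, and with the update loop in Python, NumPy's exact computation will usually win; the interesting regime is data too large to hold or re-scan.)

import time
import numpy as np
from datasketches import kll_floats_sketch

n, k, rank = 1 << 22, 200, 0.5
data = np.random.randn(n)

t0 = time.perf_counter()
exact = np.quantile(data, rank)           # brute force: needs all the data in memory
t1 = time.perf_counter()

sk = kll_floats_sketch(k)
for x in data:                            # one streaming pass; in practice this loop belongs in C++
    sk.update(float(x))
approx = sk.get_quantile(rank)
t2 = time.perf_counter()

print(f"exact: {t1 - t0:.2f}s  sketch: {t2 - t1:.2f}s  |error|: {abs(exact - approx):.4f}")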

We don't have much experience with NumPy so this will be a new environment for us.  But before you get too deep into this please get us involved.  We have been characterizing these streaming algorithms for a number of years, and would like to help you.

Cheers,

Lee.

On Sat, May 9, 2020 at 2:18 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
I'm not quite sure what being a committer entails, but yeah I'm happy to contribute.  I can't commit a lot of time to working on it, but with how things went for KLL I don't think it will take a lot of time for the other sketches if they are formatted in a similar manner.  Getting this library integrated into numpy/scipy would be awesome; I'm sure I could get some others in my field to begin using it.

Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Saturday, May 9, 2020 5:06 PM
To: Michael Himes <mh...@knights.ucf.edu>>; dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

This is just awesome!   Would you be interested in becoming a committer on our project?  It is not automatic, but we could work with you to bring you up to speed on the other sketches in the library.  If you could help us integrate DataSketches into NumPy and possibly SciPy (not sure if this is necessary) it would be a very significant contribution and we would definitely want you to be part of our community!

Thanks,

Lee.

On Sat, May 9, 2020 at 1:41 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Lee,

Thanks for the notice, I went ahead and subscribed to the list.

As for Jon's email, this is actually what I have currently implemented!  Once I finish ironing out a couple improvements, I'm going to move some code around to follow the existing coding style, put it on Github, and submit a pull request.

Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Saturday, May 9, 2020 4:22 PM
To: Michael Himes <mh...@knights.ucf.edu>>
Subject: Fwd: Permission to use KLL streaming-quantiles code in free open-source academic software

Hi Michael,
I don't think you saw this email as I doubt you are subscribed to our dev@datasketches.apache.org<ma...@datasketches.apache.org> email list.

We would like to have you as part of our larger community, as others might also have suggestions on how to move your project forward.
You can subscribe by sending an empty email to dev-subscribe@datasketches.apache.org<ma...@datasketches.apache.org>.

Lee.

---------- Forwarded message ---------
From: Jon Malkin <jo...@gmail.com>>
Date: Thu, May 7, 2020 at 4:11 PM
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software
To: <de...@datasketches.apache.org>>
Cc: Lee Rhodes <lr...@verizonmedia.com>>, Edo Liberty <ed...@gmail.com>>, edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>


We're using pybind11 to get a C++ interface with python (vs raw C). The wrappers themselves are quite thin, but they do have examples of calling functions defined in the wrapper as opposed to only the sketch object.

I believe the easiest way to do this will be to define a pretty simple C++ object and create a pybind wrapper for it.  That object would contain a std::vector<kll_sketch>.  Then you'd define an update method for your custom object that iterates through a numpy array and calls update() on the appropriate sketch. You'd also want to define something similar for get_quantile() or whatever other methods you need that iterates through that vector of sketches and returns the result in a numpy array.

That's a pretty lightweight object. And then you'd use a similar thin pybind wrapper around it to make it play nicely with python. Since our C++ library is just templates, you'd end up with a free-standing library, with no requirement that the base datasketches library be involved.
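
(For illustration, a rough pure-Python rendering of the object described above; in the actual design the loops live in C++ behind the pybind11 wrapper, and the class and method names here are placeholders, not library API.)

import numpy as np
from datasketches import kll_floats_sketch

class VectorOfKllSketches:
    """Toy stand-in: one KLL sketch per dimension, fed from length-d vectors."""
    def __init__(self, k, d):
        self.sketches = [kll_floats_sketch(k) for _ in range(d)]

    def update(self, values):
        # values: 1-D array of length d; element i feeds sketch i
        for sk, v in zip(self.sketches, values):
            sk.update(float(v))

    def get_quantile(self, rank):
        # one quantile estimate per dimension, returned as a NumPy array
        return np.array([sk.get_quantile(rank) for sk in self.sketches])

Doing these loops from Python is exactly the per-element boundary crossing discussed earlier in the thread; moving them into the C++ wrapper is what makes the vectorized update cheap.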

  jon

On Thu, May 7, 2020 at 1:08 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
I would be happy to share whatever I come up with (if anything).  The lack of a Numpy/Scipy implementation is what led me to the DataSketches library, it would be very useful to myself and others if it were a part of Numpy/Scipy.

For what it's worth, passing in a Numpy array and manipulating it from the C++ side is quite easy.  On the other hand, figuring out how to spawn m sketches and pass the values along to them looks like it'll be more challenging; there is a lot of code here and it'll take some time for me to familiarize myself with it.

Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Thursday, May 7, 2020 12:00 PM
To: Michael Himes <mh...@knights.ucf.edu>>
Cc: Edo Liberty <ed...@gmail.com>>; dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>; edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

If you do figure out how to do this, it would be great if you could share it with us.  We would like to extend it to other sketches and submit it as added functionality to NumPy.  I have been looking at the NumPy and SciPy libraries and have not found anything close to what we have.

Lee.


On Thu, May 7, 2020 at 7:08 AM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Lee, Jon,

Thanks for the information.  I tried to vectorize things this morning and ran into that exact problem -- since the offsets can differ, it leads to slices of different lengths, which wouldn't be possible to store as a single Numpy array.

Lee, your understanding of my problem is spot on.  n vectors of size m, where all m elements of each vector are a float (no NaNs or missing values).  I am interested in quantiles at rank r for each of the m streams.  Only 1 sketch will operate simultaneously, saving/loading the sketch is not required (though it would be a nice feature), and sketches would not need to be merged (no serialization/deserialization).

Not surprisingly, it looks like your original suggestion of handling this on the C++ side is the way to go.  Once I have time to dive into the code, my plan is to write something that implements what you described in the earlier email.

Thanks,
Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Wednesday, May 6, 2020 10:43 PM
To: Michael Himes <mh...@knights.ucf.edu>>
Cc: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>; Edo Liberty <ed...@gmail.com>>; edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>

Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Michael,

One of my colleagues, Jon Malkin, pointed out that the vector-KLL will not work for another reason: for each dimension, choosing whether to delete the odd or even values in the compactor must be random and independent of the other dimensions.  Otherwise you might get unwanted correlation effects between the dimensions.

This is another argument that you should have independent compactors for each dimension.  So you might as well stick with individual sketches for each dimension.

Lee.

On Wed, May 6, 2020 at 4:39 PM Lee Rhodes <lr...@verizonmedia.com>> wrote:
Michael,

Allow me to back up for a moment to make sure I understand your problem.

You have a large number of large vectors of the form V_n = {x_i}:  n vectors of size m, where x is a number and x_i is the ith element, or equivalently, the ith dimension.

Assumptions:

  *   All vectors, V, are of the same size m.
  *   All elements, x_i, are valid numbers of the same type. No missing values, and if you are using floats, this means no NaNs.

In aggregate, the n vectors represent m independent distributions of values.

Your task is to be able to obtain m quantiles at rank r in a single query.
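
(A trivial in-memory illustration of that task, using exact NumPy quantiles on toy sizes; the sketch-based version approximates this for streams too large to hold.)

import numpy as np

n, m, r = 1000, 50, 0.5               # illustrative sizes only
X = np.random.randn(n, m)             # n vectors; column i is the ith dimension's stream
exact = np.quantile(X, r, axis=0)     # m quantiles at rank r, one per dimension
print(exact.shape)                    # (50,)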

****
To do this, using your idea, would require vectorization of the entire sketch and not just the compactors.  The inputs are vectors, and the results of operations such as getQuantile(r), getQuantileUpperBound(r), and getQuantileLowerBound(r) are also vectors.

This sketch will be a large data structure, which leads to more questions ...

  *   Do you anticipate having many of these vectorized sketches operating simultaneously?
  *   Is there any requirement to store and later retrieve this sketch?
  *   Or, the nearly equivalent question: Do you require merging of these sketches (across clusters, for example)?  Which also means serialization and deserialization.

I am concerned that this vector-quantiles sketch would be limited in the sense that it may not be as widely applicable as it could be.

Our experience with real data is that it is ugly, with missing values, NaNs, nulls, etc., which means we would not be able to vectorize the compactor.  Each dimension i would need a separate, independent compactor because the compaction times will vary depending on missing values or NaNs in the data.

Space-wise, I don't think vectorizing the entire sketch would save much over having separate, independent sketches for each dimension, because the internals of the existing sketch are already quite space efficient, leveraging compact arrays, etc.

As a first step I would favor figuring out how to access the NumPy data structure on the C++ side, having individual sketches for each dimension, and doing the iteration that updates the sketches in C++.  This has the advantage of leveraging code that already exists, and it would automatically benefit from any improvements to the sketch code over time.  In addition, it could serve as a prototype for how to integrate other sketches into the NumPy ecosystem.

A fully vectorized sketch would be a separate implementation and would not be able to take advantage of these points.

Lee.

On Wed, May 6, 2020 at 2:47 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Lee,

I don't think there is a problem with the DataSketches library, just that it doesn't support what I am trying to do -- looking in the documentation, it only supports streams of ints or floats, and those situations work fine for me.  Here's what I did:
- began with the KLL test .py file: https://github.com/apache/incubator-datasketches-cpp/blob/master/python/tests/kll_test.py
- replaced line 30 with kll.update(np.ones(10) * randn())  to have a Numpy array of 10 identical values.
- ran the code

This leads to the following error, as expected:
TypeError: update(): incompatible function arguments. The following argument types are supported:
    1. (self: datasketches.kll_floats_sketch, item: float) -> None

Invoked with: <datasketches.kll_floats_sketch object at 0x7f1e128989d0>, array([-1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424,
       -1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424])

It's not coded to support Numpy arrays, so it complains.  What I would ideally like to happen in this scenario is for it to treat each element in the array as a separate stream.  Then, later, when getting a given quantile, it would give 10 values, one for each stream.  I don't see an easy approach to implementing this on the Python side besides a very slow iterative approach, and admittedly my C++ is quite rusty, so I haven't looked into the codebase to see how I might modify things there to support this functionality.

Re: the streaming-quantiles code being easily modified, I believe the only necessary changes would be making the Compactor class a subclass of numpy.ndarray rather than list, and implementing the list-specific methods that are used, like .append().  Then it isn't necessary to loop over the streams, since we can make use of Numpy's broadcasting, which handles the looping in its C++ code, as you mentioned.  I'll work on this and see if it really is as straightforward as it seems.
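
(A toy sketch of the vectorized compaction step being described, purely to illustrate the broadcasting idea; this is not the streaming-quantiles code, it ignores weights and levels, and it folds in the per-dimension independent coin flip that Lee raises elsewhere in this thread.)

import numpy as np

def compact_columns(buf):
    # buf: shape (n, m) -- n buffered values for each of m independent streams, n even
    n, m = buf.shape
    buf = np.sort(buf, axis=0)                 # sort each stream's buffer
    offset = np.random.randint(0, 2, size=m)   # independent even/odd choice per stream
    rows = np.arange(0, n, 2)
    # keep every other value in each column, shifted by that column's offset
    return buf[rows[:, None] + offset, np.arange(m)]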

If you have any advice on how to use DataSketches for my problem, I'm certainly open to that.

Thanks,
Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Wednesday, May 6, 2020 4:37 PM
To: Michael Himes <mh...@knights.ucf.edu>>; dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Cc: Edo Liberty <ed...@gmail.com>>; edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Michael,

Thank you for considering the DataSketches library.   I am adding this thread to our dev@datasketches.apache.org<ma...@datasketches.apache.org> so that our whole team can contribute to finding a solution for you.

WRT the error you experienced, please help us help you by sharing with us what the exact error was.

We are about to release a major upgrade to the DataSketches C++/Python product in the next few weeks.  We have fixed a number of stability issues and bugs, which may solve the problem.  Nonetheless, we want to work with you to get your problem solved.

Updating 1e5 sketches in a system is not a problem in Java or C++.   We have real-time systems today that generate and process over 1e9 sketches every day.  Unfortunately our experience tells us that looping in Python code will be 10 to 100 times slower than Java or C++.  This is because the code would have to switch from Python to C++ for every vector element.

By comparison, the streaming-quantiles code could be easily modified to use Numpy arrays and operate on vectors.

I would like to understand more about what you have in mind that would be "easily modified".

NumPy achieves its speed performance by doing all of the matrix operations in pre-compiled C++ code.  To achieve best performance, we would want to read and loop through the NumPy data structure on the C++ side leveraging the C++ DataSketches library directly.  I am not sure what would be involved to actually accomplish that.

But first we need to get your Python + NumPy code working correctly with our library so we can find out what its actual performance is.

Cheers,

Lee.





On Wed, May 6, 2020 at 12:10 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Edo, Lee,

Thanks for the prompt response.  I looked at the datasketches library, and while it seems to have a lot more features, it looks like it'll be a lot more difficult to get it to work for my desired use case.

My problem is that I need quantiles for each element of a vector (length on the order of 1e4 -- 1e5), for some finite stream of vectors (on the order of 1e6 -- 1e8).  I tried using datasketches's KLL with Numpy arrays, but it throws an error, so it doesn't seem like datasketches handles this situation currently.

To use datasketches, I think I would need to instantiate 1 object per vector element, and I suspect this will slow things down considerably due to iterating over the objects when each vector is processed.  By comparison, the streaming-quantiles code could be easily modified to use Numpy arrays and operate on vectors.  I ran a few unit tests on both codes and found equivalent behavior, as expected.

Do you have any recommendation(s) for this situation?  Are there known limitations of the streaming-quantiles code that would cause issues for my use case?  Are the other methods offered in datasketches 'better' than the KLL implemented in streaming-quantiles?  I'm quite out of my area of expertise, so I appreciate any advice you can offer, and I will of course acknowledge it in the publication.

Best,
Michael

________________________________
From: Edo Liberty <ed...@gmail.com>>
Sent: Tuesday, May 5, 2020 8:09 PM
To: Lee Rhodes <lr...@verizonmedia.com>>; Michael Himes <mh...@knights.ucf.edu>>
Cc: edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

+Lee

Hi Michael, Thanks for reaching out.
While you can certainly do that, I recommend using the Python-bound datasketches library. It will be more robust, faster, and more bug-free than my code :)

On Tue, May 5, 2020 at 14:11 Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Edo,

I'm currently working on a Python package for machine-learning-accelerated exoplanet modeling.  It is free and open source (see here if you're curious: https://github.com/exosports/HOMER), and it's meant purely for reproducible academic research.

I'm adding some new features to the software, and one of them requires computing quantiles for a data set that cannot fit into memory.  After searching around for different methods to do this, your KLL method seemed to be a good option in terms of speed and space requirements.

Rather than reinvent the wheel and code my own implementation of the method from scratch, I was wondering if you'd be willing to allow me to use your code?  I don't see a license, so I wanted to make sure you're okay with this.  I could implement it as a submodule within my repo, or I could only include the kll.py file and add some additional comments pointing to your repo and such, whichever you prefer.

Best,
Michael
--
From my cell phone.

Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Posted by Michael Himes <mh...@knights.ucf.edu>.
Alright, I submitted the pull request.
https://github.com/apache/incubator-datasketches-cpp/pull/161

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>
Sent: Friday, June 26, 2020 3:26 AM
To: dev@datasketches.apache.org <de...@datasketches.apache.org>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Yeah, worth fixing. I guess that's fine since it'll check the dimensionality right after in case you attempt to feed it 3+ dimensions.

  jon

On Thu, Jun 25, 2020 at 1:15 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Looked into this a little bit more.  I tried flipping the axes of my Python array, but then the code crashes because it tries to access memory that is out of bounds.

Assuming that I am not just totally botching the usage, I made the following change
https://github.com/mdhimes/incubator-datasketches-cpp/blob/master/python/src/vector_of_kll.cpp#L191
and it works as expected now.  I can create a pull request if needed.

Michael
________________________________
From: Michael Himes <mh...@knights.ucf.edu>>
Sent: Thursday, June 25, 2020 2:22 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Hi Jon,

Just got around to updating my project to point to the current 2.0 version of Datasketches, and I'm hitting an error.

Whenever I pass in multi-dimensional arrays to update(), it complains about the dimensions.  The Numpy array has, e.g., 4 rows of 600 elements (array.shape gives (4, 600)), but I get a ValueError saying that the "input data must have rows with 600 elements. Found: 4".

Looking in the code here https://github.com/apache/incubator-datasketches-cpp/blob/master/python/src/vector_of_kll.cpp#L188
it seems to be expecting an array of shape (N_elements, N_rows), rather than (N_rows, N_elements).  Shouldn't the comparison be with items.shape(1) rather than items.shape(0)?  Or am I thinking about this backwards?
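
(For concreteness, assuming the layout described above:)

import numpy as np

data = np.random.randn(4, 600)   # 4 input vectors, 600 values each (one per sketch)
print(data.shape)                # (4, 600): shape[0] = rows, shape[1] = elements per row
# so a wrapper holding 600 sketches would need to validate against shape[1], as suggested above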

Thanks,
Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Monday, June 8, 2020 3:14 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Got distracted from this by a series of bugs both in our C++ release candidate as well as in another repo. Anyway, finally finished things and created a PR to push this into master. Feel free to comment: https://github.com/apache/incubator-datasketches-cpp/pull/156

  jon

On Wed, May 27, 2020 at 6:25 AM Michael Himes <mh...@knights.ucf.edu>> wrote:
Sounds good to me.

I've been thinking more about merging, and I think selectively merging individual dimensions would probably be unnecessary (if you have 1000 streams, why selectively merge 2 of those?).  But, one thing that I think might be a good idea to implement is to be able to merge all of the sketches into 1 sketch.  I don't need this for my work, but I can imagine an application where there are N streams of the same type of data and this would be useful.  Something to think about in the future.

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Tuesday, May 26, 2020 8:13 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Oh, and for merging, I'll make sure that both objects have the same number of dimensions and then merge things in. Should be fairly straightforward. Not going to support selectively merging individual dimensions, at least for now.
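
(A rough Python rendering of that merge, for illustration; the real logic would sit in the C++ wrapper, and the names here are placeholders.)

def merge_vectors(mine, other):
    # mine, other: plain lists of per-dimension KLL sketches
    if len(mine) != len(other):
        raise ValueError("both objects must have the same number of dimensions")
    for dst, src in zip(mine, other):
        dst.merge(src)        # per-dimension KLL merge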


  jon

On Tue, May 26, 2020 at 4:52 PM Jon Malkin <jo...@gmail.com>> wrote:
Thanks for that!

We discussed things a bit on the ASF slack dev channel (datasketches-dev) and we'll go with vector_of_kll_sketches as the c++ object name. Probably something similar in python. So gotta do that, and then clean up unit test names. But it's in pretty good shape so far.

  jon

On Tue, May 26, 2020 at 7:57 AM Michael Himes <mh...@knights.ucf.edu>> wrote:
That's a great motto to code by!

I adapted the existing unit tests for kll_sketch to work for the new kll_sketches class, and everything seems to be working as intended.  Some things are not implemented -- merging, and the normalized_rank_error method (note that there is the get_normalized_rank_error static method) -- and are therefore not tested.  Once they are implemented into the class, then those tests can be added.

I've submitted a pull request, let me know if there are any other tests you'd like before it's considered tested & working.

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Tuesday, May 26, 2020 1:53 AM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

I think it now works for quantiles, rank, pmf, and cdf.

This exercise is a good example of why my colleague operates by the motto that if it isn't tested, it's broken. In very much related news, we need unit tests for this thing, in either C++ or python (probably the latter unless we move it into the core C++ part of the repo).

  jon

On Mon, May 25, 2020 at 2:06 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Ah gosh, that was silly on my part.

So, I ran the previous code without that silly mistake, then called kll.get_quantiles(0.5) and it threw this error:

TypeError: get_quantiles(): incompatible function arguments. The following argument types are supported:
    1. (self: datasketches.kll_floatarray_sketches, fractions: List[float], isk: numpy.ndarray[int32] = -1) -> array

Invoked with: <datasketches.kll_floatarray_sketches object at 0x7f610ce7de30>, 0.5

I also tried kll.get_quantiles([0.5]) and the Numpy array equivalent, and it throws this error:

ValueError: array has incorrect number of dimensions: 0; expected 1

This error happens even when I do kll.get_quantiles([0.5, 0.7]) or the Numpy array equivalent, even though it has 1 dimension, not 0.

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Monday, May 25, 2020 4:53 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

That's the range() command complaining -- 1e6 is a float, but range wants an int. It worked if I instead changed the line to
for i in range(int(1e6)):

  jon

On Mon, May 25, 2020 at 1:36 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Jon,

Just got around to testing it out.  Maybe I am doing something wrong here, but I can't get the code to work correctly.  Here's the code:

import numpy as np
from datasketches import kll_floatarray_sketches
k = 160
d = 3
kll = kll_floatarray_sketches(k, d)
for i in range(1e6):
  kll.update(np.random.randn(d))

And here's the error:

TypeError: 'float' object cannot be interpreted as an integer

Seems like the inputs have changed, but the inputs in the code look pretty similar.  Can you point out what I'm doing wrong here?

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Friday, May 22, 2020 6:21 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Hi Michael,

My default is to treat an input vector x as a column vector -- the generic quadratic form x^T A x assumes that, for instance -- but it might be an engineering thing. I'm following your approach for now, and eventually we can debate whether to transpose the matrix if one dimension matches the number of sketches in the object but not the expected one.

Anyway, I looked more at the docs and see them using unchecked references (after doing a bounds check) so I switched to that, and then I added in a check for c-style vs fortran-style indexing so that I believe it'll have the inner loop over the native dimension. In theory it'll walk linearly through the matrix. That or I got it exactly backwards and am thrashing some cache level, one of the two :D

If you have some time, please check out the branch and play with it for a bit to ensure it's still behaving as you expect. Then we can figure out some relevant unit tests.

  jon



On Fri, May 22, 2020 at 7:06 AM Michael Himes <mh...@knights.ucf.edu>> wrote:
Jon,

Those changes sound great, as long as the data is being accessed correctly. The pybind docs warn about accessing data through the array_t object since it's not guaranteed to be contiguous in memory.  Typically, they demonstrate accessing it through the buffer, which I followed.  But if this is an unnecessary step, then great.

As for the 2D case, here is my line of thinking.  For 1D, we have a single row with d values.  So for 2D, we'd have n rows with d values, (n x d).  I believe that is how I coded it, but it's possible I flipped the dimensions.

Michael

________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Thursday, May 21, 2020 7:17 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Michael,

I've restructured the object to be an actual C++ object with proper methods. And then I've gotten rid of all the casts to buffer in favor of just using the py::array_t<> that's passed in. That removes casting everything to double, and allows for range checks. Now an attempt to access sketch 7 in a 5-d array doesn't just segfault :)

Looking at pybind docs a bit more, it seems there are no hard guarantees on data layout in memory with numpy arrays -- if you transpose one, walking through with a pointer will return items in the wrong order. So update() ends up using items.at() instead (more on that in a moment). The whole thing is probably also copying values around more than necessary. Anyway, we can look at ways to optimize such things eventually, but for now I'm working on ensuring correctness and at least somewhat graceful failure.
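
(The layout issue in a nutshell: a transposed NumPy array is just a strided view, so a raw pointer walk sees a different order.)

import numpy as np

a = np.arange(6, dtype=np.float32).reshape(2, 3)   # C (row-major) order by default
print(a.flags['C_CONTIGUOUS'])    # True
print(a.T.flags['C_CONTIGUOUS'])  # False: the transpose only swaps strides
# hence bounds-checked, index-based access rather than raw pointer arithmetic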

Anyway, item input order. If we have 1-d input, we implicitly assume we want d updates, one for each dimension in the object. It seems like the default for numpy is row-major order, which makes sense given C beneath the hood. But for inputting n points at a time, do you expect the matrix to be (d x n) or (n x d)?

  jon

On Tue, May 19, 2020, 5:20 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Re: the template type A, I set that for the Python array data type.  A Python float is 64 bits, so that is a C++ double.  I thought it was necessary to set the py::array_t data type since I think it's a template, but I could be mistaken.

Michael

________________________________
From: leerho <le...@gmail.com>>
Sent: Tuesday, May 19, 2020 7:46 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Excellent work!

On Tue, May 19, 2020 at 4:04 PM Jon Malkin <jo...@gmail.com>> wrote:
I also used k=160, so in this case we matched nicely. And the batches of 2^5 or 2^7 values you were testing are exactly what I meant when referring to batched inputs. So that's good news.

I'll take a more careful look through the code -- there was something with update using arrays of templated type A which was always cast to double, for instance. But this is certainly promising.

  jon

On Tue, May 19, 2020 at 3:32 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Great tests (especially with the ordering), Jon!

I did some scaling tests for dimensionality (1, 10, and 100), and this is where I think the Numpy version shows its benefits.  I performed a test similar to your setup:
- each sketch has k = 160 (unsure what you used for this value, if it matters)
- 2^25 draws from a normalized Gaussian distribution (numpy.random.normal)
- get_quantiles(0.5)

d=1    -- 84 s (this is the 123 s case you ran)
d=10   -- 88 s
d=100  --  294 s
d=1000 -- 2298 s (did this one for fun, but there is a lot of variability in runtime)

Note that I did not use a single-value method, just the Numpy version.  Also, I checked the compute cost of the Python loop, and it's about 1 second, so most of that ~80 seconds is the communication between Python and C++.  The scaling relation looks to be better than linear, but there need to be a few more tests here to really determine that.

But, as Lee pointed out, there is non-negligible overhead from crossing the bridge between Python and C++.  It's small, but when doing it 2^25 times it adds up.  The Numpy implementation allows you to cross that bridge much less often, albeit at the cost of some extra time programming that part.  If I set up a queue that holds 2^5 values and then updates the sketches with the whole batch, it's quite a bit better.  Here are the results for the same dimensions as before:

d=1   -- 8 s
d=10  -- 31 s
d=100 -- 257 s

So, even with a small queue of 32 values, we see that a single sketch using kll_sketches is faster than a kll_sketch by a factor of 2-3.  And with the batch set to 2^7 values (this is how I use it in my project):
d=1   -- 4.2 s
d=10  -- 27 s
d=100 -- 251 s

The speed gain doesn't seem to scale with dimensionality, but I think that has more to do with the compute overhead of generating the data, since Numpy tends to be faster when working in 1D than in multiple dimensions.  But we can see that it's possible to get runtimes much closer to C++ runtimes than would otherwise be expected.
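
(A rough sketch of the batching described above, using the class and method names as they stood at this point in the thread -- they were later renamed -- and assuming update() accepts a 2-D array of shape (batch, d).)

import numpy as np
from datasketches import kll_floatarray_sketches

k, d, batch = 160, 10, 2**7
kll = kll_floatarray_sketches(k, d)

queue = np.empty((batch, d))
filled = 0
for _ in range(2**20):                    # stand-in for the real data stream
    queue[filled] = np.random.randn(d)    # one incoming vector of d values
    filled += 1
    if filled == batch:                   # cross the Python/C++ boundary once per batch
        kll.update(queue)
        filled = 0
if filled:
    kll.update(queue[:filled])            # flush any partial batch

medians = kll.get_quantiles([0.5])        # one estimate per dimension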

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Tuesday, May 19, 2020 4:58 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Well, one thought was maybe we could always use the vectorized kll in python and make it (relatively) easy to have it work with only 1 dimension. It looks like there's still a non-trivial performance hit from that. But wow.. I realized I could try something simple like reversing the declaration order of single-update vs vector-update in the wrapper class. And that dropped it to 35s!

With that, it may be worth exploring a unified wrapper that handles single items or vectors.

  jon

On Tue, May 19, 2020 at 1:52 PM leerho <le...@gmail.com>> wrote:
We had a similar issue in Java trying to use JNI to access C code.  Every transition across the "boundary" between Java and C took from 10 to 100 microseconds.  This made the JNI option pretty useless from our standpoint.

I don't know python that well, but I could well imagine that there may be a similar issue here in moving data between Python and C++.

That being said, compared to brute-force computation of these types of queries in Python, using even these (what we consider slow-performing) sketches from Python may still be a huge win.

Lee.



On Tue, May 19, 2020 at 1:28 PM Jon Malkin <jo...@gmail.com>> wrote:
I tried comparing the performance of the existing floats sketch vs the new thing with a single dimension. And then I made a second update method that handles a single item rather than creating an array of length 1 each time. Otherwise, the scripts were as identical as possible. I fed in 2^25 gaussian-distributed values and queried for the median to force some computation on the sketch. I think get_quantile(0.5) vs get_quantiles(0.5)[0][0] was the only difference.

Existing kll_floats_sketch: 31s
kll_floatarray_sketches: 123s
with single-item update: 80s

Same test in c++: 1.7s  (I can get it to 1.4s but that's using a worse RNG so this seemed more fair)

I didn't try anything with trying to batch updates, even though in theory the new object can support that. This was more a test to see the performance impact of using it for all kll sketches.

At some level, if you're already ok taking the speed hit for python vs C++ then maybe it doesn't matter. But >2x still seems significant to me.

  jon

On Thu, May 14, 2020 at 6:54 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Great, I'll be submitting the pull request shortly.  The codebase I'm working with doesn't have any of the changes made in the past week or so; hopefully that isn't too much of a hassle to merge.

As an aside, my employer encourages us to contribute code to libraries like this, so I'm happy to work on additional features for the Python interface as needed.

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Thursday, May 14, 2020 6:56 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

We've been polishing things up for a release, so that was one of several things that we fixed over the last several days. Thank you for finding it!

Anyway, if you're generally happy with the state of things (and are allowed to under any employment terms), I'd encourage you to create pull request to merge your changes into the main repo. It doesn't need to be perfect as we can always make changes as part of the PR review or post-merge.

Thanks,
  jon


On Mon, May 11, 2020 at 2:25 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Thanks for taking a look, Jon.

I pushed an update that address 2 & 4.

#3 is actually something I had a question about. I've tested passing numpy.nan into the update function, and it doesn't appear to break anything (min, max, etc all still work correctly).  However, the reported number of items per sketch counts the nan entries.  Is this the expected behavior, or should the get_n() method return a number that does not count the nans it has seen?  I expected the latter, so I'm worried that numpy's nan is being treated differently.

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Monday, May 11, 2020 4:32 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

I didn't look in super close detail, but the code overall looks pretty good. Comments are below.

Note that not all of these necessarily need changes or replies. I'm just trying to document things we'll want to think about for keeping the library general-purpose (and we can always make changes after merging, of course).

1. I worry the name kll_sketches is confusingly similar to kll_sketch. Maybe vector_kll_sketches? But if there's a way to extend KLL in the future to operate on an entire vector at a time (vs treating each dimension independently) that'd become confusing. I think an inherently vectorized version would be a very different beast, but I always worry I'm not being imaginative enough. If merging into the Apache codebase, I'd probably wait to see what the file looks like with the renaming before a final decision on moving to its own file.

2. What happens if the input to update() has >2 dimensions? If that'd be invalid, we should explicitly check and complain. If it'll Do The Right Thing by operating on the first 2 dimensions (meaning correct indices) that's fine, but otherwise should probably complain.

3. Can this handle sparse input vectors? Not sure how important that is in general, even if your project doesn't require it. kll_sketch will ignore NaNs, so those appearing would mean the number of items per sketch can already differ.

4. I'd probably eat the very slightly increased space and go with 32 bits for the number of dimensions (aka number of sketches). If trying to look at a distribution of values for some machine learning application, it'd be easy to overflow 65k dimensions for some tasks.

5. I imagine you've realized that it's easiest to do unit tests from python in this case. That's another advantage of having this live in the wrapper.

6. Finally, that assert issue is already obsolete :). Asserts were converted to if/throw exceptions late last week. It'll be flagged as a conflict in merging, so no worries for now.

Looking good at this point. And as I said, not all of these need changes or comments from you.

  jon

On Mon, May 11, 2020 at 7:09 AM Michael Himes <mh...@knights.ucf.edu>> wrote:
Understood, I went ahead and moved the new class to the kll_wrapper.cpp file -- I'll leave it to you to decide if it's better as its own file.

Also, while gcc 7.4.0 compiles the code without issue, using gcc 7.5.0 throws errors regarding the assert calls in kll_sketch_impl.hpp.  I added an include of assert.h there and then it compiled without issue.  It's possible that other compilers will also complain about that, so maybe this is a good update to the main branch.

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Sunday, May 10, 2020 10:47 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

My only comment without having looked at actual code is that the new class would be more appropriate in the python wrapper. Maybe even drop it in as its own file, as that would decrease recompile time a bit when debugging (that's pybind's suggestion, anyway). Probably not a huge difference with how light these wrappers are.

If this is something that becomes widely used, to where we look at pushing it into the base library, we'd look at whether we could share any data across sketches. But we're far from that point currently. It'd be nice to get to the point where we need to consider that.

  jon

On Sun, May 10, 2020, 7:33 PM leerho <le...@gmail.com>> wrote:
Michael,  this has been a great interchange and certainly will allow us to move forward more quickly.

Thank you for working on this on a Mother's Day Sunday!

I'm sure Alex and Jon may have more questions, when they get a chance to look at it starting tomorrow.

Cheers, and be safe and well!

Lee.

On Sun, May 10, 2020 at 6:25 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Re: testing, so far I've just done glorified unit tests for uniform and normal distributions of varying sizes.  I plan to do some timing tests vs the existing single-sketch Python class to see how it compares for 1, 10, and 100 streams.

1. That makes sense.  One option to allow full Numpy compatibility but without requiring a Python user to use Numpy would be to return everything as lists, rather than Numpy arrays.  Numpy users could then convert those lists into arrays, and non-Numpy users would be unaffected (aside from needing the pybind11/numpy.h header).  Alternatively, some flag could be set when instantiating the object that would control whether things are returned as lists or arrays, though this still requires the numpy.h header file.

2. I didn't change the kll_sketch code; I only defined a new (wrapper) class called kll_sketches, which spawns a user-specified number of sketches.  Each of those sketches is a kll_sketch object and uses all of the existing code for that.  For fast execution in Python, the parallel sketches must be spawned in C++, but the existing Python object could only spawn a single sketch since it wraps the kll_sketch class.  Perhaps the kll_sketches class would be better placed in the python/src/kll_wrapper.cpp file?  I suppose you wouldn't need this class if you weren't using Python.

3. Yes, SerDe is very straight-forward here.  I've marked some stuff as todo's, and that is one of them -- the plan is to do like you described and call the relevant kll_sketch method on each of the sketches and return that to Python in a sensible format.  For deserialization, it would just iterate through them and load them into the kll_sketches object.  I don't require it for my project, so I didn't bother to wrap that yet -- I'll take a look sometime this week after I finish my work for the day, shouldn't take long to do.

4. That makes sense.  Does using Numpy complicate that at all?  My thought is that since under the hood everything is using the existing kll_sketch class, it would have full compatibility with the rest of the library (once SerDe is added in).

Michael
________________________________
From: leerho <le...@gmail.com>>
Sent: Sunday, May 10, 2020 8:42 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Thanks for the link to your code.  My colleagues, Jon and Alex, will take a closer look this next week.  They wrote this code so they are much closer to it than I.

What you have done so far makes sense for you as you want to get this working in the NumPy environment as quickly as possible.  As soon as we start thinking about incorporating this into our library other concerns become important.

1. Adding API calls is the recommended way to add functionality (like NumPy) to a library.  We cannot change API calls in a way that is only useful with NumPy, because it would seriously impact other users of the library that don't need NumPy.  If both sets of calls cannot simultaneously exist in the same sketch API, then we need to consider other alternatives.

2.  Based on our previous discussions, I didn't envision that you would have to change the kll_sketch code itself other than perhaps a "wrapper" class that enables vectorized input to a vector of sketches and a vectorized get result that creates a vector result from a vector of sketches.  This would isolate the changes you need for NumPy from the sketch itself.  This is also much easier to support, maintain and debug.

3. If you don't change the internals of the sketch then SerDe becomes pretty straightforward. I don't know if you need a single serialization that represents a full vector of sketches, but if you do, then I would just iterate over the individual serdes and figure out how to package it (a rough sketch of this idea follows after point 4). I really don't think you want to have to rewrite this low-level stuff.

4. Binary compatibility is critically important for us and I think will be important for you as well.  There are two dimensions of binary compatibility: history and language.  This means that a kll sketch serialized from Java can be successfully read by C++ and vice versa.  Similarly, a kll sketch serialized today will be able to be read many years from now.  Another aspect of this would be being able to collect, say, 100 sketches that were not created using the NumPy version and put them together in a NumPy vector, and vice versa.
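
(Regarding point 3, a rough sketch of the "iterate over the individual serdes and package them" idea, assuming the serialize()/deserialize() calls exposed by the Python binding; the byte layout here is made up for illustration and is not a DataSketches format -- a real one would need versioning and cross-language agreement.)

import struct
from datasketches import kll_floats_sketch

def serialize_vector(sketches):
    # blob = [count][len_0][bytes_0][len_1][bytes_1]...
    parts = [struct.pack('<I', len(sketches))]
    for sk in sketches:
        raw = sk.serialize()
        parts.append(struct.pack('<I', len(raw)))
        parts.append(raw)
    return b''.join(parts)

def deserialize_vector(blob):
    (count,) = struct.unpack_from('<I', blob, 0)
    off, sketches = 4, []
    for _ in range(count):
        (n,) = struct.unpack_from('<I', blob, off)
        off += 4
        sketches.append(kll_floats_sketch.deserialize(blob[off:off + n]))
        off += n
    return sketches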

I hope all of this makes sense to you.

Cheers,

Lee.



On Sun, May 10, 2020 at 4:21 PM leerho <le...@gmail.com>> wrote:
Michael,
This is great!  What testing have you been able to do so far?


On Sun, May 10, 2020 at 3:31 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Lee,

Thanks for all of that information, it's quite helpful to get a better understanding of things.

I've put the code on Github if you'd like to take a look: https://github.com/mdhimes/incubator-datasketches-cpp

Changes are
- new class in kll/include/kll_sketch.hpp, w/ associated constructor in kll/include/kll_sketch_impl.hpp.  This class spawns a specified number of sketches.
- new Python interface functions in python/src/kll_wrapper.cpp

The only new dependency introduced is the pybind11/numpy.h header file.  The new Numpy-compatible Python classes retain identical functionality to the existing classes (with minor changes to method names, e.g., get_min_value --> get_min_values), except that I have not yet implemented merging or (de)serialization.  These would be straight-forward to implement, if needed.

Re: characterization tests, I'll take a look at those tests you linked to and see about running them, time and compute resources permitting.

Michael
________________________________
From: leerho <le...@gmail.com>>
Sent: Sunday, May 10, 2020 5:32 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Michael,

Is there a place on GitHub somewhere where I could look at your code so far?  The reason I ask, is before you do a PR, we would like to determine where a contribution such as this should be placed.

Our library is split up among different repositories, determined by language and dependencies.  This keeps the user downloads smaller and more focused.   We have two library repos for the core sketch algorithms, one for Java and one for C++/Python, where the dependencies are very lean, which simplifies integration into other systems.  We have separate repos for adaptors, which depend on one of the core repos. On the Java side, we have separate repos for adaptors for Apache Hive and Apache Pig, as the dependencies for each of these are quite large.  For C++, we have a dedicated repo for the adaptors for PostgreSQL.

Some of our adaptors are hosted with the target system.  For example, our Druid adaptors were contributed directly into Apache Druid.

I assume your code has dependencies on Python, NumPy and DataSketches-cpp. It is not clear to me at the moment whether we should create a separate repo for this or have a separate group of directories in our cpp repo.

****
We have a separate repo for our characterization code, which is not formally "released" as an Apache release.  It exists because we want others to be able to reproduce (or challenge) our claims of speed performance or accuracy.  It is the one repo where we have all languages and many different dependencies.  The coding style is not as rigorous or as well documented as our repos that do have formal releases.

Characterization testing is distinctly different from Unit Tests, which basically check all the main code paths and make sure that the program works as it should.  The key metric there is code coverage, and Unit Tests should be fast since they are run on every check-in of new code.  Characterization is also different from Integration Testing, which tests how well the code works when integrated into larger systems.

Characterization tests are unique to our kind of library. Because our algorithms are probabilistic in nature, in order to verify accuracy or speed performance we need to run many thousands of trials to eliminate statistical noise in the results.  And when the data is large, this can take a long time.  You can peruse our website for many examples as all the plots result from various characterization studies.  What appears on the website is but a small fraction of all the testing we have done.

There are no "standard" tests as every sketch is different so we have to decide what is important to measure for a particular sketch, but the basic groups are speed and accuracy.

For speed there are many possible measurements, but the basic ones are update speed, merge speed, serialization / deserialization speed, and get-estimate or get-result speed.

For accuracy we want to validate that the sketch is performing within the bounds of the theoretical error distribution.  We want to measure this accuracy in the context of a stand-alone, purely streaming sketch and also in the context of merging many sketches together.

We also try to do these same tests comparing the results against other alternatives users might have.  We have performed these same characterizations on other publicly available sketches as well as against traditional, brute-force approaches to solving the same problem.

For the solution you have developed, we would depend on you to decide what properties would be most important to characterize for users of this solution.  It should be very similar to what you would write in a paper describing this solution: you want to convince the reader that it is very useful, and why.

Since the first sketch you have leveraged is the KLL quantiles sketch, I would think you would want some characterizations similar to what we did for our studies <https://datasketches.apache.org/docs/Quantiles/KLLSketch.html> comparing our older quantiles sketch and the KLL sketch.

****
For the Java characterization tests, we have "standardized" on having small configuration files which define the key parameters of the test.  These are simple text files <https://github.com/apache/incubator-datasketches-characterization/tree/master/src/main/resources> of key-value pairs.  We don't have any centralized definition of these pairs, just that they are human readable and intelligible.  They are different for each type of sketch.

For the C++ tests, we don't have a collection of config files yet (this is one of our TODOs), but the same kind of parameters are set in the code itself.

We will likely want to set up a separate directory for your characterization tests.

I hope you find this helpful.

Cheers,

Lee.

On Sun, May 10, 2020 at 10:05 AM Michael Himes <mh...@knights.ucf.edu>> wrote:
The code is in a good state now.  It can take individual values, lists, or Numpy arrays as input, and it returns back Numpy arrays.  There are some additional features, like being able to specify which sketches the user wants to, e.g., get quantiles for.

But, I have only done minor testing with uniform and normal distributions.  I'd like to put it through more extensive testing (and some documentation) before releasing it, and it sounds like your characterization tests are the way to go -- it's not science if it's not reproducible!  Is there a standard set of tests for this purpose?  If not, are there standard tests that have been used for the existing codebase?

Michael
________________________________
From: leerho <le...@gmail.com>>
Sent: Saturday, May 9, 2020 7:21 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

This is great.  The first step is to get your project working!  Once you think you are ready, it would be really useful if you could do some characterization testing in the NumPy environment. Characterization tests are what we run to fully understand how a sketch performs over a range of parameters and using thousands to millions of trials.  You can see some of the accuracy and speed performance plots of various sketches on our website.  Sometimes these can take hours to run.  We typically use synthetic data to drive our characterization tests to make them reproducible.

Real data can also be used and one comparison test I would recommend is comparing how long it takes to get approximate results using sketches versus how long it would take to get exact results using brute force methods.  The bigger the data set is the better :)
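
As a very rough sketch of that kind of comparison (the k value, stream size, and timing harness below are arbitrary choices for illustration, not a prescribed benchmark), something like the following would contrast the sketch against NumPy's exact quantile:

import time
import numpy as np
from datasketches import kll_floats_sketch

n = 2**25                               # stream size; bigger is better, as noted above
data = np.random.randn(n).astype(np.float32)

# Approximate: stream the data through a KLL sketch, then query it.
t0 = time.time()
sk = kll_floats_sketch(160)             # k=160 is an arbitrary accuracy/size choice
for x in data:
    sk.update(float(x))                 # Python-level loop; this crossing cost is part of what gets measured
approx = sk.get_quantile(0.5)
t_sketch = time.time() - t0

# Exact (brute force): hold the whole stream in memory and compute the quantile directly.
t0 = time.time()
exact = np.quantile(data, 0.5)
t_exact = time.time() - t0

print(f"sketch median {approx:.4f} in {t_sketch:.1f}s; exact {exact:.4f} in {t_exact:.1f}s")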

We don't have much experience with NumPy so this will be a new environment for us.  But before you get too deep into this please get us involved.  We have been characterizing these streaming algorithms for a number of years, and would like to help you.

Cheers,

Lee.

On Sat, May 9, 2020 at 2:18 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
I'm not quite sure what being a committer entails, but yeah I'm happy to contribute.  I can't commit a lot of time to working on it, but with how things went for KLL I don't think it will take a lot of time for the other sketches if they are formatted in a similar manner.  Getting this library integrated into numpy/scipy would be awesome, I'm sure I could get some others in my field to begin using it.

Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Saturday, May 9, 2020 5:06 PM
To: Michael Himes <mh...@knights.ucf.edu>>; dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

This is just awesome!   Would you be interested in becoming a committer on our project?  It is not automatic, but we could work with you to bring you up to speed on the other sketches in the library.  If you could help us integrate DataSketches into NumPy and possibly SciPy (not sure if this is necessary) it would be a very significant contribution and we would definitely want you to be part of our community!

Thanks,

Lee.

On Sat, May 9, 2020 at 1:41 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Lee,

Thanks for the notice, I went ahead and subscribed to the list.

As for Jon's email, this is actually what I have currently implemented!  Once I finish ironing out a couple improvements, I'm going to move some code around to follow the existing coding style, put it on Github, and submit a pull request.

Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Saturday, May 9, 2020 4:22 PM
To: Michael Himes <mh...@knights.ucf.edu>>
Subject: Fwd: Permission to use KLL streaming-quantiles code in free open-source academic software

Hi Michael,
I don't think you saw this email as I doubt you are subscribed to our dev@datasketches.apache.org<ma...@datasketches.apache.org> email list.

We would like to have you as part of our larger community, as others might also have suggestions on how to move your project forward.
You can subscribe by sending an empty email to dev-subscribe@datasketches.apache.org<ma...@datasketches.apache.org>.

Lee.

---------- Forwarded message ---------
From: Jon Malkin <jo...@gmail.com>>
Date: Thu, May 7, 2020 at 4:11 PM
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software
To: <de...@datasketches.apache.org>>
Cc: Lee Rhodes <lr...@verizonmedia.com>>, Edo Liberty <ed...@gmail.com>>, edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>


We're using pybind11 to get a C++ interface with python (vs raw C). The wrappers themselves are quite thin, but they do have examples of calling functions defined in the wrapper as opposed to only the sketch object.

I believe the easiest way to do this will be to define a pretty simple C++ object and create a pybind wrapper for it.  That object would contain a std::vector<kll_sketch>.  Then you'd define an update method for your custom object that iterates through a numpy array and calls update() on the appropriate sketch. You'd also want to define something similar for get_quantile() or whatever other methods you need that iterates through that vector of sketches and returns the result in a numpy array.

That's a pretty lightweight object. And then you'd use a similar thin pybind wrapper around it to make it play nicely with python. Since our C++ library is just templates, you'd end up with a free-standing library, with no requirement that the base datasketches library be involved.
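
As a pure-Python analogue of the object described above (the class and method names here are only placeholders; the real thing would be the C++ class behind the pybind wrapper), the shape of it would be roughly:

import numpy as np
from datasketches import kll_floats_sketch

class VectorOfKllSketches:              # placeholder name for the proposed wrapper object
    def __init__(self, k, d):
        self.sketches = [kll_floats_sketch(k) for _ in range(d)]

    def update(self, values):
        # values: 1-D numpy array of length d; element i feeds sketch i
        for sk, v in zip(self.sketches, values):
            sk.update(float(v))

    def get_quantiles(self, fractions):
        # returns an array of shape (d, len(fractions))
        return np.array([[sk.get_quantile(f) for f in fractions]
                         for sk in self.sketches])

The reason to build this in C++ rather than in Python is the per-element boundary crossing discussed earlier in the thread: the loops above are exactly what get pushed down into the compiled wrapper.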

  jon

On Thu, May 7, 2020 at 1:08 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
I would be happy to share whatever I come up with (if anything).  The lack of a Numpy/Scipy implementation is what led me to the DataSketches library, it would be very useful to myself and others if it were a part of Numpy/Scipy.

For what it's worth, passing in a Numpy array and manipulating it from the C++ side is quite easy.  On the other hand, figuring out how to spawn m sketches and pass the values along to them looks like it'll be more challenging; there is a lot of code here, and it'll take some time for me to familiarize myself with it.

Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Thursday, May 7, 2020 12:00 PM
To: Michael Himes <mh...@knights.ucf.edu>>
Cc: Edo Liberty <ed...@gmail.com>>; dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>; edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

If you do figure out how to do this, it would be great if you could share it with us.  We would like to extend it to other sketches and submit it as added functionality to NumPy.  I have been looking at the NumPy and SciPy libraries and have not found anything close to what we have.

Lee.


On Thu, May 7, 2020 at 7:08 AM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Lee, Jon,

Thanks for the information.  I tried to vectorize things this morning and ran into that exact problem -- since the offsets can differ, it leads to slices of different lengths, which wouldn't be possible to store as a single Numpy array.

Lee, your understanding of my problem is spot on.  n vectors of size m, where all m elements of each vector are a float (no NaNs or missing values).  I am interested in quantiles at rank r for each of the m streams.  Only 1 sketch will operate simultaneously, saving/loading the sketch is not required (though it would be a nice feature), and sketches would not need to be merged (no serialization/deserialization).

Not surprisingly, it looks like your original suggestion of handling this on the C++ side is the way to go.  Once I have time to dive into the code, my plan is to write something that implements what you described in the earlier email.

Thanks,
Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Wednesday, May 6, 2020 10:43 PM
To: Michael Himes <mh...@knights.ucf.edu>>
Cc: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>; Edo Liberty <ed...@gmail.com>>; edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>

Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Michael,

One of my colleagues, Jon Malkin, pointed out that the vector-KLL will not work for another reason: for each dimension, choosing whether to delete the odd or even values in the compactor must be random and independent of the other dimensions.  Otherwise you might get unwanted correlation effects between the dimensions.

This is another argument that you should have independent compactors for each dimension.  So you might as well stick with individual sketches for each dimension.
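
A small NumPy illustration of that point (the buffer layout and names are invented for illustration): if a compactor level were held as one column per dimension, the even/odd choice would have to be drawn per column rather than once for the whole level:

import numpy as np

rng = np.random.default_rng()
level = np.sort(rng.normal(size=(8, 3)), axis=0)        # 8 items in each of d=3 dimensions

shared = rng.integers(2)                                 # one coin for all dimensions -> correlated survivors
survivors_shared = level[shared::2, :]

offsets = rng.integers(2, size=level.shape[1])           # independent coin per dimension
survivors_indep = np.stack(
    [level[offsets[j]::2, j] for j in range(level.shape[1])], axis=1)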

Lee.

On Wed, May 6, 2020 at 4:39 PM Lee Rhodes <lr...@verizonmedia.com>> wrote:
Michael,

Allow me to back up for a moment to make sure I understand your problem.

You have a large number of large vectors of the form V_n = {x_i}:  n vectors of size m, where x is a number and x_i is the ith element, or equivalently, the ith dimension.

Assumptions:

  *   All vectors, V, are of the same size m.
  *   All elements, x_i, are valid numbers of the same type. No missing values, and if you are using floats, this means no NaNs.

In aggregate, the n vectors represent m independent distributions of values.

Your task is to be able to obtain m quantiles at rank r in a single query.

****
Doing this with your idea would require vectorizing the entire sketch, not just the compactors.  The inputs are vectors, and the results of operations such as getQuantile(r), getQuantileUpperBound(r), and getQuantileLowerBound(r) are also vectors.

This sketch will be a large data structure, which leads to more questions ...

  *   Do you anticipate having many of these vectorized sketches operating simultaneously?
  *   Is there any requirement to store and later retrieve this sketch?
  *   Or, the nearly equivalent question: Do you require merging of these sketches (across clusters, for example)?  Which also means serialization and deserialization.

I am concerned that this vector-quantiles sketch would be limited in the sense that it may not be as widely applicable as it could be.

Our experience with real data is that it is ugly, with missing values, NaNs, nulls, etc., which means we would not be able to vectorize the compactor.  Each dimension i would need a separate, independent compactor because the compaction times will vary depending on missing values or NaNs in the data.

Space-wise, I don't think a fully vectorized sketch would be much smaller than having separate independent sketches for each dimension, because the internals of the existing sketch are already quite space efficient, leveraging compact arrays, etc.

As a first step I would favor figuring out how to access the NumPy data structure on the C++ side, having individual sketches for each dimension, and doing the iteration that updates the sketches in C++.  This also has the advantage of leveraging code that already exists, and it would automatically benefit from any improvements to the sketch code over time.  In addition, it could be a prototype of how to integrate other sketches into the NumPy ecosystem.

A fully vectorized sketch would be a separate implementation and would not be able to take advantage of these points.

Lee.

On Wed, May 6, 2020 at 2:47 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Lee,

I don't think there is a problem with the DataSketches library, just that it doesn't support what I am trying to do -- looking in the documentation, it only supports streams of ints or floats, and those situations work fine for me.  Here's what I did:
- began with the KLL test .py file: https://github.com/apache/incubator-datasketches-cpp/blob/master/python/tests/kll_test.py
- replaced line 30 with kll.update(np.ones(10) * randn())  to have a Numpy array of 10 identical values.
- ran the code

This leads to the following error, as expected:
TypeError: update(): incompatible function arguments. The following argument types are supported:
    1. (self: datasketches.kll_floats_sketch, item: float) -> None

Invoked with: <datasketches.kll_floats_sketch object at 0x7f1e128989d0>, array([-1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424,
       -1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424])
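
For reference, the reproduction above boils down to roughly this (the k value here is an arbitrary choice):

import numpy as np
from numpy.random import randn
from datasketches import kll_floats_sketch

kll = kll_floats_sketch(160)
kll.update(np.ones(10) * randn())   # raises the TypeError above: update() expects a single float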

It's not coded to support Numpy arrays, so it complains.  What I would ideally like to have happen in this scenario is for it to treat each element in the array as a separate stream.  Then, later, when getting a given quantile, it would give 10 values, one for each stream.  I don't see an easy approach to implementing this on the Python side besides a very slow iterative approach, and admittedly my C++ is quite rusty, so I haven't looked into the codebase to see how I might modify things there to support this functionality.

Re: the streaming-quantiles code being easily modified, I believe the only necessary changes would be changing the Compactor class to be a subclass of numpy.ndarray, rather than list, and implementing the list-specific methods that are used, like .append().  Then it isn't necessary to loop over the streams, since we can make use of Numpy's broadcasting, which handles the looping in its C++ code, as you mentioned.  I'll work on this and see if it really is as straightforward as it seems.
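
A minimal sketch of that idea (the buffer layout and names are mine, not the actual kll.py internals): keeping one column per stream in a 2-D array lets a single NumPy operation do what would otherwise be a Python loop over per-stream lists. The point raised elsewhere in this thread about per-dimension randomness during compaction still applies.

import numpy as np

d = 1000                                 # number of parallel streams
capacity = 64
buf = np.empty((capacity, d))            # one compactor-level buffer, one column per stream
count = 0

def buffered_update(buf, count, x):
    # write a whole length-d vector into all d streams in one C-level operation,
    # instead of looping over d list-based compactors and calling .append() on each
    buf[count, :] = x
    return count + 1

count = buffered_update(buf, count, np.random.randn(d))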

If you have any advice on how to use DataSketches for my problem, I'm certainly open to that.

Thanks,
Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Wednesday, May 6, 2020 4:37 PM
To: Michael Himes <mh...@knights.ucf.edu>>; dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Cc: Edo Liberty <ed...@gmail.com>>; edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Michael,

Thank you for considering the DataSketches library.   I am adding this thread to our dev@datasketches.apache.org<ma...@datasketches.apache.org> so that our whole team can contribute to finding a solution for you.

WRT the error you experienced, please help us help you by sharing with us what the exact error was.

We are about to release a major upgrade to the DataSketches C++/Python product in the next few weeks.  We have fixed a number of stability issues and bugs, which may solve the problem.  Nonetheless, we want to work with you to get your problem solved.

Updating 1e5 sketches in a system is not a problem in Java or C++.   We have real-time systems today that generate and process over 1e9 sketches every day.  Unfortunately our experience tells us that looping in Python code will be 10 to 100 times slower than Java or C++.  This is because the code would have to switch from Python to C++ for every vector element.

By comparison, the streaming-quantiles code could be easily modified to use Numpy arrays and operate on vectors.

I would like to understand more about what you have in mind that would be "easily modified".

NumPy achieves its speed performance by doing all of the matrix operations in pre-compiled C++ code.  To achieve best performance, we would want to read and loop through the NumPy data structure on the C++ side leveraging the C++ DataSketches library directly.  I am not sure what would be involved to actually accomplish that.

But first we need to get your Python + NumPy code working correctly with our library so we can find out what its actual performance is.

Cheers,

Lee.





On Wed, May 6, 2020 at 12:10 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Edo, Lee,

Thanks for the prompt response.  I looked at the datasketches library, and while it seems to have a lot more features, it looks like it'll be a lot more difficult to get it to work for my desired use case.

My problem is that I need quantiles for each element of a vector (length on the order of 1e4 -- 1e5), for some finite stream of vectors (on the order of 1e6 -- 1e8).  I tried using datasketches's KLL with Numpy arrays, but it throws an error, so it doesn't seem like datasketches handles this situation currently.

To use datasketches, I think I would need to instantiate 1 object per vector element, and I suspect this will slow things down considerably due to iterating over the objects when each vector is processed.  By comparison, the streaming-quantiles code could be easily modified to use Numpy arrays and operate on vectors.  I ran a few unit tests on both codes and found equivalent behavior, as expected.

Do you have any recommendation(s) for this situation?  Are there known limitations of the streaming-quantiles code that would cause issues for my use case?  Are the other methods offered in datasketches 'better' than the KLL implemented in streaming-quantiles?  I'm quite out of my area of expertise, so I appreciate any advice you can offer, and I will of course acknowledge it in the publication.

Best,
Michael

________________________________
From: Edo Liberty <ed...@gmail.com>>
Sent: Tuesday, May 5, 2020 8:09 PM
To: Lee Rhodes <lr...@verizonmedia.com>>; Michael Himes <mh...@knights.ucf.edu>>
Cc: edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

+Lee

Hi Michael, Thanks for reaching out.
While you can certainly do that, I recommend using the datasketches library through its Python bindings. It will be more robust, faster, and less buggy than my code :)

On Tue, May 5, 2020 at 14:11 Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Edo,

I'm currently working on a Python package for machine-learning-accelerated exoplanet modeling.  It is free and open source (see here if you're curious https://github.com/exosports/HOMER), and it's meant purely for reproducible academic research.

I'm adding some new features to the software, and one of them requires computing quantiles for a data set that cannot fit into memory.  After searching around for different methods to do this, your KLL method seemed to be a good option in terms of speed and space requirements.

Rather than reinvent the wheel and code my own implementation of the method from scratch, I was wondering if you'd be willing to allow me to use your code?  I don't see a license, so I wanted to make sure you're okay with this.  I could implement it as a submodule within my repo, or I could only include the kll.py file and add some additional comments pointing to your repo and such, whichever you prefer.

Best,
Michael
--
From my cell phone.

Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Posted by Jon Malkin <jo...@gmail.com>.
Yeah, worth fixing. I guess that's fine since it'll check the
dimensionality right after in case you attempt to feed it 3+ dimensions.

  jon

On Thu, Jun 25, 2020 at 1:15 PM Michael Himes <mh...@knights.ucf.edu>
wrote:

> Looked into this a little bit more.  I tried flipping the axes of my
> Python array, but then the code crashes because it tries to access an index
> that is out of memory.
>
> Assuming that I am not just totally botching the usage, I made the
> following change
>
> https://github.com/mdhimes/incubator-datasketches-cpp/blob/master/python/src/vector_of_kll.cpp#L191
> and it works as expected now.  I can create a pull request if needed.
>
> Michael
> ------------------------------
> *From:* Michael Himes <mh...@knights.ucf.edu>
> *Sent:* Thursday, June 25, 2020 2:22 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Hi Jon,
>
> Just got around to updating my project to point to the current 2.0 version
> of Datasketches, and I'm hitting an error.
>
> Whenever I pass in multi-dimensional arrays to update(), it complains
> about the dimensions.  The Numpy array has, e.g., 4 rows of 600 elements
> (array.shape gives (4, 600)), but I get a ValueError saying that the "input
> data must have rows with 600 elements. Found: 4".
>
> Looking in the code here
> https://github.com/apache/incubator-datasketches-cpp/blob/master/python/src/vector_of_kll.cpp#L188
> it seems to be expecting an array of shape (N_elements, N_rows), rather
> than (N_rows, N_elements).  Shouldn't the comparison be with items.shape(1)
> rather than items.shape(0)?  Or am I thinking about this backwards?
>
> Thanks,
> Michael
> ------------------------------
> *From:* Jon Malkin <jo...@gmail.com>
> *Sent:* Monday, June 8, 2020 3:14 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Got distracted from this by a series of bugs both in our c++ release
> candidate as well as in another repo. Anyway, finally finished things and
> created a PR to push this into master. Feel free to comment:
> https://github.com/apache/incubator-datasketches-cpp/pull/156
>
>   jon
>
> On Wed, May 27, 2020 at 6:25 AM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Sounds good to me.
>
> I've been thinking more about merging, and I think selectively merging
> individual dimensions would probably be unnecessary (if you have 1000
> streams, why selectively merge 2 of those?).  But, one thing that I think
> might be a good idea to implement is to be able to merge all of the
> sketches into 1 sketch.  I don't need this for my work, but I can imagine
> an application where there are N streams of the same type of data and this
> would be useful.  Something to think about in the future.
>
> Michael
> ------------------------------
> *From:* Jon Malkin <jo...@gmail.com>
> *Sent:* Tuesday, May 26, 2020 8:13 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Oh, and for merging, I'll make sure that both objects have the same number
> of dimensions and then merge things in. Should be fairly straightforward.
> Not going to support selectively merging individual dimensions, at least
> for now.
>
>
>   jon
>
> On Tue, May 26, 2020 at 4:52 PM Jon Malkin <jo...@gmail.com> wrote:
>
> Thanks for that!
>
> We discussed things a bit on the ASF slack dev channel (datasketches-dev)
> and we'll go with vector_of_kll_sketches as the c++ object name. Probably
> something similar in python. So gotta do that, and then clean up unit test
> names. But it's in pretty good shape so far.
>
>   jon
>
> On Tue, May 26, 2020 at 7:57 AM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> That's a great motto to code by!
>
> I adapted the existing unit tests for kll_sketch to work for the new
> kll_sketches class, and everything seems to be working as intended.  Some
> things are not implemented -- merging, and the normalized_rank_error method
> (note that there is the get_normalized_rank_error static method) -- and are
> therefore not tested.  Once they are implemented into the class, then those
> tests can be added.
>
> I've submitted a pull request, let me know if there are any other tests
> you'd like before it's considered tested & working.
>
> Michael
> ------------------------------
> *From:* Jon Malkin <jo...@gmail.com>
> *Sent:* Tuesday, May 26, 2020 1:53 AM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> I think it now works for quantiles, rank, pmf, and cdf.
>
> This exercise is a good example of why my colleague operates by the motto
> that if it isn't tested, it's broken. In very much related news, we need
> unit tests for this thing, in either C++ or python (probably the latter
> unless we move it into the core C++ part of the repo).
>
>   jon
>
> On Mon, May 25, 2020 at 2:06 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Ah gosh, that was silly on my part.
>
> So, I ran the previous code without that silly mistake, then called
> kll.get_quantiles(0.5) and it threw this error:
>
> TypeError: get_quantiles(): incompatible function arguments. The following
> argument types are supported:
>     1. (self: datasketches.kll_floatarray_sketches, fractions:
> List[float], isk: numpy.ndarray[int32] = -1) -> array
>
> Invoked with: <datasketches.kll_floatarray_sketches object at
> 0x7f610ce7de30>, 0.5
>
> I also tried kll.get_quantiles([0.5]) and the Numpy array equivalent, and
> it throws this error:
>
> ValueError: array has incorrect number of dimensions: 0; expected 1
>
> This error happens even when I do kll.get_quantiles([0.5, 0.7]) or the
> Numpy array equivalent, even though it has 1 dimension, not 0.
>
> Michael
> ------------------------------
> *From:* Jon Malkin <jo...@gmail.com>
> *Sent:* Monday, May 25, 2020 4:53 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> That's the range() command complaining -- 1e6 is a float, but range wants
> an int. It worked if I instead changed the line to
> for i in range(int(1e6)):
>
>   jon
>
> On Mon, May 25, 2020 at 1:36 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Jon,
>
> Just got around to testing it out.  Maybe I am doing something wrong here,
> but I can't get the code to work correctly.  Here's the code:
>
> import numpy as np
> from datasketches import kll_floatarray_sketches
> k = 160
> d = 3
> kll = kll_floatarray_sketches(k, d)
> for i in range(1e6):
>   kll.update(np.random.randn(d))
>
> And here's the error:
>
> TypeError: 'float' object cannot be interpreted as an integer
>
> Seems like the inputs have changed, but the inputs in the code look pretty
> similar.  Can you point out what I'm doing wrong here?
>
> Michael
> ------------------------------
> *From:* Jon Malkin <jo...@gmail.com>
> *Sent:* Friday, May 22, 2020 6:21 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Hi Michael,
>
> My default is to treat an input vector x as a column vector
> -- the generic quadratic form x^T A x assumes that, for instance. But might
> be an engineering thing. Following your approach for now and eventually we
> can debate whether to transpose the matrix if one dimension matches the
> number of sketches in the object but not the expected one.
>
> Anyway, I looked more at the docs and see them using unchecked references
> (after doing a bounds check) so I switched to that, and then I added in a
> check for c-style vs fortran-style indexing so that I believe it'll have
> the inner loop over the native dimension. In theory it'll walk linearly
> through the matrix. That or I got it exactly backwards and am thrashing
> some cache level, one of the two :D
>
> If you have some time, please check out the branch and play with it for a
> bit to ensure it's still behaving as you expect. Then we can figure out
> some relevant unit tests,
>
>   jon
>
>
>
> On Fri, May 22, 2020 at 7:06 AM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Jon,
>
> Those changes sound great, as long as the data is being accessed
> correctly. The pybind docs warn about accessing data through the array_t
> object since it's not guaranteed to be contiguous in memory.  Typically,
> they demonstrate accessing it through the buffer, which I followed.  But if
> this is an unnecessary step, then great.
>
> As for the 2D case, here is my line of thinking.  For 1D, we have a single
> row with d values.  So for 2D, we'd have n rows with d values, (n x d).  I
> believe that is how I coded it, but it's possible I flipped the dimensions.
>
> Michael
>
> ------------------------------
> *From:* Jon Malkin <jo...@gmail.com>
> *Sent:* Thursday, May 21, 2020 7:17 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Michael,
>
> I've restructured the object to be an actual C++ object with proper
> methods. And then I've gotten rid of all the casts to buffer in favor of
> just using the py::array_t<> that's passed in. That removes casting
> everything to double, and allows for range checks. Now an attempt to access
> sketch 7 in a 5-d array doesn't just segfault :)
>
> Looking at pybind docs a bit more, it seems there are no hard guarantees
> on data layout in memory with numpy arrays -- if you transpose one, walking
> through with a pointer will return items in the wrong order. So update()
> ends up using items.at() instead (more on that in a moment). The whole thing is probably also
> copying values around more than necessary. Anyway, we can look at ways to
> optimize such things eventually, but for now I'm working on ensuring
> correctness and at least somewhat graceful failure.
>
> Anyway, item input order. If we have 1-d input, we implicitly assume we
> want d updates, one for each dimension in the object. It seems like the
> default for numpy is row-major order, which makes sense given C beneath the
> hood. But for inputting n points at a time, do you expect the matrix to be
> (d x n) or (n x d)?
>
>   jon
>
> On Tue, May 19, 2020, 5:20 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Re: the template type A, I set that for the Python array data type.  A
> Python float is 64 bits, so that is a C++ double.  I thought it was
> necessary to set the py::array_t data type since I think it's a template,
> but I could be mistaken.
>
> Michael
>
> ------------------------------
> *From:* leerho <le...@gmail.com>
> *Sent:* Tuesday, May 19, 2020 7:46 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Excellent work!
>
> On Tue, May 19, 2020 at 4:04 PM Jon Malkin <jo...@gmail.com> wrote:
>
> I also used k=160, so in this case we matched nicely. And the bunches of
> 2^5 or 2^7 you were testing is exactly what I meant when referring to
> batched inputs. So that's good news.
>
> I'll take a more careful look through the code -- there was something with
> update using arrays of templated type A which was always cast to double,
> for instance. But this is certainly promising.
>
>   jon
>
> On Tue, May 19, 2020 at 3:32 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Great tests (especially with the ordering), Jon!
>
> I did some scaling tests for dimensionality (1, 10, and 100), and this is
> where I think the Numpy version shows its benefits.  I performed a test
> similar to your setup:
> - each sketch has k = 160 (unsure what you used for this value, if it
> matters)
> - 2^25 draws from a normalized Gaussian distribution (numpy.random.normal)
> - get_quantiles(0.5)
>
> d=1    -- 84 s (this is the 123 s case you ran)
> d=10   -- 88 s
> d=100  --  294 s
> d=1000 -- 2298 s (did this one for fun, but there is a lot of variability
> in runtime)
>
> Note that I did not use a single-value method, just the Numpy version.
> Also, I checked the compute cost of the Python loop, and it's about 1
> second, so most of that ~80 seconds is the communication between Python and
> C++.  The scaling relation looks to be better than linear, but there needs
> to be a few more tests here to really determine that.
>
> But, as Lee pointed out, there is non-negligible overhead from crossing
> the bridge between Python and C++.  It's small, but when doing it 2^25
> times it adds up.  The Numpy implementation allows you to cross that bridge
> much less often, albeit at the cost of some extra time programming that
> part.  If I set up a queue that holds 2^5 values and then updates it, it's
> quite a bit better.  Here are the results for the same dimensions as before:
>
> d=1   -- 8 s
> d=10  -- 31 s
> d=100 -- 257 s
>
> So, even with a small queue of 32 values, we see that a single sketch
> using kll_sketches is faster than a kll_sketch by a factor of 2-3.  And
> with the batch set to 2^7 values (this is how I use it in my project):
> d=1   -- 4.2 s
> d=10  -- 27 s
> d=100 -- 251 s
>
> The speed gain doesn't seem to scale with dimensionality, but I think that
> has more to do with the compute overhead of generating the data since Numpy
> tends to be faster when working in 1D vs multiple dimensions.  But we can
> see that it's possible to get runtimes much closer to C++ runtimes than
> would be expected.
>
> Michael
> ------------------------------
> *From:* Jon Malkin <jo...@gmail.com>
> *Sent:* Tuesday, May 19, 2020 4:58 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Well, one thought was maybe we could always use the vectorized kll in
> python and make it (relatively) easy to have it work with only 1 dimension.
> It looks like there's still a non-trivial performance hit from that. But
> wow.. I realized I could try something simple like reversing the
> declaration order of single-update vs vector-update in the wrapper class.
> And that dropped it to 35s!
>
> With that, it may be worth exploring a unified wrapper that handles single
> items or vectors.
>
>   jon
>
> On Tue, May 19, 2020 at 1:52 PM leerho <le...@gmail.com> wrote:
>
> We had a similar issue in Java trying to use JNI to access C code.  Every
> transition across the "boundary" between Java and C took from 10 to 100
> microseconds.  This made the JNI option pretty useless from our
> standpoint.
>
> I don't know python that well, but I could well imagine that there may be
> a similar issue here in moving data between Python and C++.
>
> That being said, compared to brute-force computation of these types of
> queries in Python, using even these (what we consider slow-performing)
> sketches in Python may still be a huge win.
>
> Lee.
>
>
>
> On Tue, May 19, 2020 at 1:28 PM Jon Malkin <jo...@gmail.com> wrote:
>
> I tried comparing the performance of the existing floats sketch vs the new
> thing with a single dimension. And then I made a second update method that
> handles a single item rather than creating an array of length 1 each time.
> Otherwise, the scripts were as identical as possible. I fed in 2^25
> gaussian-distributed values and queried for the median to force some
> computation on the sketch. I think get_quantile(0.5) vs
> get_quantiles(0.5)[0][0] was the only difference.
>
> Existing kll_floats_sketch: 31s
> kll_floatarray_sketches: 123s
> with single-item update: 80s
>
> Same test in c++: 1.7s  (I can get it to 1.4s but that's using a worse RNG
> so this seemed more fair)
>
> I didn't try anything with trying to batch updates, even though in theory
> the new object can support that. This was more a test to see the
> performance impact of using it for all kll sketches.
>
> At some level, if you're already ok taking the speed hit for python vs C++
> then maybe it doesn't matter. But >2x still seems significant to me.
>
>   jon
>
> On Thu, May 14, 2020 at 6:54 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Great, I'll be submitting the pull request shortly.  The codebase I'm
> working with doesn't have any of the changes made in the past week or so,
> hopefully that isn't too much of a hassle to merge.
>
> As an aside, my employer encourages us to contribute code to libraries
> like this, so I'm happy to work on additional features for the Python
> interface as needed.
>
> Michael
> ------------------------------
> *From:* Jon Malkin <jo...@gmail.com>
> *Sent:* Thursday, May 14, 2020 6:56 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> We've been polishing things up for a release, so that was one of several
> things that we fixed over the last several days. Thank you for finding it!
>
> Anyway, if you're generally happy with the state of things (and are
> allowed to under any employment terms), I'd encourage you to create pull
> request to merge your changes into the main repo. It doesn't need to be
> perfect as we can always make changes as part of the PR review or
> post-merge.
>
> Thanks,
>   jon
>
>
> On Mon, May 11, 2020 at 2:25 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Thanks for taking a look, Jon.
>
> I pushed an update that address 2 & 4.
>
> #3 is actually something I had a question about. I've tested passing
> numpy.nan into the update function, and it doesn't appear to break anything
> (min, max, etc all still work correctly).  However, the reported number of
> items per sketch counts the nan entries.  Is this the expected behavior, or
> should the get_n() method return a number that does not count the nans it
> has seen?  I expected the latter, so I'm worried that numpy's nan is being
> treated differently.
>
> Michael
> ------------------------------
> *From:* Jon Malkin <jo...@gmail.com>
> *Sent:* Monday, May 11, 2020 4:32 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> I didn't look in super close detail, but the code overall looks pretty
> good. Comments are below.
>
> Note that not all of these necessarily need changes or replies. I'm just
> trying to document things we'll want to think about for keeping the library
> general-purpose (and we can always make changes after merging, of course).
>
> 1. I worry the name kll_sketches is confusingly similar to kll_sketch.
> Maybe vector_kll_sketches? But if there's a way to extend KLL in the future
> to operate on an entire vector at a time (vs treating each dimension
> independently) that'd become confusing. I think an inherently vectorized
> version would be a very different beast, but I always worry I'm not being
> imaginative enough. If merging into the Apache codebase, I'd probably wait
> to see what the file looks like with the renaming before a final decision
> on moving to its own file.
>
> 2. What happens if the input to update() has >2 dimensions? If that'd be
> invalid, we should explicitly check and complain. If it'll Do The Right
> Thing by operating on the first 2 dimensions (meaning correct indices)
> that's fine, but otherwise should probably complain.
>
> 3. Can this handle sparse input vectors? Not sure how important that is in
> general, even if your project doesn't require it. kll_sketch will ignore
> NaNs, so those appearing would mean the number of items per sketch can
> already differ.
>
> 4. I'd probably eat the very slightly increased space and go with 32 bits
> for the number of dimensions (aka number of sketches). If trying to look at
> a distribution of values for some machine learning application, it'd be
> easy to overflow 65k dimensions for some tasks.
>
> 5. I imagine you've realized that it's easiest to do unit tests from
> python in this case. That's another advantage of having this live in the
> wrapper.
>
> 6. Finally, that assert issue is already obsolete :). Asserts were
> converted to if/throw exceptions late last week. It'll be flagged as a
> conflict in merging, so no worries for now.
>
> Looking good at this point. And as I said, not all of these need changes
> or comments from you.
>
>   jon
>
> On Mon, May 11, 2020 at 7:09 AM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Understood, I went ahead and moved the new class to the kll_wrapper.cpp
> file -- I'll leave it to you to decide if it's better as its own file.
>
> Also, while gcc 7.4.0 compiles the code without issue, using gcc 7.5.0
> throws errors regarding the assert calls in kll_sketch_impl.hpp.  I added
> an include of assert.h there and then it compiled without issue.  It's
> possible that other compilers will also complain about that, so maybe this
> is a good update to the main branch.
>
> Michael
> ------------------------------
> *From:* Jon Malkin <jo...@gmail.com>
> *Sent:* Sunday, May 10, 2020 10:47 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> My only comment without having looked at actual code is that the new class
> would be more appropriate in the python wrapper. Maybe even drop it in as
> its own file, as that would decrease recompile time a bit when debugging
> (that's pybind's suggestion, anyway). Probably not a huge difference with
> how light these wrappers are.
>
> If this is something that becomes widely used, to where we look at pushing
> it into the base library, we'd look at whether we could share any data
> across sketches. But we're far from that point currently. It'd be nice to
> need to consider that.
>
>   jon
>
> On Sun, May 10, 2020, 7:33 PM leerho <le...@gmail.com> wrote:
>
> Michael,  this has been a great interchange and certainly will allow us to
> move forward more quickly.
>
> Thank you for working on this on a Mother's Day Sunday!
>
> I'm sure Alex and Jon may have more questions, when they get a chance to
> look at it starting tomorrow.
>
> Cheers, and be safe and well!
>
> Lee.
>
> On Sun, May 10, 2020 at 6:25 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Re: testing, so far I've just done glorified unit tests for uniform and
> normal distributions of varying sizes.  I plan to do some timing tests vs
> the existing single-sketch Python class to see how it compares for 1, 10,
> and 100 streams.
>
> 1. That makes sense.  One option to allow full Numpy compatibility but
> without requiring a Python user to use Numpy would be to return everything
> as lists, rather than Numpy arrays.  Numpy users could then convert those
> lists into arrays, and non-Numpy users would be unaffected (aside from
> needing the pybind11/numpy.h header).  Alternatively, some flag could be
> set when instantiating the object that would control whether things are
> returned as lists or arrays, though this still requires the numpy.h header
> file.
>
> 2. I didn't change the kll_sketch code, I only defined a new (wrapper)
> class called kll_sketches, which spawns a user-specified number of
> sketches.  Each of those sketches are kll_sketch objects and uses all of
> the existing code for that.  For fast execution in Python, the parallel
> sketches must be spawned in C++, but the existing Python object could only
> spawn a single sketch since it wraps the kll_sketch class.  Perhaps the
> kll_sketches class would be better placed in the python/src/kll_wrapper.cpp
> file?  I suppose you wouldn't need this class if you weren't using Python.
>
> 3. Yes, SerDe is very straight-forward here.  I've marked some stuff as
> todo's, and that is one of them -- the plan is to do like you described and
> call the relevant kll_sketch method on each of the sketches and return that
> to Python in a sensible format.  For deserialization, it would just iterate
> through them and load them into the kll_sketches object.  I don't require
> it for my project, so I didn't bother to wrap that yet -- I'll take a look
> sometime this week after I finish my work for the day, shouldn't take long
> to do.
>
> 4. That makes sense.  Does using Numpy complicate that at all?  My thought
> is that since under the hood everything is using the existing kll_sketch
> class, it would have full compatibility with the rest of the library (once
> SerDe is added in).
>
> Michael
> ------------------------------
> *From:* leerho <le...@gmail.com>
> *Sent:* Sunday, May 10, 2020 8:42 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Thanks for the link to your code.  My colleagues, Jon and Alex, will take
> a closer look this next week.  They wrote this code so they are much closer
> to it than I.
>
> What you have done so far makes sense for you as you want to get this
> working in the NumPy environment as quickly as possible.  As soon as we
> start thinking about incorporating this into our library other concerns
> become important.
>
> 1. Adding API calls is the recommended way to add functionality (like
> NumPy) to a library.  We cannot change API calls in a way that is only
> useful with NumPy, because it would seriously impact other users of the
> library that don't need NumPy.  If both sets of calls cannot simultaneously
> exist in the same sketch API, then we need to consider other alternatives.
>
> 2.  Based on our previous discussions, I didn't envision that you would
> have to change the kll_sketch code itself other than perhaps a "wrapper"
> class that enables vectorized input to a vector of sketches and a
> vectorized get result that creates a vector result from a vector of
> sketches.  This would isolate the changes you need for NumPy from the
> sketch itself.  This is also much easier to support, maintain and debug.
>
> 3. If you don't change the internals of the sketch then SerDe becomes
> pretty straightforward. I don't know if you need a single serialization
> that represents a full vector of sketches,  but if you do, then I would
> just iterate over the individual serdes and figure out how to package it.
> I really don't think you want to have to rewrite this low-level stuff.
>
> 4. Binary compatibility is critically important for us and I think will be
> important for you as well.  There are two dimensions of binary
> compatibility: history and language.  This means that a kll sketch
> serialized from Java can be successfully read by C++ and vice versa.
> Similarly, a kll sketch serialized today will be able to be read many years
> from now.     Another aspect of this would mean being able to collect, say,
> 100 sketches that were not created using the NumPy version, and being able
> to put them together in a NumPy vector; and vice versa.
>
> I hope all of this make sense to you.
>
> Cheers,
>
> Lee.
>
>
>
> On Sun, May 10, 2020 at 4:21 PM leerho <le...@gmail.com> wrote:
>
> Michael,
> This is great!  What testing have you been able to do so far?
>
>
> On Sun, May 10, 2020 at 3:31 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Lee,
>
> Thanks for all of that information, it's quite helpful to get a better
> understanding of things.
>
> I've put the code on Github if you'd like to take a look:
> https://github.com/mdhimes/incubator-datasketches-cpp
>
> Changes are
> - new class in kll/include/kll_sketch.hpp, w/ associated constructor in
> kll/include/kll_sketch_impl.hpp.  This class spawns a specified number of
> sketches.
> - new Python interface functions in python/src/kll_wrapper.cpp
>
> The only new dependency introduced is the pybind11/numpy.h header file.
> The new Numpy-compatible Python classes retain identical functionality to
> the existing classes (with minor changes to method names, e.g.,
> get_min_value --> get_min_values), except that I have not yet implemented
> merging or (de)serialization.  These would be straight-forward to
> implement, if needed.
>
> Re: characterization tests, I'll take a look at those tests you linked to
> and see about running them, time and compute resources permitting.
>
> Michael
> ------------------------------
> *From:* leerho <le...@gmail.com>
> *Sent:* Sunday, May 10, 2020 5:32 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Michael,
>
> Is there a place on GitHub somewhere where I could look at your code so
> far?  The reason I ask, is before you do a PR, we would like to determine
> where a contribution such as this should be placed.
>
> Our library is split up among different repositories, determined by
> language and dependencies.  This keeps the user downloads smaller and more
> focused.   We have two library repos for the core sketch algorithms, one
> for Java and one for C++/Python, where the dependencies are very lean,
> which simplifies integration into other systems.  We have separate repos
> for adaptors, which depend on one of the core repos. On the Java side, we
> have separate repos for adaptors for Apache Hive and Apache Pig, as the
> dependencies for each of these are quite large.  For C++, we have a
> dedicated repo for the adaptors for PostgreSQL.
>
> Some of our adaptors are hosted with the target system.  For example, our
> Druid adaptors were contributed directly into Apache Druid.
>
> I assume your code has dependencies on Python, NumPy and DataSketches-cpp.
> It is not clear to me at the moment whether we should create a separate
> repo for this or have a separate group of directories in our cpp repo.
>
> ****
> We have a separate repo for our characterization code, which is not
> formally "released" as an Apache release.  It exists because we want others
> to be able to reproduce (or challenge) our claims of speed performance or
> accuracy.  It is the one repo where we have all languages and many
> different dependencies.  The coding style is not as rigorous or as well
> documented as our repos that do have formal releases.
>
> Characterization testing is distinctly different from Unit Tests, which
> basically checks all the main code paths and makes sure that the program
> works as it should.  The key metric is code coverage and Unit Tests should
> be fast as it is run on every check-in of new code.  Characterization is
> also different from Integration Testing, which is testing how well the code
> works when integrated into larger systems.
>
> Characterization tests are unique to our kind of library. Because our
> algorithms are probabilistic in nature, in order to verify accuracy or
> speed performance we need to run many thousands of trials to eliminate
> statistical noise in the results.  And when the data is large, this can
> take a long time.  You can peruse our website for many examples as all the
> plots result from various characterization studies.  What appears on the
> website is but a small fraction of all the testing we have done.
>
> There are no "standard" tests as every sketch is different so we have to
> decide what is important to measure for a particular sketch, but the basic
> groups are *speed* and *accuracy*.
>
> For speed there are many possible measurements, but the basic ones are
> update speed, merge speed, Serialization / Deserialization speed, get
> estimate or get result speeds.
>
> For accuracy we want to validate that the sketch is performing within the
> bounds of the theoretical error distribution.  We want to measure this
> accuracy in the context of a stand-alone, purely streaming sketch and also
> in the context of merging many sketches together.
>
> We also try to do these same tests comparing the results against other
> alternatives users might have.  We have performed these same
> characterizations on other publicly available sketches as well as against
> traditional, brute-force approaches to solving the same problem.
>
> For the solution you have developed, we would depend on you to decide what
> properties would be most important to characterize for users of this
> solution.  It should be very similar to what you would write in a paper
> describing this solution;  you want to convince the reader that this is
> very useful and why.
>
> Since the first sketch you have leveraged is the KLL quantiles sketch, I
> would think you would want some characterizations similar to what we did
> for our studies
> <https://datasketches.apache.org/docs/Quantiles/KLLSketch.html>
> comparing our older quantiles sketch and the KLL sketch.
>
> ****
> For the Java characterization tests, we have "standardized" on having
> small configuration files which define the key parameters of the test.
> These are simple text files
> <https://github.com/apache/incubator-datasketches-characterization/tree/master/src/main/resources>
> of key-value pairs.  We don't have any centralized definition of these
> pairs, just that they are human readable and intelligible.  They are
> different for each type of sketch.
>
> For the C++ tests, we don't have a collection of config files yet (this is
> one of our TODOs), but the same kind of parameters are set in the code
> itself.
>
> We will likely want to set up a separate directory for your
> characterization tests.
>
> I hope you find this helpful.
>
> Cheers,
>
> Lee.
>
> On Sun, May 10, 2020 at 10:05 AM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> The code is in a good state now.  It can take individual values, lists, or
> Numpy arrays as input, and it returns back Numpy arrays.  There are some
> additional features, like being able to specify which sketches the user
> wants to, e.g., get quantiles for.
>
> But, I have only done minor testing with uniform and normal
> distributions.  I'd like to put it through more extensive testing (and some
> documentation) before releasing it, and it sounds like your
> characterization tests are the way to go -- it's not science if it's not
> reproducible!  Is there a standard set of tests for this purpose?  If not,
> are there standard tests that have been used for the existing codebase?
>
> Michael
> ------------------------------
> *From:* leerho <le...@gmail.com>
> *Sent:* Saturday, May 9, 2020 7:21 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> This is great.  The first step is to get your project working!  Once you
> think you are ready, it would be really useful if you could do some
> characterization testing in the NumPy environment. Characterization tests
> are what we run to fully understand how a sketch performs over a range of
> parameters and using thousands to millions of trials.  You can see some of
> the accuracy and speed performance plots of various sketches on our
> website.  Sometimes these can take hours to run.  We typically use
> synthetic data to drive our characterization tests to make them
> reproducible.
>
> Real data can also be used and one comparison test I would recommend is
> comparing how long it takes to get approximate results using sketches
> versus how long it would take to get exact results using brute force
> methods.  The bigger the data set is the better :)
>
> We don't have much experience with NumPy so this will be a new environment
> for us.  But before you get too deep into this please get us involved.  We
> have been characterizing these streaming algorithms for a number of years,
> and would like to help you.
>
> Cheers,
>
> Lee.
>
> On Sat, May 9, 2020 at 2:18 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> I'm not quite sure what being a committer entails, but yeah I'm happy to
> contribute.  I can't commit a lot of time to working on it, but with how
> things went for KLL I don't think it will take a lot of time for the other
> sketches if they are formatted in a similar manner.  Getting this library
> integrated into numpy/scipy would be awesome, I'm sure I could get some
> others in my field to begin using it.
>
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Saturday, May 9, 2020 5:06 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>; dev@datasketches.apache.org
> <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> This is just awesome!   Would you be interested in becoming a committer on
> our project?  It is not automatic, but we could work with you to bring you
> up to speed on the other sketches in the library.  If you could help us
> integrate DataSketches into NumPy and possibly SciPy (not sure if this is
> necessary) it would be a very significant contribution and we would
> definitely want you to be part of our community!
>
> Thanks,
>
> Lee.
>
> On Sat, May 9, 2020 at 1:41 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Lee,
>
> Thanks for the notice, I went ahead and subscribed to the list.
>
> As for Jon's email, this is actually what I have currently implemented!
> Once I finish ironing out a couple improvements, I'm going to move some
> code around to follow the existing coding style, put it on Github, and
> submit a pull request.
>
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Saturday, May 9, 2020 4:22 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>
> *Subject:* Fwd: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Hi Michael,
> I don't think you saw this email as I doubt you are subscribed to our
> dev@datasketches.apache.org email list.
>
> We would like to have you as part of our larger community, as others might
> also have suggestions on how to move your project forward.
> You can subscribe by sending an empty email to
> dev-subscribe@datasketches.apache.org.
>
> Lee.
>
> ---------- Forwarded message ---------
> From: *Jon Malkin* <jo...@gmail.com>
> Date: Thu, May 7, 2020 at 4:11 PM
> Subject: Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
> To: <de...@datasketches.apache.org>
> Cc: Lee Rhodes <lr...@verizonmedia.com>, Edo Liberty <
> edo.liberty@gmail.com>, edo@edoliberty.com <ed...@edoliberty.com>
>
>
> We're using pybind11 to get a C++ interface with python (vs raw C). The
> wrappers themselves are quite thin, but they do have examples of calling
> functions defined in the wrapper as opposed to only the sketch object.
>
> I believe the easiest way to do this will be to define a pretty simple C++
> object and create a pybind wrapper for it.  That object would contain a
> std::vector<kll_sketch>.  Then you'd define an update method for your
> custom object that iterates through a numpy array and calls update() on the
> appropriate sketch. You'd also want to define something similar for
> get_quantile() or whatever other methods you need that iterates through
> that vector of sketches and returns the result in a numpy array.
>
> That's a pretty lightweight object. And then you'd use a similar thin
> pybind wrapper around it to make it play nicely with python. Since our C++
> library is just templates, you'd end up with a free-standing library, with
> no requirement that the base datasketches library be involved.
>
>   jon
>
> On Thu, May 7, 2020 at 1:08 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> I would be happy to share whatever I come up with (if anything).  The lack
> of a Numpy/Scipy implementation is what led me to the DataSketches library,
> it would be very useful to myself and others if it were a part of
> Numpy/Scipy.
>
> For what it's worth, passing in a Numpy array and manipulating it from the
> C++ side is quite easy.  On the other hand, figuring out how to spawn m
> sketches and pass the values along to that looks like it'll be more
> challenging, there is a lot of code here and it'll take some time for me to
> familiarize myself with it.
>
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Thursday, May 7, 2020 12:00 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>
> *Cc:* Edo Liberty <ed...@gmail.com>; dev@datasketches.apache.org <
> dev@datasketches.apache.org>; edo@edoliberty.com <ed...@edoliberty.com>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> If you do figure out how to do this, it would be great if you could share
> it with us.  We would like to extend  it to other sketches and submit it as
> an added functionality to NumPy.  I have been looking at the NumPy and
> SciPy libraries and have not found anything close to what we have.
>
> Lee.
>
>
> On Thu, May 7, 2020 at 7:08 AM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Lee, Jon,
>
> Thanks for the information.  I tried to vectorize things this morning and
> ran into that exact problem -- since the offsets can differ, it leads to
> slices of different lengths, which wouldn't be possible to store as a
> single Numpy array.
>
> Lee, your understanding of my problem is spot on.  n vectors of size m,
> where all m elements of each vector are a float (no NaNs or missing
> values).  I am interested in quantiles at rank r for each of the m
> streams.  Only 1 sketch will operate simultaneously, saving/loading the
> sketch is not required (though it would be a nice feature), and sketches
> would not need to be merged (no serialization/deserialization).
>
> Not surprisingly, it looks like your original suggestion of handling this
> on the C++ side is the way to go.  Once I have time to dive into the code,
> my plan is to write something that implements what you described in the
> earlier email.
>
> Thanks,
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Wednesday, May 6, 2020 10:43 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>
> *Cc:* dev@datasketches.apache.org <de...@datasketches.apache.org>; Edo
> Liberty <ed...@gmail.com>; edo@edoliberty.com <ed...@edoliberty.com>
>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Michael,
>
> One of my colleagues, Jon Malkin, pointed out that the vector-KLL will not
> work for another reason and that is for each dimension, choosing whether to
> delete the odd or even values in the compactor must be random and
> independent of the other dimensions.  Otherwise you might get unwanted
> correlation effects between the dimensions.
>
> This is another argument that you should have independent compactors for
> each dimension.  So you might as well stick with individual sketches for
> each dimension.
>
> Lee.
>
> On Wed, May 6, 2020 at 4:39 PM Lee Rhodes <lr...@verizonmedia.com>
> wrote:
>
> Michael,
>
> Allow me to back up for a moment to make sure I understand your problem.
>
> You have a large number of large vectors of the form *V_n = {x_i}:*  *n*
> vectors of size *m, *where *x* is a *number* and *x_i* is the *i*th
> element, or equivalently, the *i*th dimension.
>
> Assumptions:
>
>    - All vectors, *V*, are of the same size *m.*
>    - All elements, *x_i*, are valid numbers of the same type. No missing
>    values, and if you are using *floats*, this means no *NaN*s.
>
> In aggregate, the *n* vectors represent *m* *independent* distributions
> of values.
>
> Your task is to be able to obtain *m* quantiles at rank *r* in a single
> query.
>
> ****
> To do this, using your idea, would require vectorization of the entire
> sketch and not just the compactors.  The inputs are vectors, the result of
> operations such as getQuantile(r), getQuantileUpperBound(r),
> getQuantileLowerBound(r), are also vectors.
>
> This sketch will be a large data structure, which leads to more questions
> ...
>
>    - Do you anticipate having many of these vectorized sketches operating
>    simultaneously?
>    - Is there any requirement to store and later retrieve this sketch?
>    - Or, the nearly equivalent question: Do you require merging of these
>    sketches (across clusters, for example)?  Which also means serialization
>    and deserialization.
>
> I am concerned that this vector-quantiles sketch would be limited in the
> sense that it may not be as widely applicable as it could be.
>
> Our experience with real data is that it is ugly with missing values, NaN,
> nulls, etc.  Which means we would not be able to vectorize the compactor.
> Each dimension *i* would need a separate independent compactor because
> the compaction times will vary depending on missing values or NaNs in the
> data.
>
> Spacewise, I don't think having separate independent sketches for each
> dimension would be much smaller than vectorizing the entire sketch, because
> the internals of the existing sketch are already quite space efficient
> leveraging compact arrays, etc.
>
> As a first step I would favor figuring out how to access the NumPy data
> structure on the C++ side, having individual sketches for each
> dimension, and doing the iterations updating the sketches in C++.   It also
> has the advantage of leveraging code that exists and it would automatically
> be able to leverage any improvements to the sketch code over time.  In
> addition, it could be a prototype of how to integrate other sketches into
> the NumPy ecosystem.
>
> A fully vectorized sketch would be a separate implementation and would not
> be able to take advantage of these points.
>
> Lee.
>
> On Wed, May 6, 2020 at 2:47 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Lee,
>
> I don't think there is a problem with the DataSketches library, just that
> it doesn't support what I am trying to do -- looking in the documentation,
> it only supports streams of ints or floats, and those situations work fine
> for me.  Here's what I did:
> - began with the KLL test .py file:
> https://github.com/apache/incubator-datasketches-cpp/blob/master/python/tests/kll_test.py
> - replaced line 30 with kll.update(np.ones(10) * randn())  to have a Numpy
> array of 10 identical values.
> - ran the code
>
> This leads to the following error, as expected:
> TypeError: update(): incompatible function arguments. The following
> argument types are supported:
>     1. (self: datasketches.kll_floats_sketch, item: float) -> None
>
> Invoked with: <datasketches.kll_floats_sketch object at 0x7f1e128989d0>,
> array([-1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424,
>        -1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424])
>
> It's not coded to support Numpy arrays, therefore it complains.  What I
> would ideally like to have happen in this scenario is it would treat each
> element in the array as a separate stream.  Then, later when getting a
> given quantile, it would give 10 values, one for each stream.  I don't see
> an easy approach to implementing this on the Python side besides a very
> slow iterative approach, and admittedly my C++ is quite rusty so I haven't
> looked into the codebase to see how I might modify things there to support
> this functionality.
>
> Re: the streaming-quantiles code being easily modified, I believe the only
> necessary changes would be changing the Compactor class to be a subclass of
> numpy.ndarray, rather than list, and implementing methods for the
> list-specific methods that are used, like .append().  Then, it isn't
> necessary to loop over the streams since we can make use of Numpy's
> broadcasting, which will handle the looping in its C++ code, as you
> mentioned.  I'll work on this and see if it really is as straight-forward
> as it seems.
>
> If you have any advice on how to use DataSketches for my problem, I'm
> certainly open to that.
>
> Thanks,
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Wednesday, May 6, 2020 4:37 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>; dev@datasketches.apache.org
> <de...@datasketches.apache.org>
> *Cc:* Edo Liberty <ed...@gmail.com>; edo@edoliberty.com <
> edo@edoliberty.com>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Michael,
>
> Thank you for considering the DataSketches library.   I am adding this
> thread to our dev@datasketches.apache.org so that our whole team can
> contribute to finding a solution for you.
>
> WRT the error you experienced, please help us help you by sharing with us
> what the exact error was.
>
> We are about to release a major upgrade to the DataSketches C++/Python
> product in the next few weeks.  We have fixed a number of stability issues
> and bugs, which may solve the problem.  Nonetheless, we want to work with
> you to get your problem solved.
>
> Updating 1e5 sketches in a system is not a problem in Java or C++.   We
> have real-time systems today that generate and process over 1e9 sketches
> every day.  Unfortunately our experience tells us that looping in Python
> code will be 10 to 100 times slower than Java or C++.  This is because the
> code would have to switch from Python to C++ for every vector element.
>
> By comparison, the streaming-quantiles code could be easily modified to
> use Numpy arrays and operate on vectors.
>
>
> I would like to understand more about what you have in mind that would be
> "easily modified".
>
> NumPy achieves its speed performance by doing all of the matrix operations
> in pre-compiled C++ code.  To achieve best performance, we would want to
> read and loop through the NumPy data structure on the C++ side leveraging
> the C++ DataSketches library directly.  I am not sure what would be
> involved to actually accomplish that.
>
> But first we need to get your Python + NumPy code working correctly with
> our library so we can find out what its actual performance is.
>
> Cheers,
>
> Lee.
>
>
>
>
>
> On Wed, May 6, 2020 at 12:10 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Edo, Lee,
>
> Thanks for the prompt response.  I looked at the datasketches library, and
> while it seems to have a lot more features, it looks like it'll be a lot
> more difficult to get it to work for my desired use case.
>
> My problem is that I need quantiles for each element of a vector (length
> on the order of 1e4 -- 1e5), for some finite stream of vectors (on the
> order of 1e6 -- 1e8).  I tried using datasketches's KLL with Numpy arrays,
> but it throws an error, so it doesn't seem like datasketches handles this
> situation currently.
>
> To use datasketches, I think I would need to instantiate 1 object per
> vector element, and I suspect this will slow things down considerably due
> to iterating over the objects when each vector is processed.  By
> comparison, the streaming-quantiles code could be easily modified to use
> Numpy arrays and operate on vectors.  I ran a few unit tests on both codes
> and found equivalent behavior, as expected.
>
> Do you have any recommendation(s) for this situation?  Are there known
> limitations of the streaming-quantiles code that would cause issues for my
> use case?  Are the other methods offered in datasketches 'better' than the
> KLL implemented in streaming-quantiles?  I'm quite out of my area of
> expertise, so I appreciate any advice you can offer, and I will of course
> acknowledge it in the publication.
>
> Best,
> Michael
>
> ------------------------------
> *From:* Edo Liberty <ed...@gmail.com>
> *Sent:* Tuesday, May 5, 2020 8:09 PM
> *To:* Lee Rhodes <lr...@verizonmedia.com>; Michael Himes <
> mhimes@knights.ucf.edu>
> *Cc:* edo@edoliberty.com <ed...@edoliberty.com>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> +Lee
>
> Hi Michael, Thanks for reaching out.
> While you can certainly do that, I recommend using the python-Binded
> datasketches library. It will be more robust, faster, and bug free than my
> code :)
>
> On Tue, May 5, 2020 at 14:11 Michael Himes <mh...@knights.ucf.edu> wrote:
>
> Hi Edo,
>
> I'm currently working on a Python package for machine-learning-accelerated
> exoplanet modeling.  It is free and open source (see here if you're curious
> https://github.com/exosports/HOMER),
> and it's meant purely for reproducible academic research.
>
> I'm adding some new features to the software, and one of them requires
> computing quantiles for a data set that cannot fit into memory.  After
> searching around for different methods to do this, your KLL method seemed
> to be a good option in terms of speed and space requirements.
>
> Rather than reinvent the wheel and code my own implementation of the
> method from scratch, I was wondering if you'd be willing to allow me to use
> your code?  I don't see a license, so I wanted to make sure you're okay
> with this.  I could implement it as a submodule within my repo, or I could
> only include the kll.py file and add some additional comments pointing to
> your repo and such, whichever you prefer.
>
> Best,
> Michael
>
> --
> From my cell phone.
>
>

Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Posted by Michael Himes <mh...@knights.ucf.edu>.
Looked into this a little bit more.  I tried flipping the axes of my Python array, but then the code crashes because it tries to access an out-of-bounds index.

Assuming that I am not just totally botching the usage, I made the following change
https://github.com/mdhimes/incubator-datasketches-cpp/blob/master/python/src/vector_of_kll.cpp#L191
and it works as expected now.  I can create a pull request if needed.

Michael
________________________________
From: Michael Himes <mh...@knights.ucf.edu>
Sent: Thursday, June 25, 2020 2:22 PM
To: dev@datasketches.apache.org <de...@datasketches.apache.org>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Hi Jon,

Just got around to updating my project to point to the current 2.0 version of Datasketches, and I'm hitting an error.

Whenever I pass in multi-dimensional arrays to update(), it complains about the dimensions.  The Numpy array has, e.g., 4 rows of 600 elements (array.shape gives (4, 600)), but I get a ValueError saying that the "input data must have rows with 600 elements. Found: 4".

Looking in the code here https://github.com/apache/incubator-datasketches-cpp/blob/master/python/src/vector_of_kll.cpp#L188
it seems to be expecting an array of shape (N_elements, N_rows), rather than (N_rows, N_elements).  Shouldn't the comparison be with items.shape(1) rather than items.shape(0)?  Or am I thinking about this backwards?
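For reference, here is roughly what that row-major convention looks like from the Python side (a minimal sketch only; it assumes the vectorized class is still exposed to Python as kll_floatarray_sketches, as in the snippets further down this thread, and that update() accepts a 2-D array of shape (n_rows, n_elements)):

import numpy as np
from datasketches import kll_floatarray_sketches

k, d = 160, 600                      # d = number of independent streams
kll = kll_floatarray_sketches(k, d)
batch = np.random.randn(4, d)        # 4 rows of 600 elements, shape (4, 600)
kll.update(batch)                    # each row contributes one value to each of the d sketches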

Thanks,
Michael
________________________________
From: Jon Malkin <jo...@gmail.com>
Sent: Monday, June 8, 2020 3:14 PM
To: dev@datasketches.apache.org <de...@datasketches.apache.org>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Got distracted from this by a series of bugs both in our c++ release candidate as well as in another repo. Anyway, finally finished things and created a PR to push this into master. Feel free to comment: https://github.com/apache/incubator-datasketches-cpp/pull/156

  jon

On Wed, May 27, 2020 at 6:25 AM Michael Himes <mh...@knights.ucf.edu>> wrote:
Sounds good to me.

I've been thinking more about merging, and I think selectively merging individual dimensions would probably be unnecessary (if you have 1000 streams, why selectively merge 2 of those?).  But, one thing that I think might be a good idea to implement is to be able to merge all of the sketches into 1 sketch.  I don't need this for my work, but I can imagine an application where there are N streams of the same type of data and this would be useful.  Something to think about in the future.

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Tuesday, May 26, 2020 8:13 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Oh, and for merging, I'll make sure that both objects have the same number of dimensions and then merge things in. Should be fairly straightforward. Not going to support selectively merging individual dimensions, at least for now.


  jon

On Tue, May 26, 2020 at 4:52 PM Jon Malkin <jo...@gmail.com>> wrote:
Thanks for that!

We discussed things a bit on the ASF slack dev channel (datasketches-dev) and we'll go with vector_of_kll_sketches as the c++ object name. Probably something similar in python. So gotta do that, and then clean up unit test names. But it's in pretty good shape so far.

  jon

On Tue, May 26, 2020 at 7:57 AM Michael Himes <mh...@knights.ucf.edu>> wrote:
That's a great motto to code by!

I adapted the existing unit tests for kll_sketch to work for the new kll_sketches class, and everything seems to be working as intended.  Some things are not implemented -- merging, and the normalized_rank_error method (note that there is the get_normalized_rank_error static method) -- and are therefore not tested.  Once they are implemented into the class, then those tests can be added.

I've submitted a pull request, let me know if there are any other tests you'd like before it's considered tested & working.

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Tuesday, May 26, 2020 1:53 AM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

I think it now works for quantiles, rank, pmf, and cdf.

This exercise is a good example of why my colleague operates by the motto that if it isn't tested, it's broken. In very much related news, we need unit tests for this thing, in either C++ or python (probably the latter unless we move it into the core C++ part of the repo).
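A minimal python-side test along those lines might look like this (a sketch only, not part of the repo; it assumes the class name kll_floatarray_sketches used elsewhere in this thread and a get_quantiles() that takes a list of fractions):

import unittest
import numpy as np
from datasketches import kll_floatarray_sketches

class VectorKllTest(unittest.TestCase):
    def test_median_of_uniform(self):
        k, d, n = 160, 10, 2**16
        kll = kll_floatarray_sketches(k, d)
        for _ in range(n):
            kll.update(np.random.rand(d))       # one uniform(0,1) value per stream
        medians = np.array(kll.get_quantiles([0.5]))
        # every stream's estimated median should land near 0.5 (loose tolerance)
        self.assertTrue(np.all(np.abs(medians - 0.5) < 0.05))

if __name__ == '__main__':
    unittest.main()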

  jon

On Mon, May 25, 2020 at 2:06 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Ah gosh, that was silly on my part.

So, I ran the previous code without that silly mistake, then called kll.get_quantiles(0.5) and it threw this error:

TypeError: get_quantiles(): incompatible function arguments. The following argument types are supported:
    1. (self: datasketches.kll_floatarray_sketches, fractions: List[float], isk: numpy.ndarray[int32] = -1) -> array

Invoked with: <datasketches.kll_floatarray_sketches object at 0x7f610ce7de30>, 0.5

I also tried kll.get_quantiles([0.5]) and the Numpy array equivalent, and it throws this error:

ValueError: array has incorrect number of dimensions: 0; expected 1

This error happens even when I do kll.get_quantiles([0.5, 0.7]) or the Numpy array equivalent, even though it has 1 dimension, not 0.

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Monday, May 25, 2020 4:53 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

That's the range() command complaining -- 1e6 is a float, but range wants an int. It worked if I instead changed the line to
for i in range(int(1e6)):

  jon

On Mon, May 25, 2020 at 1:36 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Jon,

Just got around to testing it out.  Maybe I am doing something wrong here, but I can't get the code to work correctly.  Here's the code:

import numpy as np
from datasketches import kll_floatarray_sketches
k = 160
d = 3
kll = kll_floatarray_sketches(k, d)
for i in range(1e6):
  kll.update(np.random.randn(d))

And here's the error:

TypeError: 'float' object cannot be interpreted as an integer

Seems like the inputs have changed, but the inputs in the code look pretty similar.  Can you point out what I'm doing wrong here?

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Friday, May 22, 2020 6:21 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Hi Michael,

My default is to treat an input vector x as a column vector -- the generic quadratic form x^T A x assumes that, for instance. But that might be an engineering thing. Following your approach for now and eventually we can debate whether to transpose the matrix if one dimension matches the number of sketches in the object but not the expected one.

Anyway, I looked more at the docs and see them using unchecked references (after doing a bounds check) so I switched to that, and then I added in a check for c-style vs fortran-style indexing so that I believe it'll have the inner loop over the native dimension. In theory it'll walk linearly through the matrix. That or I got it exactly backwards and am thrashing some cache level, one of the two :D
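One place the layout question is visible from the python side: a transposed numpy array is just a view and is no longer C-contiguous, so a wrapper that walked raw memory would see values in the wrong order. A cautious caller can normalize the layout before updating (sketch only; np.ascontiguousarray is standard numpy, and the commented update() call assumes the vectorized class discussed in this thread):

import numpy as np

data = np.random.randn(600, 4).T      # shape (4, 600), but a transposed view
print(data.flags['C_CONTIGUOUS'])     # False -- a row-major walk would be wrong
safe = np.ascontiguousarray(data)     # copies into row-major (C-style) layout
print(safe.flags['C_CONTIGUOUS'])     # True
# kll.update(safe)                    # then feed the contiguous copy to the sketches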

If you have some time, please check out the branch and play with it for a bit to ensure it's still behaving as you expect. Then we can figure out some relevant unit tests.

  jon



On Fri, May 22, 2020 at 7:06 AM Michael Himes <mh...@knights.ucf.edu>> wrote:
Jon,

Those changes sound great, as long as the data is being accessed correctly. The pybind docs warn about accessing data through the array_t object since it's not guaranteed to be contiguous in memory.  Typically, they demonstrate accessing it through the buffer, which I followed.  But if this is an unnecessary step, then great.

As for the 2D case, here is my line of thinking.  For 1D, we have a single row with d values.  So for 2D, we'd have n rows with d values, (n x d).  I believe that is how I coded it, but it's possible I flipped the dimensions.

Michael

________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Thursday, May 21, 2020 7:17 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Michael,

I've restructured the object to be an actual C++ object with proper methods. And then I've gotten rid of all the casts to buffer in favor of just using the py::array_t<> that's passed in. That removes casting everything to double, and allows for range checks. Now an attempt to access sketch 7 in a 5-d array doesn't just segfault :)

Looking at pybind docs a bit more, it seems there are no hard guarantees on data layout in memory with numpy arrays -- if you transpose one, walking through with a pointer will return items in the wrong order. So update() ends up using items.at() instead (more on that in a moment). The whole thing is probably also copying values around more than necessary. Anyway, we can look at ways to optimize such things eventually, but for now I'm working on ensuring correctness and at least somewhat graceful failure.

Anyway, item input order. If we have 1-d input, we implicitly assume we want d updates, one for each dimension in the object. It seems like the default for numpy is row-major order, which makes sense given C beneath the hood. But for inputting n points at a time, do you expect the matrix to be (d x n) or (n x d)?

  jon

On Tue, May 19, 2020, 5:20 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Re: the template type A, I set that for the Python array data type.  A Python float is 64 bits, so that is a C++ double.  I thought it was necessary to set the py::array_t data type since I think it's a template, but I could be mistaken.

Michael

________________________________
From: leerho <le...@gmail.com>>
Sent: Tuesday, May 19, 2020 7:46 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Excellent work!

On Tue, May 19, 2020 at 4:04 PM Jon Malkin <jo...@gmail.com>> wrote:
I also used k=160, so in this case we matched nicely. And the bunches of 2^5 or 2^7 you were testing is exactly what I meant when referring to batched inputs. So that's good news.

I'll take a more careful look through the code -- there was something with update using arrays of templated type A which was always cast to double, for instance. But this is certainly promising.

  jon

On Tue, May 19, 2020 at 3:32 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Great tests (especially with the ordering), Jon!

I did some scaling tests for dimensionality (1, 10, and 100), and this is where I think the Numpy version shows its benefits.  I performed a test similar to your setup:
- each sketch has k = 160 (unsure what you used for this value, if it matters)
- 2^25 draws from a normalized Gaussian distribution (numpy.random.normal)
- get_quantiles(0.5)

d=1    -- 84 s (this is the 123 s case you ran)
d=10   -- 88 s
d=100  --  294 s
d=1000 -- 2298 s (did this one for fun, but there is a lot of variability in runtime)

Note that I did not use a single-value method, just the Numpy version.  Also, I checked the compute cost of the Python loop, and it's about 1 second, so most of that ~80 seconds is the communication between Python and C++.  The scaling relation looks to be better than linear, but there need to be a few more tests here to really determine that.

But, as Lee pointed out, there is non-negligible overhead from crossing the bridge between Python and C++.  It's small, but when doing it 2^25 times it adds up.  The Numpy implementation allows you to cross that bridge much less often, albeit at the cost of some extra time programming that part.  If I set up a queue that holds 2^5 values and then updates it, it's quite a bit better.  Here are the results for the same dimensions as before:

d=1   -- 8 s
d=10  -- 31 s
d=100 -- 257 s

So, even with a small queue of 32 values, we see that a single sketch using kll_sketches is faster than a kll_sketch by a factor of 2-3.  And with the batch set to 2^7 values (this is how I use it in my project):
d=1   -- 4.2 s
d=10  -- 27 s
d=100 -- 251 s

The speed gain doesn't seem to scale with dimensionality, but I think that has more to do with the compute overhead of generating the data since Numpy tends to be faster when working in 1D vs multiple dimensions.  But we can see that it's possible to get runtimes much closer to C++ runtimes than would be expected.
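The queueing mentioned above amounts to buffering rows and crossing the python/C++ boundary once per batch instead of once per vector. A minimal sketch of that pattern (assuming, as in the other snippets in this thread, a vectorized class kll_floatarray_sketches whose update() accepts an (n x d) array):

import numpy as np
from datasketches import kll_floatarray_sketches

k, d, batch = 160, 100, 2**7
kll = kll_floatarray_sketches(k, d)
buf = np.empty((batch, d))
filled = 0
for _ in range(2**20):                 # the incoming stream of vectors
    buf[filled] = np.random.randn(d)   # stage one vector in the buffer
    filled += 1
    if filled == batch:
        kll.update(buf)                # one bridge crossing per 2^7 vectors
        filled = 0
if filled:
    kll.update(buf[:filled])           # flush any partial batch at the end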

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Tuesday, May 19, 2020 4:58 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Well, one thought was maybe we could always use the vectorized kll in python and make it (relatively) easy to have it work with only 1 dimension. It looks like there's still a non-trivial performance hit from that. But wow.. I realized I could try something simple like reversing the declaration order of single-update vs vector-update in the wrapper class. And that dropped it to 35s!

With that, it may be worth exploring a unified wrapper that handles single items or vectors.

  jon

On Tue, May 19, 2020 at 1:52 PM leerho <le...@gmail.com>> wrote:
We had a similar issue in Java trying to use JNI to access C code.  Every transition across the "boundary" between Java and C took from 10 to 100 microseconds.  This made the JNI option pretty useless from our standpoint.

I don't know python that well, but I could well imagine that there may be a similar issue here in moving data between Python and C++.

That being said, compared to brute-force computation of these types of queries in Python vs using even these (what we consider slow performing) sketches in Python still may be a huge win.

Lee.



On Tue, May 19, 2020 at 1:28 PM Jon Malkin <jo...@gmail.com>> wrote:
I tried comparing the performance of the existing floats sketch vs the new thing with a single dimension. And then I made a second update method that handles a single item rather than creating an array of length 1 each time. Otherwise, the scripts were as identical as possible. I fed in 2^25 gaussian-distributed values and queried for the median to force some computation on the sketch. I think get_quantile(0.5) vs get_quantiles(0.5)[0][0] was the only difference.

Existing kll_floats_sketch: 31s
kll_floatarray_sketches: 123s
with single-item update: 80s

Same test in c++: 1.7s  (I can get it to 1.4s but that's using a worse RNG so this seemed more fair)

I didn't try anything with trying to batch updates, even though in theory the new object can support that. This was more a test to see the performance impact of using it for all kll sketches.

At some level, if you're already ok taking the speed hit for python vs C++ then maybe it doesn't matter. But >2x still seems significant to me.
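For anyone wanting to reproduce the comparison, the shape of the test is roughly the following (a sketch, not the exact script; it assumes the existing kll_floats_sketch plus the vectorized kll_floatarray_sketches with a single dimension, and forces a query via get_quantile(0.5) / get_quantiles([0.5])):

import time
import numpy as np
from datasketches import kll_floats_sketch, kll_floatarray_sketches

n, k = 2**25, 160
values = np.random.randn(n)            # same inputs for both variants

start = time.time()
single = kll_floats_sketch(k)
for v in values:
    single.update(float(v))
median_single = single.get_quantile(0.5)
print('kll_floats_sketch:', time.time() - start, 's')

start = time.time()
vec = kll_floatarray_sketches(k, 1)
for v in values:
    vec.update(np.array([v]))          # a length-1 array per update
median_vec = vec.get_quantiles([0.5])
print('kll_floatarray_sketches:', time.time() - start, 's')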

  jon

On Thu, May 14, 2020 at 6:54 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Great, I'll be submitting the pull request shortly.  The codebase I'm working with doesn't have any of the changes made in the past week or so, hopefully that isn't too much of a hassle to merge.

As an aside, my employer encourages us to contribute code to libraries like this, so I'm happy to work on additional features for the Python interface as needed.

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Thursday, May 14, 2020 6:56 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

We've been polishing things up for a release, so that was one of several things that we fixed over the last several days. Thank you for finding it!

Anyway, if you're generally happy with the state of things (and are allowed to under any employment terms), I'd encourage you to create pull request to merge your changes into the main repo. It doesn't need to be perfect as we can always make changes as part of the PR review or post-merge.

Thanks,
  jon


On Mon, May 11, 2020 at 2:25 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Thanks for taking a look, Jon.

I pushed an update that addresses 2 & 4.

#3 is actually something I had a question about. I've tested passing numpy.nan into the update function, and it doesn't appear to break anything (min, max, etc all still work correctly).  However, the reported number of items per sketch counts the nan entries.  Is this the expected behavior, or should the get_n() method return a number that does not count the nans it has seen?  I expected the latter, so I'm worried that numpy's nan is being treated differently.
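A tiny probe of the behavior in question (sketch only; it uses the single-sketch kll_floats_sketch, whose update(), get_n(), and min/max getters appear elsewhere in this thread, so the same check can be repeated against the vectorized class):

import numpy as np
from datasketches import kll_floats_sketch

kll = kll_floats_sketch(160)
for v in [1.0, 2.0, np.nan, 3.0]:
    kll.update(v)
print(kll.get_n())                               # 3 if NaNs are ignored, 4 if they are counted
print(kll.get_min_value(), kll.get_max_value())  # should be 1.0 and 3.0 either way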

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Monday, May 11, 2020 4:32 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

I didn't look in super close detail, but the code overall looks pretty good. Comments are below.

Note that not all of these necessarily need changes or replies. I'm just trying to document things we'll want to think about for keeping the library general-purpose (and we can always make changes after merging, of course).

1. I worry the name kll_sketches is confusingly similar to kll_sketch. Maybe vector_kll_sketches? But if there's a way to extend KLL in the future to operate on an entire vector at a time (vs treating each dimension independently) that'd become confusing. I think an inherently vectorized version would be a very different beast, but I always worry I'm not being imaginative enough. If merging into the Apache codebase, I'd probably wait to see what the file looks like with the renaming before a final decision on moving to its own file.

2. What happens if the input to update() has >2 dimensions? If that'd be invalid, we should explicitly check and complain. If it'll Do The Right Thing by operating on the first 2 dimensions (meaning correct indices) that's fine, but otherwise should probably complain.

3. Can this handle sparse input vectors? Not sure how important that is in general, even if your project doesn't require it. kll_sketch will ignore NaNs, so those appearing would mean the number of items per sketch can already differ.

4. I'd probably eat the very slightly increased space and go with 32 bits for the number of dimensions (aka number of sketches). If trying to look at a distribution of values for some machine learning application, it'd be easy to overflow 65k dimensions for some tasks.

5. I imagine you've realized that it's easiest to do unit tests from python in this case. That's another advantage of having this live in the wrapper.

6. Finally, that assert issue is already obsolete :). Asserts were converted to if/throw exceptions late last week. It'll be flagged as a conflict in merging, so no worries for now.

Looking good at this point. And as I said, not all of these need changes or comments from you.

  jon

On Mon, May 11, 2020 at 7:09 AM Michael Himes <mh...@knights.ucf.edu>> wrote:
Understood, I went ahead and moved the new class to the kll_wrapper.cpp file -- I'll leave it to you to decide if it's better as its own file.

Also, while gcc 7.4.0 compiles the code without issue, using gcc 7.5.0 throws errors regarding the assert calls in kll_sketch_impl.hpp.  I added an include of assert.h there and then it compiled without issue.  It's possible that other compilers will also complain about that, so maybe this is a good update to the main branch.

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Sunday, May 10, 2020 10:47 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

My only comment without having looked at actual code is that the new class would be more appropriate in the python wrapper. Maybe even drop it in as its own file, as that would decrease recompile time a bit when debugging (that's pybind's suggestion, anyway). Probably not a huge difference with how light these wrappers are.

If this is something that becomes widely used, to where we look at pushing it into the base library, we'd look at whether we could share any data across sketches. But we're far from that point currently. It'd be a nice problem to have.

  jon

On Sun, May 10, 2020, 7:33 PM leerho <le...@gmail.com>> wrote:
Michael,  this has been a great interchange and certainly will allow us to move forward more quickly.

Thank you for working on this on a Mother's Day Sunday!

I'm sure Alex and Jon may have more questions, when they get a chance to look at it starting tomorrow.

Cheers, and be safe and well!

Lee.

On Sun, May 10, 2020 at 6:25 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Re: testing, so far I've just done glorified unit tests for uniform and normal distributions of varying sizes.  I plan to do some timing tests vs the existing single-sketch Python class to see how it compares for 1, 10, and 100 streams.

1. That makes sense.  One option to allow full Numpy compatibility but without requiring a Python user to use Numpy would be to return everything as lists, rather than Numpy arrays.  Numpy users could then convert those lists into arrays, and non-Numpy users would be unaffected (aside from needing the pybind11/numpy.h header).  Alternatively, some flag could be set when instantiating the object that would control whether things are returned as lists or arrays, though this still requires the numpy.h header file.

2. I didn't change the kll_sketch code, I only defined a new (wrapper) class called kll_sketches, which spawns a user-specified number of sketches.  Each of those sketches is a kll_sketch object and uses all of the existing code for that.  For fast execution in Python, the parallel sketches must be spawned in C++, but the existing Python object could only spawn a single sketch since it wraps the kll_sketch class.  Perhaps the kll_sketches class would be better placed in the python/src/kll_wrapper.cpp file?  I suppose you wouldn't need this class if you weren't using Python.

3. Yes, SerDe is very straight-forward here.  I've marked some stuff as todo's, and that is one of them -- the plan is to do like you described and call the relevant kll_sketch method on each of the sketches and return that to Python in a sensible format.  For deserialization, it would just iterate through them and load them into the kll_sketches object.  I don't require it for my project, so I didn't bother to wrap that yet -- I'll take a look sometime this week after I finish my work for the day, shouldn't take long to do.

4. That makes sense.  Does using Numpy complicate that at all?  My thought is that since under the hood everything is using the existing kll_sketch class, it would have full compatibility with the rest of the library (once SerDe is added in).
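Re: point 3 above, the per-sketch SerDe loop would look something like this on the python side (purely illustrative; it assumes the single-sketch wrapper exposes serialize()/deserialize(), and how the resulting blobs get packaged into one artifact is left open, as in the discussion):

import numpy as np
from datasketches import kll_floats_sketch

d = 4
sketches = [kll_floats_sketch(160) for _ in range(d)]
for _ in range(1000):
    row = np.random.randn(d)
    for s, x in zip(sketches, row):      # one value per dimension
        s.update(float(x))

blobs = [s.serialize() for s in sketches]                    # one byte blob per dimension
restored = [kll_floats_sketch.deserialize(b) for b in blobs]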

Michael
________________________________
From: leerho <le...@gmail.com>>
Sent: Sunday, May 10, 2020 8:42 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Thanks for the link to your code.  My colleagues, Jon and Alex, will take a closer look this next week.  They wrote this code so they are much closer to it than I.

What you have done so far makes sense for you as you want to get this working in the NumPy environment as quickly as possible.  As soon as we start thinking about incorporating this into our library other concerns become important.

1. Adding API calls is the recommended way to add functionality (like NumPy) to a library.  We cannot change API calls in a way that is only useful with NumPy, because it would seriously impact other users of the library that don't need NumPy.  If both sets of calls cannot simultaneously exist in the same sketch API, then we need to consider other alternatives.

2.  Based on our previous discussions, I didn't envision that you would have to change the kll_sketch code itself other than perhaps a "wrapper" class that enables vectorized input to a vector of sketches and a vectorized get result that creates a vector result from a vector of sketches.  This would isolate the changes you need for NumPy from the sketch itself.  This is also much easier to support, maintain and debug.

3. If you don't change the internals of the sketch then SerDe becomes pretty straightforward. I don't know if you need a single serialization that represents a full vector of sketches,  but if you do, then I would just iterate over the individual serdes and figure out how to package it.  I really don't think you want to have to rewrite this low-level stuff.

4. Binary compatibility is critically important for us and I think will be important for you as well.  There are two dimensions of binary compatibility: history and language.  This means that a kll sketch serialized from Java can be successfully read by C++ and vice versa.  Similarly, a kll sketch serialized today will be able to be read many years from now.  Another aspect of this would mean being able to collect, say, 100 sketches that were not created using the NumPy version, and being able to put them together in a NumPy vector; and vice versa.

I hope all of this make sense to you.

Cheers,

Lee.



On Sun, May 10, 2020 at 4:21 PM leerho <le...@gmail.com>> wrote:
Michael,
This is great!  What testing have you been able to do so far?


On Sun, May 10, 2020 at 3:31 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Lee,

Thanks for all of that information, it's quite helpful to get a better understanding of things.

I've put the code on Github if you'd like to take a look: https://github.com/mdhimes/incubator-datasketches-cpp

Changes are
- new class in kll/include/kll_sketch.hpp, w/ associated constructor in kll/include/kll_sketch_impl.hpp.  This class spawns a specified number of sketches.
- new Python interface functions in python/src/kll_wrapper.cpp

The only new dependency introduced is the pybind11/numpy.h header file.  The new Numpy-compatible Python classes retain identical functionality to the existing classes (with minor changes to method names, e.g., get_min_value --> get_min_values), except that I have not yet implemented merging or (de)serialization.  These would be straight-forward to implement, if needed.

Re: characterization tests, I'll take a look at those tests you linked to and see about running them, time and compute resources permitting.

Michael
________________________________
From: leerho <le...@gmail.com>>
Sent: Sunday, May 10, 2020 5:32 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Michael,

Is there a place on GitHub somewhere where I could look at your code so far?  The reason I ask, is before you do a PR, we would like to determine where a contribution such as this should be placed.

Our library is split up among different repositories, determined by language and dependencies.  This keeps the user downloads smaller and more focused.   We have two library repos for the core sketch algorithms, one for Java and one for C++/Python, where the dependencies are very lean, which simplifies integration into other systems.  We have separate repos for adaptors, which depend on one of the core repos. On the Java side, we have separate repos for adaptors for Apache Hive and Apache Pig, as the dependencies for each of these are quite large.  For C++, we have a dedicated repo for the adaptors for PostgreSQL.

Some of our adaptors are hosted with the target system.  For example, our Druid adaptors were contributed directly into Apache Druid.

I assume your code has dependencies on Python, NumPy and DataSketches-cpp. It is not clear to me at the moment whether we should create a separate repo for this or have a separate group of directories in our cpp repo.

****
We have a separate repo for our characterization code, which is not formally "released" as an Apache release.  It exists because we want others to be able to reproduce (or challenge) our claims of speed performance or accuracy.  It is the one repo where we have all languages and many different dependencies.  The coding style is not as rigorous or as well documented as our repos that do have formal releases.

Characterization testing is distinctly different from Unit Tests, which basically checks all the main code paths and makes sure that the program works as it should.  The key metric is code coverage and Unit Tests should be fast as it is run on every check-in of new code.  Characterization is also different from Integration Testing, which is testing how well the code works when integrated into larger systems.

Characterization tests are unique to our kind of library. Because our algorithms are probabilistic in nature, in order to verify accuracy or speed performance we need to run many thousands of trials to eliminate statistical noise in the results.  And when the data is large, this can take a long time.  You can peruse our website for many examples as all the plots result from various characterization studies.  What appears on the website is but a small fraction of all the testing we have done.

There are no "standard" tests as every sketch is different so we have to decide what is important to measure for a particular sketch, but the basic groups are speed and accuracy.

For speed there are many possible measurements, but the basic ones are update speed, merge speed, Serialization / Deserialization speed, get estimate or get result speeds.

For accuracy we want to validate that the sketch is performing within the bounds of the theoretical error distribution.  We want to measure this accuracy in the context of a stand-alone, purely streaming sketch and also in the context of merging many sketches together.

We also try to do these same tests comparing the results against other alternatives users might have.  We have performed these same characterizations on other publically available sketches as well as against traditional, brute-force approaches to solving the same problem.

For the solution you have developed, we would depend on you to decide what properties would be most important to characterize for users of this solution.  It should be very similar to what you would write in a paper describing this solution;  you want to convince the reader that this is very useful and why.

Since the first sketch you have leveraged is the KLL quantiles sketch, I would think you would want some characterizations similar to what we did for our studies <https://datasketches.apache.org/docs/Quantiles/KLLSketch.html> comparing our older quantiles sketch and the KLL sketch.

****
For the Java characterization tests, we have "standardized" on having small configuration files which define the key parameters of the test.  These are simple text files <https://github.com/apache/incubator-datasketches-characterization/tree/master/src/main/resources> of key-value pairs.  We don't have any centralized definition of these pairs, just that they are human readable and intelligible.  They are different for each type of sketch.

For the C++ tests, we don't have a collection of config files yet (this is one of our TODOs), but the same kind of parameters are set in the code itself.

We will likely want to set up a separate directory for your characterization tests.

I hope you find this helpful.

Cheers,

Lee.

On Sun, May 10, 2020 at 10:05 AM Michael Himes <mh...@knights.ucf.edu>> wrote:
The code is in a good state now.  It can take individual values, lists, or Numpy arrays as input, and it returns back Numpy arrays.  There are some additional features, like being able to specify which sketches the user wants to, e.g., get quantiles for.

But, I have only done minor testing with uniform and normal distributions.  I'd like to put it through more extensive testing (and some documentation) before releasing it, and it sounds like your characterization tests are the way to go -- it's not science if it's not reproducible!  Is there a standard set of tests for this purpose?  If not, are there standard tests that have been used for the existing codebase?

Michael
________________________________
From: leerho <le...@gmail.com>>
Sent: Saturday, May 9, 2020 7:21 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

This is great.  The first step is to get your project working!  Once you think you are ready, it would be really useful if you could do some characterization testing in the NumPy environment. Characterization tests are what we run to fully understand how a sketch performs over a range of parameters and using thousands to millions of trials.  You can see some of the accuracy and speed performance plots of various sketches on our website.  Sometimes these can take hours to run.  We typically use synthetic data to drive our characterization tests to make them reproducible.

Real data can also be used and one comparison test I would recommend is comparing how long it takes to get approximate results using sketches versus how long it would take to get exact results using brute force methods.  The bigger the data set is the better :)

We don't have much experience with NumPy so this will be a new environment for us.  But before you get too deep into this please get us involved.  We have been characterizing these streaming algorithms for a number of years, and would like to help you.

Cheers,

Lee.

On Sat, May 9, 2020 at 2:18 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
I'm not quite sure what being a committer entails, but yeah I'm happy to contribute.  I can't commit a lot of time to working on it, but with how things went for KLL I don't think it will take a lot of time for the other sketches if they are formatted in a similar manner.  Getting this library integrated into numpy/scipy would be awesome, I'm sure I could get some others in my field to begin using it.

Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Saturday, May 9, 2020 5:06 PM
To: Michael Himes <mh...@knights.ucf.edu>>; dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

This is just awesome!   Would you be interested in becoming a committer on our project?  It is not automatic, but we could work with you to bring you up to speed on the other sketches in the library.  If you could help us integrate DataSketches into NumPy and possibly SciPy (not sure if this is necessary) it would be a very significant contribution and we would definitely want you to be part of our community!

Thanks,

Lee.

On Sat, May 9, 2020 at 1:41 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Lee,

Thanks for the notice, I went ahead and subscribed to the list.

As for Jon's email, this is actually what I have currently implemented!  Once I finish ironing out a couple improvements, I'm going to move some code around to follow the existing coding style, put it on Github, and submit a pull request.

Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Saturday, May 9, 2020 4:22 PM
To: Michael Himes <mh...@knights.ucf.edu>>
Subject: Fwd: Permission to use KLL streaming-quantiles code in free open-source academic software

Hi Michael,
I don't think you saw this email as I doubt you are subscribed to our dev@datasketches.apache.org<ma...@datasketches.apache.org> email list.

We would like to have you as part of our larger community, as others might also have suggestions on how to move your project forward.
You can subscribe by sending an empty email to dev-subscribe@datasketches.apache.org<ma...@datasketches.apache.org>.

Lee.

---------- Forwarded message ---------
From: Jon Malkin <jo...@gmail.com>>
Date: Thu, May 7, 2020 at 4:11 PM
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software
To: <de...@datasketches.apache.org>>
Cc: Lee Rhodes <lr...@verizonmedia.com>>, Edo Liberty <ed...@gmail.com>>, edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>


We're using pybind11 to get a C++ interface with python (vs raw C). The wrappers themselves are quite thin, but they do have examples of calling functions defined in the wrapper as opposed to only the sketch object.

I believe the easiest way to do this will be to define a pretty simple C++ object and create a pybind wrapper for it.  That object would contain a std::vector<kll_sketch>.  Then you'd define an update method for your custom object that iterates through a numpy array and calls update() on the appropriate sketch. You'd also want to define something similar for get_quantile() or whatever other methods you need that iterates through that vector of sketches and returns the result in a numpy array.

That's a pretty lightweight object. And then you'd use a similar thin pybind wrapper around it to make it play nicely with python. Since our C++ library is just templates, you'd end up with a free-standing library, with no requirement that the base datasketches library be involved.
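
Purely to illustrate the shape of that object (the real one would be C++ wrapped with pybind11, as described above; the class name here is made up), a pure-Python analogue would be:

import numpy as np
from datasketches import kll_floats_sketch

class VectorOfKllSketches:                  # illustrative name only
    def __init__(self, k, d):
        self.sketches = [kll_floats_sketch(k) for _ in range(d)]

    def update(self, vector):
        # vector: 1-D numpy array, one value per dimension
        for sketch, value in zip(self.sketches, vector):
            sketch.update(float(value))

    def get_quantiles(self, fraction):
        # one quantile per dimension, returned as a numpy array
        return np.array([s.get_quantile(fraction) for s in self.sketches])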

  jon

On Thu, May 7, 2020 at 1:08 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
I would be happy to share whatever I come up with (if anything).  The lack of a Numpy/Scipy implementation is what led me to the DataSketches library, it would be very useful to myself and others if it were a part of Numpy/Scipy.

For what it's worth, passing in a Numpy array and manipulating it from the C++ side is quite easy.  On the other hand, figuring out how to spawn m sketches and pass the values along to that looks like it'll be more challenging; there is a lot of code here, and it'll take some time for me to familiarize myself with it.

Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Thursday, May 7, 2020 12:00 PM
To: Michael Himes <mh...@knights.ucf.edu>>
Cc: Edo Liberty <ed...@gmail.com>>; dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>; edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

If you do figure out how to do this, it would be great if you could share it with us.  We would like to extend it to other sketches and submit it as an added functionality to NumPy.  I have been looking at the NumPy and SciPy libraries and have not found anything close to what we have.

Lee.


On Thu, May 7, 2020 at 7:08 AM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Lee, Jon,

Thanks for the information.  I tried to vectorize things this morning and ran into that exact problem -- since the offsets can differ, it leads to slices of different lengths, which wouldn't be possible to store as a single Numpy array.

Lee, your understanding of my problem is spot on.  n vectors of size m, where all m elements of each vector are a float (no NaNs or missing values).  I am interested in quantiles at rank r for each of the m streams.  Only 1 sketch will operate simultaneously, saving/loading the sketch is not required (though it would be a nice feature), and sketches would not need to be merged (no serialization/deserialization).

Not surprisingly, it looks like your original suggestion of handling this on the C++ side is the way to go.  Once I have time to dive into the code, my plan is to write something that implements what you described in the earlier email.

Thanks,
Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Wednesday, May 6, 2020 10:43 PM
To: Michael Himes <mh...@knights.ucf.edu>>
Cc: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>; Edo Liberty <ed...@gmail.com>>; edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>

Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Michael,

One of my colleagues, Jon Malkin, pointed out that the vector-KLL will not work for another reason and that is for each dimension, choosing whether to delete the odd or even values in the compactor must be random and independent of the other dimensions.  Otherwise you might get unwanted correlation effects between the dimensions.

This is another argument that you should have independent compactors for each dimension.  So you might as well stick with individual sketches for each dimension.

Lee.

On Wed, May 6, 2020 at 4:39 PM Lee Rhodes <lr...@verizonmedia.com>> wrote:
Michael,

Allow me to back up for a moment to make sure I understand your problem.

You have a large number of large vectors of the form V_n = {x_i}:  n vectors of size m, where x is a number and x_i is the ith element, or equivalently, the ith dimension.

Assumptions:

  *   All vectors, V, are of the same size m.
  *   All elements, x_i, are valid numbers of the same type. No missing values, and if you are using floats, this means no NaNs.

In aggregate, the n vectors represent m independent distributions of values.

Your task is to be able to obtain m quantiles at rank r in a single query.

****
To do this, using your idea, would require vectorization of the entire sketch and not just the compactors.  The inputs are vectors, the result of operations such as getQuantile(r), getQuantileUpperBound(r), getQuantileLowerBound(r), are also vectors.

This sketch will be a large data structure, which leads to more questions ...

  *   Do you anticipate having many of these vectorized sketches operating simultaneously?
  *   Is there any requirement to store and later retrieve this sketch?
  *   Or, the nearly equivalent question: Do you require merging of these sketches (across clusters, for example)?  Which also means serialization and deserialization.

I am concerned that this vector-quantiles sketch would be limited in the sense that it may not be as widely applicable as it could be.

Our experience with real data is that it is ugly with missing values, NaN, nulls, etc.  Which means we would not be able to vectorize the compactor.  Each dimension i would need a separate independent compactor because the compaction times will vary depending on missing values or NaNs in the data.

Spacewise, I don't think having separate independent sketches for each dimension would be much smaller than vectorizing the entire sketch, because the internals of the existing sketch are already quite space efficient leveraging compact arrays, etc.

As a first step I would favor figuring out how to access the NumPy data structure on the C++ side, having individual sketches for each dimension, and doing the iterations updating the sketches in C++.   It also has the advantage of leveraging code that exists and it would automatically be able to leverage any improvements to the sketch code over time.  In addition, it could be a prototype of how to integrate other sketches into the NumPy ecosystem.

A fully vectorized sketch would be a separate implementation and would not be able to take advantage of these points.

Lee.

On Wed, May 6, 2020 at 2:47 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Lee,

I don't think there is a problem with the DataSketches library, just that it doesn't support what I am trying to do -- looking in the documentation, it only supports streams of ints or floats, and those situations work fine for me.  Here's what I did:
- began with the KLL test .py file: https://github.com/apache/incubator-datasketches-cpp/blob/master/python/tests/kll_test.py
- replaced line 30 with kll.update(np.ones(10) * randn())  to have a Numpy array of 10 identical values.
- ran the code

This leads to the following error, as expected:
TypeError: update(): incompatible function arguments. The following argument types are supported:
    1. (self: datasketches.kll_floats_sketch, item: float) -> None

Invoked with: <datasketches.kll_floats_sketch object at 0x7f1e128989d0>, array([-1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424,
       -1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424])

It's not coded to support Numpy arrays, therefore it complains.  What I would ideally like to have happen in this scenario is it would treat each element in the array as a separate stream.  Then, later when getting a given quantile, it would give 10 values, one for each stream.  I don't see an easy approach to implementing this on the Python side besides a very slow iterative approach, and admittedly my C++ is quite rusty so I haven't looked into the codebase to see how I might modify things there to support this functionality.

Re: the streaming-quantiles code being easily modified, I believe the only necessary changes would be changing the Compactor class to be a subclass of numpy.ndarray, rather than list, and implementing methods for the list-specific methods that are used, like .append().  Then, it isn't necessary to loop over the streams since we can make use of Numpy's broadcasting, which will handle the looping in its C++ code, as you mentioned.  I'll work on this and see if it really is as straight-forward as it seems.

If you have any advice on how to use DataSketches for my problem, I'm certainly open to that.

Thanks,
Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Wednesday, May 6, 2020 4:37 PM
To: Michael Himes <mh...@knights.ucf.edu>>; dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Cc: Edo Liberty <ed...@gmail.com>>; edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Michael,

Thank you for considering the DataSketches library.   I am adding this thread to our dev@datasketches.apache.org<ma...@datasketches.apache.org> so that our whole team can contribute to finding a solution for you.

WRT the error you experienced, please help us help you by sharing with us what the exact error was.

We are about to release a major upgrade to the DataSketches C++/Python product in the next few weeks.  We have fixed a number of stability issues and bugs, which may solve the problem.  Nonetheless, we want to work with you to get your problem solved.

Updating 1e5 sketches in a system is not a problem in Java or C++.   We have real-time systems today that generate and process over 1e9 sketches every day.  Unfortunately our experience tells us that looping in Python code will be 10 to 100 times slower than Java or C++.  This is because the code would have to switch from Python to C++ for every vector element.

By comparison, the streaming-quantiles code could be easily modified to use Numpy arrays and operate on vectors.

I would like to understand more about what you have in mind that would be "easily modified".

NumPy achieves its speed performance by doing all of the matrix operations in pre-compiled C++ code.  To achieve best performance, we would want to read and loop through the NumPy data structure on the C++ side leveraging the C++ DataSketches library directly.  I am not sure what would be involved to actually accomplish that.

But first we need to get your Python + NumPy code working correctly with our library so we can find out what its actual performance is.

Cheers,

Lee.





On Wed, May 6, 2020 at 12:10 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Edo, Lee,

Thanks for the prompt response.  I looked at the datasketches library, and while it seems to have a lot more features, it looks like it'll be a lot more difficult to get it to work for my desired use case.

My problem is that I need quantiles for each element of a vector (length on the order of 1e4 -- 1e5), for some finite stream of vectors (on the order of 1e6 -- 1e8).  I tried using datasketches's KLL with Numpy arrays, but it throws an error, so it doesn't seem like datasketches handles this situation currently.

To use datasketches, I think I would need to instantiate 1 object per vector element, and I suspect this will slow things down considerably due to iterating over the objects when each vector is processed.  By comparison, the streaming-quantiles code could be easily modified to use Numpy arrays and operate on vectors.  I ran a few unit tests on both codes and found equivalent behavior, as expected.

Do you have any recommendation(s) for this situation?  Are there known limitations of the streaming-quantiles code that would cause issues for my use case?  Are the other methods offered in datasketches 'better' than the KLL implemented in streaming-quantiles?  I'm quite out of my area of expertise, so I appreciate any advice you can offer, and I will of course acknowledge it in the publication.

Best,
Michael

________________________________
From: Edo Liberty <ed...@gmail.com>>
Sent: Tuesday, May 5, 2020 8:09 PM
To: Lee Rhodes <lr...@verizonmedia.com>>; Michael Himes <mh...@knights.ucf.edu>>
Cc: edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

+Lee

Hi Michael, Thanks for reaching out.
While you can certainly do that, I recommend using the Python-bound datasketches library. It will be more robust, faster, and more bug-free than my code :)

On Tue, May 5, 2020 at 14:11 Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Edo,

I'm currently working on a Python package for machine-learning-accelerated exoplanet modeling.  It is free and open source (see here if you're curious https://github.com/exosports/HOMER), and it's meant purely for reproducible academic research.

I'm adding some new features to the software, and one of them requires computing quantiles for a data set that cannot fit into memory.  After searching around for different methods to do this, your KLL method seemed to be a good option in terms of speed and space requirements.

Rather than reinvent the wheel and code my own implementation of the method from scratch, I was wondering if you'd be willing to allow me to use your code?  I don't see a license, so I wanted to make sure you're okay with this.  I could implement it as a submodule within my repo, or I could only include the kll.py file and add some additional comments pointing to your repo and such, whichever you prefer.

Best,
Michael
--
From my cell phone.

Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Posted by Michael Himes <mh...@knights.ucf.edu>.
Hi Jon,

Just got around to updating my project to point to the current 2.0 version of Datasketches, and I'm hitting an error.

Whenever I pass in multi-dimensional arrays to update(), it complains about the dimensions.  The Numpy array has, e.g., 4 rows of 600 elements (array.shape gives (4, 600)), but I get a ValueError saying that the "input data must have rows with 600 elements. Found: 4".

Looking in the code here https://github.com/apache/incubator-datasketches-cpp/blob/master/python/src/vector_of_kll.cpp#L188
it seems to be expecting an array of shape (N_elements, N_rows), rather than (N_rows, N_elements).  Shouldn't the comparison be with items.shape(1) rather than items.shape(0)?  Or am I thinking about this backwards?
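
For reference, a minimal reproduction (I'm guessing the released Python class name as vector_of_kll_floats_sketches based on the vector_of_kll.cpp file above; adjust if it differs):

import numpy as np
from datasketches import vector_of_kll_floats_sketches

d = 600
sk = vector_of_kll_floats_sketches(200, d)   # k = 200, 600 sketches
data = np.random.randn(4, d)                 # 4 rows of 600 elements, shape (4, 600)
sk.update(data)                              # raises the ValueError quoted above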

Thanks,
Michael
________________________________
From: Jon Malkin <jo...@gmail.com>
Sent: Monday, June 8, 2020 3:14 PM
To: dev@datasketches.apache.org <de...@datasketches.apache.org>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Got distracted from this by a series of bugs both in our c++ release candidate as well as in another repo. Anyway, finally finished things and created a PR to push this into master. Feel free to comment: https://github.com/apache/incubator-datasketches-cpp/pull/156

  jon

On Wed, May 27, 2020 at 6:25 AM Michael Himes <mh...@knights.ucf.edu>> wrote:
Sounds good to me.

I've been thinking more about merging, and I think selectively merging individual dimensions would probably be unnecessary (if you have 1000 streams, why selectively merge 2 of those?).  But, one thing that I think might be a good idea to implement is to be able to merge all of the sketches into 1 sketch.  I don't need this for my work, but I can imagine an application where there are N streams of the same type of data and this would be useful.  Something to think about in the future.

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Tuesday, May 26, 2020 8:13 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Oh, and for merging, I'll make sure that both objects have the same number of dimensions and then merge things in. Should be fairly straightforward. Not going to support selectively merging individual dimensions, at least for now.
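
In rough Python terms (illustrative only; the real check and loop live in the C++ wrapper), that merge amounts to:

def merge_vectors(dst_sketches, src_sketches):
    # dst_sketches / src_sketches: one kll sketch per dimension
    if len(dst_sketches) != len(src_sketches):
        raise ValueError("number of dimensions must match")
    for dst, src in zip(dst_sketches, src_sketches):
        dst.merge(src)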


  jon

On Tue, May 26, 2020 at 4:52 PM Jon Malkin <jo...@gmail.com>> wrote:
Thanks for that!

We discussed things a bit on the ASF slack dev channel (datasketches-dev) and we'll go with vector_of_kll_sketches as the c++ object name. Probably something similar in python. So gotta do that, and then clean up unit test names. But it's in pretty good shape so far.

  jon

On Tue, May 26, 2020 at 7:57 AM Michael Himes <mh...@knights.ucf.edu>> wrote:
That's a great motto to code by!

I adapted the existing unit tests for kll_sketch to work for the new kll_sketches class, and everything seems to be working as intended.  Some things are not implemented -- merging, and the normalized_rank_error method (note that there is the get_normalized_rank_error static method) -- and are therefore not tested.  Once they are implemented into the class, then those tests can be added.

I've submitted a pull request, let me know if there are any other tests you'd like before it's considered tested & working.

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Tuesday, May 26, 2020 1:53 AM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

I think it now works for quantiles, rank, pmf, and cdf.

This exercise is a good example of why my colleague operates by the motto that if it isn't tested, it's broken. In very much related news, we need unit tests for this thing, in either C++ or python (probably the latter unless we move it into the core C++ part of the repo).

  jon

On Mon, May 25, 2020 at 2:06 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Ah gosh, that was silly on my part.

So, I ran the previous code without that silly mistake, then called kll.get_quantiles(0.5) and it threw this error:

TypeError: get_quantiles(): incompatible function arguments. The following argument types are supported:
    1. (self: datasketches.kll_floatarray_sketches, fractions: List[float], isk: numpy.ndarray[int32] = -1) -> array

Invoked with: <datasketches.kll_floatarray_sketches object at 0x7f610ce7de30>, 0.5

I also tried kll.get_quantiles([0.5]) and the Numpy array equivalent, and it throws this error:

ValueError: array has incorrect number of dimensions: 0; expected 1

This error happens even when I do kll.get_quantiles([0.5, 0.7]) or the Numpy array equivalent, even though it has 1 dimension, not 0.

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Monday, May 25, 2020 4:53 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

That's the range() command complaining -- 1e6 is a float, but range wants an int. It worked if I instead changed the line to
for i in range(int(1e6)):

  jon

On Mon, May 25, 2020 at 1:36 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Jon,

Just got around to testing it out.  Maybe I am doing something wrong here, but I can't get the code to work correctly.  Here's the code:

import numpy as np
from datasketches import kll_floatarray_sketches
k = 160
d = 3
kll = kll_floatarray_sketches(k, d)
for i in range(1e6):
  kll.update(np.random.randn(d))

And here's the error:

TypeError: 'float' object cannot be interpreted as an integer

Seems like the inputs have changed, but the inputs in the code look pretty similar.  Can you point out what I'm doing wrong here?

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Friday, May 22, 2020 6:21 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Hi Michael,

My default is to treat an input vector x as a column vector -- the generic quadratic form x^T A x assumes that, for instance. But might be an engineering thing. Following your approach for now and eventually we can debate whether to transpose the matrix if one dimension matches the number of sketches in the object but not the expected one.

Anyway, I looked more at the docs and see them using unchecked references (after doing a bounds check) so I switched to that, and then I added in a check for c-style vs fortran-style indexing so that I believe it'll have the inner loop over the native dimension. In theory it'll walk linearly through the matrix. That or I got it exactly backwards and am thrashing some cache level, one of the two :D
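
For what it's worth, the layout distinction is easy to poke at from plain numpy (independent of the wrapper):

import numpy as np

a = np.random.randn(4, 600)                                  # default is C-style, row-major
print(a.flags['C_CONTIGUOUS'], a.flags['F_CONTIGUOUS'])      # True  False
print(a.T.flags['C_CONTIGUOUS'], a.T.flags['F_CONTIGUOUS'])  # False True
b = np.asfortranarray(a)                                     # same values, column-major layout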

If you have some time, please check out the branch and play with it for a bit to ensure it's still behaving as you expect. Then we can figure out some relevant unit tests,

  jon



On Fri, May 22, 2020 at 7:06 AM Michael Himes <mh...@knights.ucf.edu>> wrote:
Jon,

Those changes sound great, as long as the data is being accessed correctly. The pybind docs warn about accessing data through the array_t object since it's not guaranteed to be contiguous in memory.  Typically, they demonstrate accessing it through the buffer, which I followed.  But if this is an unnecessary step, then great.

As for the 2D case, here is my line of thinking.  For 1D, we have a single row with d values.  So for 2D, we'd have n rows with d values, (n x d).  I believe that is how I coded it, but it's possible I flipped the dimensions.
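
Concretely, the convention I have in mind (plain numpy, just to pin down the shapes) is:

import numpy as np

n, d = 5, 3
single = np.random.randn(d)      # 1-D input: one value per dimension
batch = np.random.randn(n, d)    # 2-D input: n rows, each a d-dimensional vector
# batch[i] is the i-th input vector; batch[:, j] is the stream for dimension j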

Michael

________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Thursday, May 21, 2020 7:17 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Michael,

I've restructured the object to be an actual C++ object with proper methods. And then I've gotten rid of all the casts to buffer in favor of just using the py::array_t<> that's passed in. That removes casting everything to double, and allows for range checks. Now an attempt to access sketch 7 in a 5-d array doesn't just segfault :)

Looking at pybind docs a bit more, it seems there are no hard guarantees on data layout in memory with numpy arrays -- if you transpose one, walking through with a pointer will return items in the wrong order. So update() ends up using items.at() instead (more on that in a moment). The whole thing is probably also copying values around more than necessary. Anyway, we can look at ways to optimize such things eventually, but for now I'm working on ensuring correctness and at least somewhat graceful failure.

Anyway, item input order. If we have 1-d input, we implicitly assume we want d updates, one for each dimension in the object. It seems like the default for numpy is row-major order, which makes sense given C beneath the hood. But for inputting n points at a time, do you expect the matrix to be (d x n) or (n x d)?

  jon

On Tue, May 19, 2020, 5:20 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Re: the template type A, I set that for the Python array data type.  A Python float is 64 bits, so that is a C++ double.  I thought it was necessary to set the py::array_t data type since I think it's a template, but I could be mistaken.

Michael

________________________________
From: leerho <le...@gmail.com>>
Sent: Tuesday, May 19, 2020 7:46 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Excellent work!

On Tue, May 19, 2020 at 4:04 PM Jon Malkin <jo...@gmail.com>> wrote:
I also used k=160, so in this case we matched nicely. And the bunches of 2^5 or 2^7 you were testing is exactly what I meant when referring to batched inputs. So that's good news.

I'll take a more careful look through the code -- there was something with update using arrays of templated type A which was always cast to double, for instance. But this is certainly promising.

  jon

On Tue, May 19, 2020 at 3:32 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Great tests (especially with the ordering), Jon!

I did some scaling tests for dimensionality (1, 10, and 100), and this is where I think the Numpy version shows its benefits.  I performed a test similar to your setup:
- each sketch has k = 160 (unsure what you used for this value, if it matters)
- 2^25 draws from a normalized Gaussian distribution (numpy.random.normal)
- get_quantiles(0.5)

d=1    -- 84 s (this is the 123 s case you ran)
d=10   -- 88 s
d=100  --  294 s
d=1000 -- 2298 s (did this one for fun, but there is a lot of variability in runtime)

Note that I did not use a single-value method, just the Numpy version.  Also, I checked the compute cost of the Python loop, and it's about 1 second, so most of that ~80 seconds is the communication between Python and C++.  The scaling relation looks to be better than linear, but there needs to be a few more tests here to really determine that.

But, as Lee pointed out, there is non-negligible overhead from crossing the bridge between Python and C++.  It's small, but when doing it 2^25 times it adds up.  The Numpy implementation allows you to cross that bridge much less often, albeit at the cost of some extra time programming that part.  If I set up a queue that holds 2^5 values and then updates it, it's quite a bit better.  Here are the results for the same dimensions as before:

d=1   -- 8 s
d=10  -- 31 s
d=100 -- 257 s

So, even with a small queue of 32 values, we see that a single sketch using kll_sketches is faster than a kll_sketch by a factor of 2-3.  And with the batch set to 2^7 values (this is how I use it in my project):
d=1   -- 4.2 s
d=10  -- 27 s
d=100 -- 251 s

The speed gain doesn't seem to scale with dimensionality, but I think that has more to do with the compute overhead of generating the data since Numpy tends to be faster when working in 1D vs multiple dimensions.  But we can see that it's possible to get runtimes much closer to C++ runtimes than would be expected.
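
A minimal sketch of the batching approach described above (assuming the vectorized class accepts a 2-D batch of shape (n, d); the kll_floatarray_sketches name matches my branch at this point and was renamed later):

import numpy as np
from datasketches import kll_floatarray_sketches

k, d, batch_size = 160, 10, 2**5
kll = kll_floatarray_sketches(k, d)

buffer = []
for _ in range(int(1e5)):
    buffer.append(np.random.randn(d))
    if len(buffer) == batch_size:
        kll.update(np.vstack(buffer))   # one Python-to-C++ crossing per batch
        buffer = []
if buffer:
    kll.update(np.vstack(buffer))       # flush the remainder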

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Tuesday, May 19, 2020 4:58 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Well, one thought was maybe we could always use the vectorized kll in python and make it (relatively) easy to have it work with only 1 dimension. It looks like there's still a non-trivial performance hit from that. But wow.. I realized I could try something simple like reversing the declaration order of single-update vs vector-update in the wrapper class. And that dropped it to 35s!

With that, it may be worth exploring a unified wrapper that handles single items or vectors.

  jon

On Tue, May 19, 2020 at 1:52 PM leerho <le...@gmail.com>> wrote:
We had a similar issue in Java trying to use JNI to access C code.  Every transition across the "boundary" between Java and C took from 10 to 100 microseconds.  This made the JNI option pretty useless from our standpoint.

I don't know python that well, but I could well imagine that there may be a similar issue here in moving data between Python and C++.

That being said, compared to brute-force computation of these types of queries in Python, using even these (what we consider slow-performing) sketches from Python may still be a huge win.

Lee.



On Tue, May 19, 2020 at 1:28 PM Jon Malkin <jo...@gmail.com>> wrote:
I tried comparing the performance of the existing floats sketch vs the new thing with a single dimension. And then I made a second update method that handles a single item rather than creating an array of length 1 each time. Otherwise, the scripts were as identical as possible. I fed in 2^25 gaussian-distributed values and queried for the mean to force some computation on the sketch. I think get_quantile(0.5) vs get_quantiles(0.5)[0][0] was the only difference,

Existing kll_floats_sketch: 31s
kll_floatarray_sketches: 123s
with single-item update: 80s

Same test in c++: 1.7s  (I can get it to 1.4s but that's using a worse RNG so this seemed more fair)

I didn't try anything with trying to batch updates, even though in theory the new object can support that. This was more a test to see the performance impact of using it for all kll sketches.

At some level, if you're already ok taking the speed hit for python vs C++ then maybe it doesn't matter. But >2x still seems significant to me.

  jon

On Thu, May 14, 2020 at 6:54 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Great, I'll be submitting the pull request shortly.  The codebase I'm working with doesn't have any of the changes made in the past week or so, hopefully that isn't too much of a hassle to merge.

As an aside, my employer encourages us to contribute code to libraries like this, so I'm happy to work on additional features for the Python interface as needed.

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Thursday, May 14, 2020 6:56 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

We've been polishing things up for a release, so that was one of several things that we fixed over the last several days. Thank you for finding it!

Anyway, if you're generally happy with the state of things (and are allowed to under any employment terms), I'd encourage you to create pull request to merge your changes into the main repo. It doesn't need to be perfect as we can always make changes as part of the PR review or post-merge.

Thanks,
  jon


On Mon, May 11, 2020 at 2:25 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Thanks for taking a look, Jon.

I pushed an update that address 2 & 4.

#3 is actually something I had a question about. I've tested passing numpy.nan into the update function, and it doesn't appear to break anything (min, max, etc all still work correctly).  However, the reported number of items per sketch counts the nan entries.  Is this the expected behavior, or should the get_n() method return a number that does not count the nans it has seen?  I expected the latter, so I'm worried that numpy's nan is being treated differently.
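
For reference, the sort of probe I ran (assuming the kll_sketches class from my branch, with constructor arguments (k, number_of_sketches)):

import numpy as np
from datasketches import kll_sketches

sk = kll_sketches(160, 3)
sk.update(np.array([1.0, np.nan, 2.0]))
sk.update(np.array([3.0, 4.0, np.nan]))
print(sk.get_n())            # per-sketch item counts -- currently these include the nan entries
print(sk.get_min_values())   # min/max/etc. still look correct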

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Monday, May 11, 2020 4:32 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

I didn't look in super close detail, but the code overall looks pretty good. Comments are below.

Note that not all of these necessarily need changes or replies. I'm just trying to document things we'll want to think about for keeping the library general-purpose (and we can always make changes after merging, of course).

1. I worry the name kll_sketches is confusingly similar to kll_sketch. Maybe vector_kll_sketches? But if there's a way to extend KLL in the future to operate on an entire vector at a time (vs treating each dimension independently) that'd become confusing. I think an inherently vectorized version would be a very different beast, but I always worry I'm not being imaginative enough. If merging into the Apache codebase, I'd probably wait to see what the file looks like with the renaming before a final decision on moving to its own file.

2. What happens if the input to update() has >2 dimensions? If that'd be invalid, we should explicitly check and complain. If it'll Do The Right Thing by operating on the first 2 dimensions (meaning correct indices) that's fine, but otherwise should probably complain.

3. Can this handle sparse input vectors? Not sure how important that is in general, even if your project doesn't require it. kll_sketch will ignore NaNs, so those appearing would mean the number of items per sketch can already differ.

4. I'd probably eat the very slightly increased space and go with 32 bits for the number of dimensions (aka number of sketches). If trying to look at a distribution of values for some machine learning application, it'd be easy to overflow 65k dimensions for some tasks.

5. I imagine you've realized that it's easiest to do unit tests from python in this case. That's another advantage of having this live in the wrapper.

6. Finally, that assert issue is already obsolete :). Asserts were converted if/throw exceptions late last week. It'll be flagged as a conflict in merging, so no worries for now.

Looking good at this point. And as I said, not all of these need changes or comments from you.

  jon

On Mon, May 11, 2020 at 7:09 AM Michael Himes <mh...@knights.ucf.edu>> wrote:
Understood, I went ahead and moved the new class to the kll_wrapper.cpp file -- I'll leave it to you to decide if it's better as its own file.

Also, while gcc 7.4.0 compiles the code without issue, using gcc 7.5.0 throws errors regarding the assert calls in kll_sketch_impl.hpp.  I added an include of assert.h there and then it compiled without issue.  It's possible that other compilers will also complain about that, so maybe this is a good update to the main branch.

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Sunday, May 10, 2020 10:47 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

My only comment without having looked at actual code is that the new class would be more appropriate in the python wrapper. Maybe even drop it in as its own file, as that would decrease recompile time a bit when debugging (that's pybind's suggestion, anyway). Probably not a huge difference with how light these wrappers are.

If this is something that becomes widely used, to where we look at pushing it into the base library, we'd look at whether we could share any data across sketches. But we're far from that point currently. It'd be nice to get to the point where we need to consider that.

  jon

On Sun, May 10, 2020, 7:33 PM leerho <le...@gmail.com>> wrote:
Michael,  this has been a great interchange and certainly will allow us to move forward more quickly.

Thank you for working on this on a Mother's Day Sunday!

I'm sure Alex and Jon may have more questions, when they get a chance to look at it starting tomorrow.

Cheers, and be safe and well!

Lee.

On Sun, May 10, 2020 at 6:25 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Re: testing, so far I've just done glorified unit tests for uniform and normal distributions of varying sizes.  I plan to do some timing tests vs the existing single-sketch Python class to see how it compares for 1, 10, and 100 streams.

1. That makes sense.  One option to allow full Numpy compatibility but without requiring a Python user to use Numpy would be to return everything as lists, rather than Numpy arrays.  Numpy users could then convert those lists into arrays, and non-Numpy users would be unaffected (aside from needing the pybind11/numpy.h header).  Alternatively, some flag could be set when instantiating the object that would control whether things are returned as lists or arrays, though this still requires the numpy.h header file.

2. I didn't change the kll_sketch code; I only defined a new (wrapper) class called kll_sketches, which spawns a user-specified number of sketches.  Each of those sketches is a kll_sketch object and uses all of the existing code for that.  For fast execution in Python, the parallel sketches must be spawned in C++, but the existing Python object could only spawn a single sketch since it wraps the kll_sketch class.  Perhaps the kll_sketches class would be better placed in the python/src/kll_wrapper.cpp file?  I suppose you wouldn't need this class if you weren't using Python.

3. Yes, SerDe is very straight-forward here.  I've marked some stuff as todo's, and that is one of them -- the plan is to do like you described and call the relevant kll_sketch method on each of the sketches and return that to Python in a sensible format.  For deserialization, it would just iterate through them and load them into the kll_sketches object.  I don't require it for my project, so I didn't bother to wrap that yet -- I'll take a look sometime this week after I finish my work for the day, shouldn't take long to do.

4. That makes sense.  Does using Numpy complicate that at all?  My thought is that since under the hood everything is using the existing kll_sketch class, it would have full compatibility with the rest of the library (once SerDe is added in).

Michael
________________________________
From: leerho <le...@gmail.com>>
Sent: Sunday, May 10, 2020 8:42 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Thanks for the link to your code.  My colleagues, Jon and Alex, will take a closer look this next week.  They wrote this code so they are much closer to it than I.

What you have done so far makes sense for you as you want to get this working in the NumPy environment as quickly as possible.  As soon as we start thinking about incorporating this into our library other concerns become important.

1. Adding API calls is the recommended way to add functionality (like NumPy) to a library.  We cannot change API calls in a way that is only useful with NumPy, because it would seriously impact other users of the library that don't need NumPy.  If both sets of calls cannot simultaneously exist in the same sketch API, then we need to consider other alternatives.

2.  Based on our previous discussions, I didn't envision that you would have to change the kll_sketch code itself other than perhaps a "wrapper" class that enables vectorized input to a vector of sketches and a vectorized get result that creates a vector result from a vector of sketches.  This would isolate the changes you need for NumPy from the sketch itself.  This is also much easier to support, maintain and debug.

3. If you don't change the internals of the sketch then SerDe becomes pretty straightforward. I don't know if you need a single serialization that represents a full vector of sketches,  but if you do, then I would just iterate over the individual serdes and figure out how to package it.  I really don't think you want to have to rewrite this low-level stuff.

4. Binary compatibility is critically important for us and I think will be important for you as well.  There are two dimensions of binary compatibility: history and language.  This means that a kll sketch serialized from Java can be successfully read by C++ and vice versa.  Similarly, a kll sketch serialized today will be able to be read many years from now.  Another aspect of this would mean being able to collect, say, 100 sketches that were not created using the NumPy version, and being able to put them together in a NumPy vector; and vice versa.

I hope all of this make sense to you.

Cheers,

Lee.



On Sun, May 10, 2020 at 4:21 PM leerho <le...@gmail.com>> wrote:
Michael,
This is great!  What testing have you been able to do so far?


On Sun, May 10, 2020 at 3:31 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Lee,

Thanks for all of that information, it's quite helpful to get a better understanding of things.

I've put the code on Github if you'd like to take a look: https://github.com/mdhimes/incubator-datasketches-cpp

Changes are
- new class in kll/include/kll_sketch.hpp, w/ associated constructor in kll/include/kll_sketch_impl.hpp.  This class spawns a specified number of sketches.
- new Python interface functions in python/src/kll_wrapper.cpp

The only new dependency introduced is the pybind11/numpy.h header file.  The new Numpy-compatible Python classes retain identical functionality to the existing classes (with minor changes to method names, e.g., get_min_value --> get_min_values), except that I have not yet implemented merging or (de)serialization.  These would be straight-forward to implement, if needed.

Re: characterization tests, I'll take a look at those tests you linked to and see about running them, time and compute resources permitting.

Michael
________________________________
From: leerho <le...@gmail.com>>
Sent: Sunday, May 10, 2020 5:32 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Michael,

Is there a place on GitHub somewhere where I could look at your code so far?  The reason I ask, is before you do a PR, we would like to determine where a contribution such as this should be placed.

Our library is split up among different repositories, determined by language and dependencies.  This keeps the user downloads smaller and more focused.   We have two library repos for the core sketch algorithms, one for Java and one for C++/Python, where the dependencies are very lean, which simplifies integration into other systems.  We have separate repos for adaptors, which depend on one of the core repos. On the Java side, we have separate repos for adaptors for Apache Hive and Apache Pig, as the dependencies for each of these are quite large.  For C++, we have a dedicated repo for the adaptors for PostgreSQL.

Some of our adaptors are hosted with the target system.  For example, our Druid adaptors were contributed directly into Apache Druid.

I assume your code has dependencies on Python, NumPy and DataSketches-cpp. It is not clear to me at the moment whether we should create a separate repo for this or have a separate group of directories in our cpp repo.

****
We have a separate repo for our characterization code, which is not formally "released" as an Apache release.  It exists because we want others to be able to reproduce (or challenge) our claims of speed performance or accuracy.  It is the one repo where we have all languages and many different dependencies.  The coding style is not as rigorous or as well documented as our repos that do have formal releases.

Characterization testing is distinctly different from Unit Tests, which basically checks all the main code paths and makes sure that the program works as it should.  The key metric is code coverage and Unit Tests should be fast as it is run on every check-in of new code.  Characterization is also different from Integration Testing, which is testing how well the code works when integrated into larger systems.

Characterization tests are unique to our kind of library. Because our algorithms are probabilistic in nature, in order to verify accuracy or speed performance we need to run many thousands of trials to eliminate statistical noise in the results.  And when the data is large, this can take a long time.  You can peruse our website for many examples as all the plots result from various characterization studies.  What appears on the website is but a small fraction of all the testing we have done.

There are no "standard" tests as every sketch is different so we have to decide what is important to measure for a particular sketch, but the basic groups are speed and accuracy.

For speed there are many possible measurements, but the basic ones are update speed, merge speed, Serialization / Deserialization speed, get estimate or get result speeds.

For accuracy we want to validate that the sketch is performing within the bounds of the theoretical error distribution.  We want to measure this accuracy in the context of a stand-alone, purely streaming sketch and also in the context of merging many sketches together.
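
As a toy illustration of a single accuracy trial (assuming the single-stream kll_floats_sketch; a real characterization run repeats this over thousands of trials and many stream lengths):

import numpy as np
from datasketches import kll_floats_sketch

def rank_error_one_trial(n, k=200, rank=0.5):
    data = np.random.randn(n)
    sk = kll_floats_sketch(k)
    for x in data:
        sk.update(x)
    estimate = sk.get_quantile(rank)
    empirical_rank = np.mean(data <= estimate)   # true rank of the estimate in this stream
    return abs(empirical_rank - rank)

errors = [rank_error_one_trial(2**16) for _ in range(100)]
print(np.percentile(errors, 99))                 # compare against the sketch's error bound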

We also try to do these same tests comparing the results against other alternatives users might have.  We have performed these same characterizations on other publically available sketches as well as against traditional, brute-force approaches to solving the same problem.

For the solution you have developed, we would depend on you to decide what properties would be most important to characterize for users of this solution.  It should be very similar to what you would write in a paper describing this solution;  you want to convince the reader that this is very useful and why.

Since the first sketch you have leveraged is the KLL quantiles sketch, I would think you would want some characterizations similar to what we did for our studies (https://datasketches.apache.org/docs/Quantiles/KLLSketch.html) comparing our older quantiles sketch and the KLL sketch.

****
For the Java characterization tests, we have "standardized" on having small configuration files which define the key parameters of the test.  These are simple text files (https://github.com/apache/incubator-datasketches-characterization/tree/master/src/main/resources) of key-value pairs.  We don't have any centralized definition of these pairs, just that they are human readable and intelligible.  They are different for each type of sketch.

For the C++ tests, we don't have a collection of config files yet (this is one of our TODOs), but the same kind of parameters are set in the code itself.

We will likely want to set up a separate directory for your characterization tests.

I hope you find this helpful.

Cheers,

Lee.

On Sun, May 10, 2020 at 10:05 AM Michael Himes <mh...@knights.ucf.edu>> wrote:
The code is in a good state now.  It can take individual values, lists, or Numpy arrays as input, and it returns back Numpy arrays.  There are some additional features, like being able to specify which sketches the user wants to, e.g., get quantiles for.
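
As a rough usage sketch (class, method, and argument names as they stand in my branch right now, so they may change):

import numpy as np
from datasketches import kll_sketches

k, d = 160, 4
sk = kll_sketches(k, d)
sk.update(np.random.randn(d))        # a numpy vector, one value per sketch
sk.update([0.1, 0.2, 0.3, 0.4])      # a plain Python list works too
medians = sk.get_quantiles([0.5])    # returns a numpy array, one median per sketch
subset = sk.get_quantiles([0.5], isk=np.array([0, 2], dtype=np.int32))  # only sketches 0 and 2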

But, I have only done minor testing with uniform and normal distributions.  I'd like to put it through more extensive testing (and some documentation) before releasing it, and it sounds like your characterization tests are the way to go -- it's not science if it's not reproducible!  Is there a standard set of tests for this purpose?  If not, are there standard tests that have been used for the existing codebase?

Michael
________________________________
From: leerho <le...@gmail.com>>
Sent: Saturday, May 9, 2020 7:21 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

This is great.  The first step is to get your project working!  Once you think you are ready, it would be really useful if you could do some characterization testing in the NumPy environment. Characterization tests are what we run to fully understand how a sketch performs over a range of parameters and using thousands to millions of trials.  You can see some of the accuracy and speed performance plots of various sketches on our website.  Sometimes these can take hours to run.  We typically use synthetic data to drive our characterization tests to make them reproducible.

Real data can also be used and one comparison test I would recommend is comparing how long it takes to get approximate results using sketches versus how long it would take to get exact results using brute force methods.  The bigger the data set is the better :)

We don't have much experience with NumPy so this will be a new environment for us.  But before you get too deep into this please get us involved.  We have been characterizing these streaming algorithms for a number of years, and would like to help you.

Cheers,

Lee.

On Sat, May 9, 2020 at 2:18 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
I'm not quite sure what being a committer entails, but yeah I'm happy to contribute.  I can't commit a lot of time to working on it, but with how things went for KLL I don't think it will take a lot of time for the other sketches if they are formatted in a similar manner.  Getting this library integrated into numpy/scipy would be awesome, I'm sure I could get some others in my field to begin using it.

Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Saturday, May 9, 2020 5:06 PM
To: Michael Himes <mh...@knights.ucf.edu>>; dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

This is just awesome!   Would you be interested in becoming a committer on our project?  It is not automatic, but we could work with you to bring you up to speed on the other sketches in the library.  If you could help us integrate DataSketches into NumPy and possibly SciPy (not sure if this is necessary) it would be a very significant contribution and we would definitely want you to be part of our community!

Thanks,

Lee.

On Sat, May 9, 2020 at 1:41 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Lee,

Thanks for the notice, I went ahead and subscribed to the list.

As for Jon's email, this is actually what I have currently implemented!  Once I finish ironing out a couple improvements, I'm going to move some code around to follow the existing coding style, put it on Github, and submit a pull request.

Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Saturday, May 9, 2020 4:22 PM
To: Michael Himes <mh...@knights.ucf.edu>>
Subject: Fwd: Permission to use KLL streaming-quantiles code in free open-source academic software

Hi Michael,
I don't think you saw this email as I doubt you are subscribed to our dev@datasketches.apache.org<ma...@datasketches.apache.org> email list.

We would like to have you as part of our larger community, as others might also have suggestions on how to move your project forward.
You can subscribe by sending an empty email to dev-subscribe@datasketches.apache.org<ma...@datasketches.apache.org>.

Lee.

---------- Forwarded message ---------
From: Jon Malkin <jo...@gmail.com>>
Date: Thu, May 7, 2020 at 4:11 PM
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software
To: <de...@datasketches.apache.org>>
Cc: Lee Rhodes <lr...@verizonmedia.com>>, Edo Liberty <ed...@gmail.com>>, edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>


We're using pybind11 to get a C++ interface with python (vs raw C). The wrappers themselves are quite thin, but they do have examples of calling functions defined in the wrapper as opposed to only the sketch object.

I believe the easiest way to do this will be to define a pretty simple C++ object and create a pybind wrapper for it.  That object would contain a std::vector<kll_sketch>.  Then you'd define an update method for your custom object that iterates through a numpy array and calls update() on the appropriate sketch. You'd also want to define something similar for get_quantile() or whatever other methods you need that iterates through that vector of sketches and returns the result in a numpy array.

That's a pretty lightweight object. And then you'd use a similar thin pybind wrapper around it to make it play nicely with python. Since our C++ library is just templates, you'd end up with a free-standing library, with no requirement that the base datasketches library be involved.

  jon

On Thu, May 7, 2020 at 1:08 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
I would be happy to share whatever I come up with (if anything).  The lack of a Numpy/Scipy implementation is what led me to the DataSketches library, it would be very useful to myself and others if it were a part of Numpy/Scipy.

For what it's worth, passing in a Numpy array and manipulating it from the C++ side is quite easy.  On the other hand, figuring out how to spawn m sketches and pass the values along to that looks like it'll be more challenging; there is a lot of code here, and it'll take some time for me to familiarize myself with it.

Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Thursday, May 7, 2020 12:00 PM
To: Michael Himes <mh...@knights.ucf.edu>>
Cc: Edo Liberty <ed...@gmail.com>>; dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>; edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

If you do figure out how to do this, it would be great if you could share it with us.  We would like to extend it to other sketches and submit it as an added functionality to NumPy.  I have been looking at the NumPy and SciPy libraries and have not found anything close to what we have.

Lee.


On Thu, May 7, 2020 at 7:08 AM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Lee, Jon,

Thanks for the information.  I tried to vectorize things this morning and ran into that exact problem -- since the offsets can differ, it leads to slices of different lengths, which wouldn't be possible to store as a single Numpy array.

Lee, your understanding of my problem is spot on.  n vectors of size m, where all m elements of each vector are a float (no NaNs or missing values).  I am interested in quantiles at rank r for each of the m streams.  Only 1 sketch will operate simultaneously, saving/loading the sketch is not required (though it would be a nice feature), and sketches would not need to be merged (no serialization/deserialization).

Not surprisingly, it looks like your original suggestion of handling this on the C++ side is the way to go.  Once I have time to dive into the code, my plan is to write something that implements what you described in the earlier email.

Thanks,
Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Wednesday, May 6, 2020 10:43 PM
To: Michael Himes <mh...@knights.ucf.edu>>
Cc: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>; Edo Liberty <ed...@gmail.com>>; edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>

Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Michael,

One of my colleagues, Jon Malkin, pointed out another reason the vector-KLL will not work: for each dimension, the choice of whether to delete the odd or even values in the compactor must be random and independent of the other dimensions.  Otherwise you might get unwanted correlation effects between the dimensions.
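
A small NumPy illustration of that point (sizes made up; this is not the library's compaction code, only a sketch of the parity choice):

import numpy as np

rng = np.random.default_rng()
n, m = 8, 4                                    # 8 buffered items, 4 dimensions
buf = np.sort(rng.normal(size=(n, m)), axis=0)

# One shared coin flip: every dimension discards the same parity -> correlated errors.
kept_shared = buf[rng.integers(2)::2]                     # shape (n//2, m)

# Independent coin flip per dimension: each column keeps its own parity.
parity = rng.integers(2, size=m)
idx = parity[None, :] + 2 * np.arange(n // 2)[:, None]    # shape (n//2, m)
kept_indep = np.take_along_axis(buf, idx, axis=0)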

This is another argument that you should have independent compactors for each dimension.  So you might as well stick with individual sketches for each dimension.

Lee.

On Wed, May 6, 2020 at 4:39 PM Lee Rhodes <lr...@verizonmedia.com>> wrote:
Michael,

Allow me to back up for a moment to make sure I understand your problem.

You have a large number of large vectors of the form V_n = {x_i}:  n vectors of size m, where x is a number and x_i is the ith element, or equivalently, the ith dimension.

Assumptions:

  *   All vectors, V, are of the same size m.
  *   All elements, x_i, are valid numbers of the same type. No missing values, and if you are using floats, this means no NaNs.

In aggregate, the n vectors represent m independent distributions of values.

Your task is to be able to obtain m quantiles at rank r in a single query.
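
As a point of reference only, the exact (brute-force) version of that task is a one-liner in NumPy, assuming -- unlike your case -- that all n vectors fit in memory at once; the vectorized sketch would approximate the same m-vector of answers from a stream:

import numpy as np

n, m, r = 10_000, 50, 0.5              # n vectors, m dimensions, rank of interest
data = np.random.randn(n, m)
exact = np.quantile(data, r, axis=0)   # m values, one per dimension
print(exact.shape)                     # (50,)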

****
To do this, using your idea, would require vectorization of the entire sketch and not just the compactors.  The inputs are vectors, the result of operations such as getQuantile(r), getQuantileUpperBound(r), getQuantileLowerBound(r), are also vectors.

This sketch will be a large data structure, which leads to more questions ...

  *   Do you anticipate having many of these vectorized sketches operating simultaneously?
  *   Is there any requirement to store and later retrieve this sketch?
  *   Or, the nearly equivalent question: Do you require merging of these sketches (across clusters, for example)?  Which also means serialization and deserialization.

I am concerned that this vector-quantiles sketch would be limited in the sense that it may not be as widely applicable as it could be.

Our experience with real data is that it is ugly with missing values, NaN, nulls, etc.  Which means we would not be able to vectorize the compactor.  Each dimension i would need a separate independent compactor because the compaction times will vary depending on missing values or NaNs in the data.

Spacewise, I don't think having separate independent sketches for each dimension would be much smaller than vectorizing the entire sketch, because the internals of the existing sketch are already quite space efficient leveraging compact arrays, etc.

As a first step I would favor figuring out how to access the NumPy data structure on the C++ side, having individual sketches for each dimension, and doing the iterations updating the sketches in C++.   It also has the advantage of leveraging code that exists and it would automatically be able to leverage any improvements to the sketch code over time.  In addition, it could be a prototype of how to integrate other sketches into the NumPy ecosystem.

A fully vectorized sketch would be a separate implementation and would not be able to take advantage of these points.

Lee.

On Wed, May 6, 2020 at 2:47 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Lee,

I don't think there is a problem with the DataSketches library, just that it doesn't support what I am trying to do -- looking in the documentation, it only supports streams of ints or floats, and those situations work fine for me.  Here's what I did:
- began with the KLL test .py file: https://github.com/apache/incubator-datasketches-cpp/blob/master/python/tests/kll_test.py
- replaced line 30 with kll.update(np.ones(10) * randn())  to have a Numpy array of 10 identical values.
- ran the code

This leads to the following error, as expected:
TypeError: update(): incompatible function arguments. The following argument types are supported:
    1. (self: datasketches.kll_floats_sketch, item: float) -> None

Invoked with: <datasketches.kll_floats_sketch object at 0x7f1e128989d0>, array([-1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424,
       -1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424])

It's not coded to support Numpy arrays, so it complains.  What I would ideally like to have happen in this scenario is it would treat each element in the array as a separate stream.  Then, later when getting a given quantile, it would give 10 values, one for each stream.  I don't see an easy approach to implementing this on the Python side besides a very slow iterative approach, and admittedly my C++ is quite rusty so I haven't looked into the codebase to see how I might modify things there to support this functionality.

Re: the streaming-quantiles code being easily modified, I believe the only necessary changes would be changing the Compactor class to be a subclass of numpy.ndarray, rather than list, and implementing methods for the list-specific methods that are used, like .append().  Then, it isn't necessary to loop over the streams since we can make use of Numpy's broadcasting, which will handle the looping in its C++ code, as you mentioned.  I'll work on this and see if it really is as straight-forward as it seems.

If you have any advice on how to use DataSketches for my problem, I'm certainly open to that.

Thanks,
Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Wednesday, May 6, 2020 4:37 PM
To: Michael Himes <mh...@knights.ucf.edu>>; dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Cc: Edo Liberty <ed...@gmail.com>>; edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Michael,

Thank you for considering the DataSketches library.   I am adding this thread to our dev@datasketches.apache.org<ma...@datasketches.apache.org> so that our whole team can contribute to finding a solution for you.

WRT the error you experienced, please help us help you by sharing with us what the exact error was.

We are about to release a major upgrade to the DataSketches C++/Python product in the next few weeks.  We have fixed a number of stability issues and bugs, which may solve the problem.  Nonetheless, we want to work with you to get your problem solved.

Updating 1e5 sketches in a system is not a problem in Java or C++.   We have real-time systems today that generate and process over 1e9 sketches every day.  Unfortunately our experience tells us that looping in Python code will be 10 to 100 times slower than Java or C++.  This is because the code would have to switch from Python to C++ for every vector element.

By comparison, the streaming-quantiles code could be easily modified to use Numpy arrays and operate on vectors.

I would like to understand more about what you have in mind that would be "easily modified".

NumPy achieves its speed performance by doing all of the matrix operations in pre-compiled C++ code.  To achieve best performance, we would want to read and loop through the NumPy data structure on the C++ side leveraging the C++ DataSketches library directly.  I am not sure what would be involved to actually accomplish that.

But first we need to get your Python + NumPy code working correctly with our library so we can find out what its actual performance is.

Cheers,

Lee.





On Wed, May 6, 2020 at 12:10 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Edo, Lee,

Thanks for the prompt response.  I looked at the datasketches library, and while it seems to have a lot more features, it looks like it'll be a lot more difficult to get it to work for my desired use case.

My problem is that I need quantiles for each element of a vector (length on the order of 1e4 -- 1e5), for some finite stream of vectors (on the order of 1e6 -- 1e8).  I tried using datasketches's KLL with Numpy arrays, but it throws an error, so it doesn't seem like datasketches handles this situation currently.

To use datasketches, I think I would need to instantiate 1 object per vector element, and I suspect this will slow things down considerably due to iterating over the objects when each vector is processed.  By comparison, the streaming-quantiles code could be easily modified to use Numpy arrays and operate on vectors.  I ran a few unit tests on both codes and found equivalent behavior, as expected.

Do you have any recommendation(s) for this situation?  Are there known limitations of the streaming-quantiles code that would cause issues for my use case?  Are the other methods offered in datasketches 'better' than the KLL implemented in streaming-quantiles?  I'm quite out of my area of expertise, so I appreciate any advice you can offer, and I will of course acknowledge it in the publication.

Best,
Michael

________________________________
From: Edo Liberty <ed...@gmail.com>>
Sent: Tuesday, May 5, 2020 8:09 PM
To: Lee Rhodes <lr...@verizonmedia.com>>; Michael Himes <mh...@knights.ucf.edu>>
Cc: edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

+Lee

Hi Michael, Thanks for reaching out.
While you can certainly do that, I recommend using the python-Binded datasketches library. It will be more robust, faster, and bug free than my code :)

On Tue, May 5, 2020 at 14:11 Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Edo,

I'm currently working on a Python package for machine-learning-accelerated exoplanet modeling.  It is free and open source (see here if you're curious: https://github.com/exosports/HOMER), and it's meant purely for reproducible academic research.

I'm adding some new features to the software, and one of them requires computing quantiles for a data set that cannot fit into memory.  After searching around for different methods to do this, your KLL method seemed to be a good option in terms of speed and space requirements.

Rather than reinvent the wheel and code my own implementation of the method from scratch, I was wondering if you'd be willing to allow me to use your code?  I don't see a license, so I wanted to make sure you're okay with this.  I could implement it as a submodule within my repo, or I could only include the kll.py file and add some additional comments pointing to your repo and such, whichever you prefer.

Best,
Michael
--
From my cell phone.

Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Posted by Jon Malkin <jo...@gmail.com>.
Got distracted from this by a series of bugs both in our c++ release
candidate as well as in another repo. Anyway, finally finished things and
created a PR to push this into master. Feel free to comment:
https://github.com/apache/incubator-datasketches-cpp/pull/156

  jon

On Wed, May 27, 2020 at 6:25 AM Michael Himes <mh...@knights.ucf.edu>
wrote:

> Sounds good to me.
>
> I've been thinking more about merging, and I think selectively merging
> individual dimensions would probably be unnecessary (if you have 1000
> streams, why selectively merge 2 of those?).  But, one thing that I think
> might be a good idea to implement is to be able to merge all of the
> sketches into 1 sketch.  I don't need this for my work, but I can imagine
> an application where there are N streams of the same type of data and this
> would be useful.  Something to think about in the future.
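>
> (Illustration only: with the existing single-sketch Python binding, that
> collapse of N per-stream sketches into one is already a short loop, assuming
> kll_floats_sketch.merge() is exposed as in the C++ API; per_stream_sketches
> here is a hypothetical list of per-stream sketches.)
>
> from datasketches import kll_floats_sketch
>
> combined = kll_floats_sketch(160)
> for sk in per_stream_sketches:
>     combined.merge(sk)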
>
> Michael
> ------------------------------
> *From:* Jon Malkin <jo...@gmail.com>
> *Sent:* Tuesday, May 26, 2020 8:13 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Oh, and for merging, I'll make sure that both objects have the same number
> of dimensions and then merge things in. Should be fairly straightforward.
> Not going to support selectively merging individual dimensions, at least
> for now.
>
>
>   jon
>
> On Tue, May 26, 2020 at 4:52 PM Jon Malkin <jo...@gmail.com> wrote:
>
> Thanks for that!
>
> We discussed things a bit on the ASF slack dev channel (datasketches-dev)
> and we'll go with vector_of_kll_sketches as the c++ object name. Probably
> something similar in python. So gotta do that, and then clean up unit test
> names. But it's in pretty good shape so far.
>
>   jon
>
> On Tue, May 26, 2020 at 7:57 AM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> That's a great motto to code by!
>
> I adapted the existing unit tests for kll_sketch to work for the new
> kll_sketches class, and everything seems to be working as intended.  Some
> things are not implemented -- merging, and the normalized_rank_error method
> (note that there is the get_normalized_rank_error static method) -- and are
> therefore not tested.  Once they are implemented into the class, then those
> tests can be added.
>
> I've submitted a pull request, let me know if there are any other tests
> you'd like before it's considered tested & working.
>
> Michael
> ------------------------------
> *From:* Jon Malkin <jo...@gmail.com>
> *Sent:* Tuesday, May 26, 2020 1:53 AM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> I think it now works for quantiles, rank, pmf, and cdf.
>
> This exercise is a good example of why my colleague operates by the motto
> that if it isn't tested, it's broken. In very much related news, we need
> unit tests for this thing, in either C++ or python (probably the latter
> unless we move it into the core C++ part of the repo).
>
>   jon
>
> On Mon, May 25, 2020 at 2:06 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Ah gosh, that was silly on my part.
>
> So, I ran the previous code without that silly mistake, then called
> kll.get_quantiles(0.5) and it threw this error:
>
> TypeError: get_quantiles(): incompatible function arguments. The following
> argument types are supported:
>     1. (self: datasketches.kll_floatarray_sketches, fractions:
> List[float], isk: numpy.ndarray[int32] = -1) -> array
>
> Invoked with: <datasketches.kll_floatarray_sketches object at
> 0x7f610ce7de30>, 0.5
>
> I also tried kll.get_quantiles([0.5]) and the Numpy array equivalent, and
> it throws this error:
>
> ValueError: array has incorrect number of dimensions: 0; expected 1
>
> This error happens even when I do kll.get_quantiles([0.5, 0.7]) or the
> Numpy array equivalent, even though it has 1 dimension, not 0.
>
> Michael
> ------------------------------
> *From:* Jon Malkin <jo...@gmail.com>
> *Sent:* Monday, May 25, 2020 4:53 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> That's the range() command complaining -- 1e6 is a float, but range wants
> an int. It worked if I instead changed the line to
> for i in range(int(1e6)):
>
>   jon
>
> On Mon, May 25, 2020 at 1:36 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Jon,
>
> Just got around to testing it out.  Maybe I am doing something wrong here,
> but I can't get the code to work correctly.  Here's the code:
>
> import numpy as np
> from datasketches import kll_floatarray_sketches
> k = 160
> d = 3
> kll = kll_floatarray_sketches(k, d)
> for i in range(1e6):
>   kll.update(np.random.randn(d))
>
> And here's the error:
>
> TypeError: 'float' object cannot be interpreted as an integer
>
> Seems like the inputs have changed, but the inputs in the code look pretty
> similar.  Can you point out what I'm doing wrong here?
>
> Michael
> ------------------------------
> *From:* Jon Malkin <jo...@gmail.com>
> *Sent:* Friday, May 22, 2020 6:21 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Hi Michael,
>
> My default is to treat an input vector x as a column vector -- the generic
> quadratic form x^T A x assumes that, for instance. But might
> be an engineering thing. Following your approach for now and eventually we
> can debate whether to transpose the matrix if one dimension matches the
> number of sketches in the object but not the expected one.
>
> Anyway, I looked more at the docs and see them using unchecked references
> (after doing a bounds check) so I switched to that, and then I added in a
> check for c-style vs fortran-style indexing so that I believe it'll have
> the inner loop over the native dimension. In theory it'll walk linearly
> through the matrix. That or I got it exactly backwards and am thrashing
> some cache level, one of the two :D
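>
> (Side note, not part of the branch itself: from the Python side the layout
> question can also be sidestepped by handing the wrapper a C-contiguous array
> up front, e.g.)
>
> import numpy as np
>
> data = np.random.randn(1000, 5)     # stand-in for whatever will be passed in
> x = np.asarray(data, dtype=np.float32)
> if not x.flags['C_CONTIGUOUS']:
>     x = np.ascontiguousarray(x)     # copies only when the layout requires it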
>
> If you have some time, please check out the branch and play with it for a
> bit to ensure it's still behaving as you expect. Then we can figure out
> some relevant unit tests,
>
>   jon
>
>
>
> On Fri, May 22, 2020 at 7:06 AM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Jon,
>
> Those changes sound great, as long as the data is being accessed
> correctly. The pybind docs warn about accessing data through the array_t
> object since it's not guaranteed to be contiguous in memory.  Typically,
> they demonstrate accessing it through the buffer, which I followed.  But if
> this is an unnecessary step, then great.
>
> As for the 2D case, here is my line of thinking.  For 1D, we have a single
> row with d values.  So for 2D, we'd have n rows with d values, (n x d).  I
> believe that is how I coded it, but it's possible I flipped the dimensions.
>
> Michael
>
> ------------------------------
> *From:* Jon Malkin <jo...@gmail.com>
> *Sent:* Thursday, May 21, 2020 7:17 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Michael,
>
> I've restructured the object to be an actual C++ object with proper
> methods. And then I've gotten rid of all the casts to buffer in favor of
> just using the py::array_t<> that's passed in. Tha removes casting
> everything to double, and allows for range checks. Now an attempt to access
> sketch 7 in a 5-d array doesn't just segfault :)
>
> Looking at pybind docs a bit more, it seems there are no hard guarantees
> on data layout in memory with numpy arrays -- if you transpose one, walking
> through with a pointer will return items in the wrong order. So update()
> ends up using items.at() instead (more on that in a moment). The whole thing
> is probably also
> copying values around more than necessary. Anyway, we can look at ways to
> optimize such things eventually, but for now I'm working on ensuring
> correctness and at least somewhat graceful failure.
>
> Anyway, item input order. If we have 1-d input, we implicitly assume we
> want d updates, one for each dimension in the object. It seems like the
> default for numpy is row-major order, which makes sense given C beneath the
> hood. But for inputting n points at a time, do you expect the matrix to be
> (d x n) or (n x d)?
>
>   jon
>
> On Tue, May 19, 2020, 5:20 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Re: the template type A, I set that for the Python array data type.  A
> Python float is 64 bits, so that is a C++ double.  I thought it was
> necessary to set the py::array_t data type since I think it's a template,
> but I could be mistaken.
>
> Michael
>
> ------------------------------
> *From:* leerho <le...@gmail.com>
> *Sent:* Tuesday, May 19, 2020 7:46 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Excellent work!
>
> On Tue, May 19, 2020 at 4:04 PM Jon Malkin <jo...@gmail.com> wrote:
>
> I also used k=160, so in this case we matched nicely. And the bunches of
> 2^5 or 2^7 you were testing is exactly what I meant when referring to
> batched inputs. So that's good news.
>
> I'll take a more careful look through the code -- there was something with
> update using arrays of templated type A which was always cast to double,
> for instance. But this is certainly promising.
>
>   jon
>
> On Tue, May 19, 2020 at 3:32 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Great tests (especially with the ordering), Jon!
>
> I did some scaling tests for dimensionality (1, 10, and 100), and this is
> where I think the Numpy version shows its benefits.  I performed a test
> similar to your setup:
> - each sketch has k = 160 (unsure what you used for this value, if it
> matters)
> - 2^25 draws from a normalized Gaussian distribution (numpy.random.normal)
> - get_quantiles(0.5)
>
> d=1    -- 84 s (this is the 123 s case you ran)
> d=10   -- 88 s
> d=100  --  294 s
> d=1000 -- 2298 s (did this one for fun, but there is a lot of variability
> in runtime)
>
> Note that I did not use a single-value method, just the Numpy version.
> Also, I checked the compute cost of the Python loop, and it's about 1
> second, so most of that ~80 seconds is the communication between Python and
> C++.  The scaling relation looks to be better than linear, but there needs
> to be a few more tests here to really determine that.
>
> But, as Lee pointed out, there is non-negligible overhead from crossing
> the bridge between Python and C++.  It's small, but when doing it 2^25
> times it adds up.  The Numpy implementation allows you to cross that bridge
> much less often, albeit at the cost of some extra time programming that
> part.  If I set up a queue that holds 2^5 values and then updates it, it's
> quite a bit better.  Here are the results for the same dimensions as before:
>
> d=1   -- 8 s
> d=10  -- 31 s
> d=100 -- 257 s
>
> So, even with a small queue of 32 values, we see that a single sketch
> using kll_sketches is faster than a kll_sketch by a factor of 2-3.  And
> with the batch set to 2^7 values (this is how I use it in my project):
> d=1   -- 4.2 s
> d=10  -- 27 s
> d=100 -- 251 s
>
> The speed gain doesn't seem to scale with dimensionality, but I think that
> has more to do with the compute overhead of generating the data since Numpy
> tends to be faster when working in 1D vs multiple dimensions.  But we can
> see that it's possible to get runtimes much closer to C++ runtimes than
> would be expected.
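>
> (A minimal sketch of that batching scheme, using the kll_floatarray_sketches
> name from earlier in this thread; the exact constructor and update()
> signatures may differ from the branch.)
>
> import numpy as np
> from datasketches import kll_floatarray_sketches
>
> k, d, batch = 160, 10, 2**7
> kll = kll_floatarray_sketches(k, d)
> queue = np.empty((batch, d))
> filled = 0
> for _ in range(2**15):
>     queue[filled] = np.random.randn(d)
>     filled += 1
>     if filled == batch:            # cross the Python/C++ bridge once per batch
>         kll.update(queue)
>         filled = 0
> if filled:
>     kll.update(queue[:filled])     # flush any remainder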
>
> Michael
> ------------------------------
> *From:* Jon Malkin <jo...@gmail.com>
> *Sent:* Tuesday, May 19, 2020 4:58 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Well, one thought was maybe we could always use the vectorized kll in
> python and make it (relatively) easy to have it work with only 1 dimension.
> It looks like there's still a non-trivial performance hit from that. But
> wow.. I realized I could try something simple like reversing the
> declaration order of single-update vs vector-update in the wrapper class.
> And that dropped it to 35s!
>
> With that, it may be worth exploring a unified wrapper that handles single
> items or vectors.
>
>   jon
>
> On Tue, May 19, 2020 at 1:52 PM leerho <le...@gmail.com> wrote:
>
> We had a similar issue in Java trying to use JNI to access C code.  Every
> transition across the "boundary" between Java and C took from 10 to 100
> microseconds.  This made the JNI option pretty useless from our
> standpoint.
>
> I don't know python that well, but I could well imagine that there may be
> a similar issue here in moving data between Python and C++.
>
> That being said, compared to brute-force computation of these types of
> queries in Python vs using even these (what we consider slow performing)
> sketches in Python still may be a huge win.
>
> Lee.
>
>
>
> On Tue, May 19, 2020 at 1:28 PM Jon Malkin <jo...@gmail.com> wrote:
>
> I tried comparing the performance of the existing floats sketch vs the new
> thing with a single dimension. And then I made a second update method that
> handles a single item rather than creating an array of length 1 each time.
> Otherwise, the scripts were as identical as possible. I fed in 2^25
> gaussian-distributed values and queried for the mean to force some
> computation on the sketch. I think get_quantile(0.5) vs
> get_quantiles(0.5)[0][0] was the only difference.
>
> Existing kll_floats_sketch: 31s
> kll_floatarray_sketches: 123s
> with single-item update: 80s
>
> Same test in c++: 1.7s  (I can get it to 1.4s but that's using a worse RNG
> so this seemed more fair)
>
> I didn't try anything with trying to batch updates, even though in theory
> the new object can support that. This was more a test to see the
> performance impact of using it for all kll sketches.
>
> At some level, if you're already ok taking the speed hit for python vs C++
> then maybe it doesn't matter. But >2x still seems significant to me.
>
>   jon
>
> On Thu, May 14, 2020 at 6:54 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Great, I'll be submitting the pull request shortly.  The codebase I'm
> working with doesn't have any of the changes made in the past week or so,
> hopefully that isn't too much of a hassle to merge.
>
> As an aside, my employer encourages us to contribute code to libraries
> like this, so I'm happy to work on additional features for the Python
> interface as needed.
>
> Michael
> ------------------------------
> *From:* Jon Malkin <jo...@gmail.com>
> *Sent:* Thursday, May 14, 2020 6:56 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> We've been polishing things up for a release, so that was one of several
> things that we fixed over the last several days. Thank you for finding it!
>
> Anyway, if you're generally happy with the state of things (and are
> allowed to under any employment terms), I'd encourage you to create pull
> request to merge your changes into the main repo. It doesn't need to be
> perfect as we can always make changes as part of the PR review or
> post-merge.
>
> Thanks,
>   jon
>
>
> On Mon, May 11, 2020 at 2:25 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Thanks for taking a look, Jon.
>
> I pushed an update that address 2 & 4.
>
> #3 is actually something I had a question about. I've tested passing
> numpy.nan into the update function, and it doesn't appear to break anything
> (min, max, etc all still work correctly).  However, the reported number of
> items per sketch counts the nan entries.  Is this the expected behavior, or
> should the get_n() method return a number that does not count the nans it
> has seen?  I expected the latter, so I'm worried that numpy's nan is being
> treated differently.
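>
> (A quick way to see the behaviour in question, assuming the binding accepts
> a NaN at all -- np.nan is just a Python float, so update() should take it:)
>
> import numpy as np
> from datasketches import kll_floats_sketch
>
> sk = kll_floats_sketch(200)
> sk.update(np.nan)
> print(sk.get_n())    # 1 if NaN updates are counted, 0 if they are ignored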
>
> Michael
> ------------------------------
> *From:* Jon Malkin <jo...@gmail.com>
> *Sent:* Monday, May 11, 2020 4:32 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> I didn't look in super close detail, but the code overall looks pretty
> good. Comments are below.
>
> Note that not all of these necessarily need changes or replies. I'm just
> trying to document things we'll want to think about for keeping the library
> general-purpose (and we can always make changes after merging, of course).
>
> 1. I worry the name kll_sketches is confusingly similar to kll_sketch.
> Maybe vector_kll_sketches? But if there's a way to extend KLL in the future
> to operate on an entire vector at a time (vs treating each dimension
> independently) that'd become confusing. I think an inherently vectorized
> version would be a very different beast, but I always worry I'm not being
> imaginative enough. If merging into the Apache codebase, I'd probably wait
> to see what the file looks like with the renaming before a final decision
> on moving to its own file.
>
> 2. What happens if the input to update() has >2 dimensions? If that'd be
> invalid, we should explicitly check and complain. If it'll Do The Right
> Thing by operating on the first 2 dimensions (meaning correct indices)
> that's fine, but otherwise should probably complain.
>
> 3. Can this handle sparse input vectors? Not sure how important that is in
> general, even if your project doesn't require it. kll_sketch will ignore
> NaNs, so those appearing would mean the number of items per sketch can
> already differ.
>
> 4. I'd probably eat the very slightly increased space and go with 32 bits
> for the number of dimensions (aka number of sketches). If trying to look at
> a distribution of values for some machine learning application, it'd be
> easy to overflow 65k dimensions for some tasks.
>
> 5. I imagine you've realized that it's easiest to do unit tests from
> python in this case. That's another advantage of having this live in the
> wrapper.
>
> 6. Finally, that assert issue is already obsolete :). Asserts were
> converted if/throw exceptions late last week. It'll be flagged as a
> conflict in merging, so no worries for now.
>
> Looking good at this point. And as I said, not all of these need changes
> or comments from you.
>
>   jon
>
> On Mon, May 11, 2020 at 7:09 AM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Understood, I went ahead and moved the new class to the kll_wrapper.cpp
> file -- I'll leave it to you to decide if it's better as its own file.
>
> Also, while gcc 7.4.0 compiles the code without issue, using gcc 7.5.0
> throws errors regarding the assert calls in kll_sketch_impl.hpp.  I added
> an include of assert.h there and then it compiled without issue.  It's
> possible that other compilers will also complain about that, so maybe this
> is a good update to the main branch.
>
> Michael
> ------------------------------
> *From:* Jon Malkin <jo...@gmail.com>
> *Sent:* Sunday, May 10, 2020 10:47 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> My only comment without having looked at actual code is that the new class
> would be more appropriate in the python wrapper. Maybe even drop it in as
> it's own file, as that would decrease recompile time a bit when debugging
> (that's pybind's suggestion, anyway). Probably not a huge difference with
> how light these wrappers are.
>
> If this is something that becomes widely used, to where we look at pushing
> it into the base library, we'd look at whether we could share any data
> across sketches. But we're far from that point currently. It'd be nice to
> need to consider that.
>
>   jon
>
> On Sun, May 10, 2020, 7:33 PM leerho <le...@gmail.com> wrote:
>
> Michael,  this has been a great interchange and certainly will allow us to
> move forward more quickly.
>
> Thank you for working on this on a Mother's Day Sunday!
>
> I'm sure Alex and Jon may have more questions, when they get a chance to
> look at it starting tomorrow.
>
> Cheers, and be safe and well!
>
> Lee.
>
> On Sun, May 10, 2020 at 6:25 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Re: testing, so far I've just done glorified unit tests for uniform and
> normal distributions of varying sizes.  I plan to do some timing tests vs
> the existing single-sketch Python class to see how it compares for 1, 10,
> and 100 streams.
>
> 1. That makes sense.  One option to allow full Numpy compatibility but
> without requiring a Python user to use Numpy would be to return everything
> as lists, rather than Numpy arrays.  Numpy users could then convert those
> lists into arrays, and non-Numpy users would be unaffected (aside from
> needing the pybind11/numpy.h header).  Alternatively, some flag could be
> set when instantiating the object that would control whether things are
> returned as lists or arrays, though this still requires the numpy.h header
> file.
>
> 2. I didn't change the kll_sketch code, I only defined a new (wrapper)
> class called kll_sketches, which spawns a user-specified number of
> sketches.  Each of those sketches are kll_sketch objects and uses all of
> the existing code for that.  For fast execution in Python, the parallel
> sketches must be spawned in C++, but the existing Python object could only
> spawn a single sketch since it wraps the kll_sketch class.  Perhaps the
> kll_sketches class would be better placed in the python/src/kll_wrapper.cpp
> file?  I suppose you wouldn't need this class if you weren't using Python.
>
> 3. Yes, SerDe is very straight-forward here.  I've marked some stuff as
> todo's, and that is one of them -- the plan is to do like you described and
> call the relevant kll_sketch method on each of the sketches and return that
> to Python in a sensible format.  For deserialization, it would just iterate
> through them and load them into the kll_sketches object.  I don't require
> it for my project, so I didn't bother to wrap that yet -- I'll take a look
> sometime this week after I finish my work for the day, shouldn't take long
> to do.
>
> 4. That makes sense.  Does using Numpy complicate that at all?  My thought
> is that since under the hood everything is using the existing kll_sketch
> class, it would have full compatibility with the rest of the library (once
> SerDe is added in).
>
> Michael
> ------------------------------
> *From:* leerho <le...@gmail.com>
> *Sent:* Sunday, May 10, 2020 8:42 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Thanks for the link to your code.  My colleagues, Jon and Alex, will take
> a closer look this next week.  They wrote this code so they are much closer
> to it than I.
>
> What you have done so far makes sense for you as you want to get this
> working in the NumPy environment as quickly as possible.  As soon as we
> start thinking about incorporating this into our library other concerns
> become important.
>
> 1. Adding API calls is the recommended way to add functionality (like
> NumPy) to a library.  We cannot change API calls in a way that is only
> useful with NumPy, because it would seriously impact other users of the
> library that don't need NumPy.  If both sets of calls cannot simultaneously
> exist in the same sketch API, then we need to consider other alternatives.
>
> 2.  Based on our previous discussions, I didn't envision that you would
> have to change the kll_sketch code itself other than perhaps a "wrapper"
> class that enables vectorized input to a vector of sketches and a
> vectorized get result that creates a vector result from a vector of
> sketches.  This would isolate the changes you need for NumPy from the
> sketch itself.  This is also much easier to support, maintain and debug.
>
> 3. If you don't change the internals of the sketch then SerDe becomes
> pretty straightforward. I don't know if you need a single serialization
> that represents a full vector of sketches,  but if you do, then I would
> just iterate over the individual serdes and figure out how to package it.
> I really don't think you want to have to rewrite this low-level stuff.
>
> 4. Binary compatibility is critically important for us and I think will be
> important for you as well.  There are two dimensions of binary
> compatibility: history and language.  This means that a kll sketch
> serialized from Java can be successfully read by C++ and vice versa.
> Similarly, a kll sketch serialized today will be able to be read many years
> from now.     Another aspect of this would mean being able to collect, say,
> 100 sketches that were not created using the NumPy version, and being able
> to put them together in a NumPy vector, and vice versa.
>
> I hope all of this make sense to you.
>
> Cheers,
>
> Lee.
>
>
>
> On Sun, May 10, 2020 at 4:21 PM leerho <le...@gmail.com> wrote:
>
> Michael,
> This is great!  What testing have you been able to do so far?
>
>
> On Sun, May 10, 2020 at 3:31 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Lee,
>
> Thanks for all of that information, it's quite helpful to get a better
> understanding of things.
>
> I've put the code on Github if you'd like to take a look:
> https://github.com/mdhimes/incubator-datasketches-cpp
>
> Changes are
> - new class in kll/include/kll_sketch.hpp, w/ associated constructor in
> kll/include/kll_sketch_impl.hpp.  This class spawns a specified number of
> sketches.
> - new Python interface functions in python/src/kll_wrapper.cpp
>
> The only new dependency introduced is the pybind11/numpy.h header file.
> The new Numpy-compatible Python classes retain identical functionality to
> the existing classes (with minor changes to method names, e.g.,
> get_min_value --> get_min_values), except that I have not yet implemented
> merging or (de)serialization.  These would be straight-forward to
> implement, if needed.
>
> Re: characterization tests, I'll take a look at those tests you linked to
> and see about running them, time and compute resources permitting.
>
> Michael
> ------------------------------
> *From:* leerho <le...@gmail.com>
> *Sent:* Sunday, May 10, 2020 5:32 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Michael,
>
> Is there a place on GitHub somewhere where I could look at your code so
> far?  The reason I ask, is before you do a PR, we would like to determine
> where a contribution such as this should be placed.
>
> Our library is split up among different repositories, determined by
> language and dependencies.  This keeps the user downloads smaller and more
> focused.   We have two library repos for the core sketch algorithms, one
> for Java and one for C++/Python, where the dependencies are very lean,
> which simplifies integration into other systems.  We have separate repos
> for adaptors, which depend on one of the core repos. On the Java side, we
> have separate repos for adaptors for Apache Hive and Apache Pig, as the
> dependencies for each of these are quite large.  For C++, we have a
> dedicated repo for the adaptors for PostgreSQL.
>
> Some of our adaptors are hosted with the target system.  For example, our
> Druid adaptors were contributed directly into Apache Druid.
>
> I assume your code has dependencies on Python, NumPy and DataSketches-cpp.
> It is not clear to me at the moment whether we should create a separate
> repo for this or have a separate group of directories in our cpp repo.
>
> ****
> We have a separate repo for our characterization code, which is not
> formally "released" as an Apache release.  It exists because we want others
> to be able to reproduce (or challenge) our claims of speed performance or
> accuracy.  It is the one repo where we have all languages and many
> different dependencies.  The coding style is not as rigorous or as well
> documented as our repos that do have formal releases.
>
> Characterization testing is distinctly different from Unit Tests, which
> basically checks all the main code paths and makes sure that the program
> works as it should.  The key metric is code coverage and Unit Tests should
> be fast as it is run on every check-in of new code.  Characterization is
> also different from Integration Testing, which is testing how well the code
> works when integrated into larger systems.
>
> Characterization tests are unique to our kind of library. Because our
> algorithms are probabilistic in nature, in order to verify accuracy or
> speed performance we need to run many thousands of trials to eliminate
> statistical noise in the results.  And when the data is large, this can
> take a long time.  You can peruse our website for many examples as all the
> plots result from various characterization studies.  What appears on the
> website is but a small fraction of all the testing we have done.
>
> There are no "standard" tests as every sketch is different so we have to
> decide what is important to measure for a particular sketch, but the basic
> groups are *speed* and *accuracy*.
>
> For speed there are many possible measurements, but the basic ones are
> update speed, merge speed, Serialization / Deserialization speed, get
> estimate or get result speeds.
>
> For accuracy we want to validate that the sketch is performing within the
> bounds of the theoretical error distribution.  We want to measure this
> accuracy in the context of a stand-alone, purely streaming sketch and also
> in the context of merging many sketches together.
>
> We also try to do these same tests comparing the results against other
> alternatives users might have.  We have performed these same
> characterizations on other publically available sketches as well as against
> traditional, brute-force approaches to solving the same problem.
>
> For the solution you have developed, we would depend on you to decide what
> properties would be most important to characterize for users of this
> solution.  It should be very similar to what you would write in a paper
> describing this solution;  you want to convince the reader that this is
> very useful and why.
>
> Since the first sketch you have leveraged is the KLL quantiles sketch, I
> would think you would want some characterizations similar to what we did
> for our studies
> <https://datasketches.apache.org/docs/Quantiles/KLLSketch.html>
> comparing our older quantiles sketch and the KLL sketch.
>
> ****
> For the Java characterization tests, we have "standardized" on having
> small configuration files which define the key parameters of the test.
> These are simple text files
> <https://github.com/apache/incubator-datasketches-characterization/tree/master/src/main/resources>
> of key-value pairs.  We don't have any centralized definition of these
> pairs, just that they are human readable and intelligible.  They are
> different for each type of sketch.
>
> For the C++ tests, we don't have a collection of config files yet (this is
> one of our TODOs), but the same kind of parameters are set in the code
> itself.
>
> We will likely want to set up a separate directory for your
> characterization tests.
>
> I hope you find this helpful.
>
> Cheers,
>
> Lee.
>
> On Sun, May 10, 2020 at 10:05 AM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> The code is in a good state now.  It can take individual values, lists, or
> Numpy arrays as input, and it returns back Numpy arrays.  There are some
> additional features, like being able to specify which sketches the user
> wants to, e.g., get quantiles for.
>
> But, I have only done minor testing with uniform and normal
> distributions.  I'd like to put it through more extensive testing (and some
> documentation) before releasing it, and it sounds like your
> characterization tests are the way to go -- it's not science if it's not
> reproducible!  Is there a standard set of tests for this purpose?  If not,
> are there standard tests that have been used for the existing codebase?
>
> Michael
> ------------------------------
> *From:* leerho <le...@gmail.com>
> *Sent:* Saturday, May 9, 2020 7:21 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> This is great.  The first step is to get your project working!  Once you
> think you are ready, it would be really useful if you could do some
> characterization testing in the NumPy environment. Characterization tests
> are what we run to fully understand how a sketch performs over a range of
> parameters and using thousands to millions of trials.  You can see some of
> the accuracy and speed performance plots of various sketches on our
> website.  Sometimes these can take hours to run.  We typically use
> synthetic data to drive our characterization tests to make them
> reproducible.
>
> Real data can also be used and one comparison test I would recommend is
> comparing how long it takes to get approximate results using sketches
> versus how long it would take to get exact results using brute force
> methods.  The bigger the data set is the better :)
>
> We don't have much experience with NumPy so this will be a new environment
> for us.  But before you get too deep into this please get us involved.  We
> have been characterizing these streaming algorithms for a number of years,
> and would like to help you.
>
> Cheers,
>
> Lee.
>
> On Sat, May 9, 2020 at 2:18 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> I'm not quite sure what being a committer entails, but yeah I'm happy to
> contribute.  I can't commit a lot of time to working on it, but with how
> things went for KLL I don't think it will take a lot of time for the other
> sketches if they are formatted in a similar manner.  Getting this library
> integrated into numpy/scipy would be awesome, I'm sure I could get some
> others in my field to begin using it.
>
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Saturday, May 9, 2020 5:06 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>; dev@datasketches.apache.org
> <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> This is just awesome!   Would you be interested in becoming a committer on
> our project?  It is not automatic, but we could work with you to bring you
> up to speed on the other sketches in the library.  If you could help us
> integrate DataSketches into NumPy and possibly SciPy (not sure if this is
> necessary) it would be a very significant contribution and we would
> definitely want you to be part of our community!
>
> Thanks,
>
> Lee.
>
> On Sat, May 9, 2020 at 1:41 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Lee,
>
> Thanks for the notice, I went ahead and subscribed to the list.
>
> As for Jon's email, this is actually what I have currently implemented!
> Once I finish ironing out a couple improvements, I'm going to move some
> code around to follow the existing coding style, put it on Github, and
> submit a pull request.
>
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Saturday, May 9, 2020 4:22 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>
> *Subject:* Fwd: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Hi Michael,
> I don't think you saw this email as I doubt you are subscribed to our
> dev@datasketches.apache.org email list.
>
> We would like to have you as part of our larger community, as others might
> also have suggestions on how to move your project forward.
> You can subscribe by sending an empty email to
> dev-subscribe@datasketches.apache.org.
>
> Lee.
>
> ---------- Forwarded message ---------
> From: *Jon Malkin* <jo...@gmail.com>
> Date: Thu, May 7, 2020 at 4:11 PM
> Subject: Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
> To: <de...@datasketches.apache.org>
> Cc: Lee Rhodes <lr...@verizonmedia.com>, Edo Liberty <
> edo.liberty@gmail.com>, edo@edoliberty.com <ed...@edoliberty.com>
>
>
> We're using pybind11 to get a C++ interface with python (vs raw C). The
> wrappers themselves are quite thin, but they do have examples of calling
> functions defined in the wrapper as opposed to only the sketch object.
>
> I believe the easiest way to do this will be to define a pretty simple C++
> object and create a pybind wrapper for it.  That object would contain a
> std::vector<kll_sketch>.  Then you'd define an update method for your
> custom object that iterates through a numpy array and calls update() on the
> appropriate sketch. You'd also want to define something similar for
> get_quantile() or whatever other methods you need that iterates through
> that vector of sketches and returns the result in a numpy array.
>
> That's a pretty lightweight object. And then you'd use a similar thin
> pybind wrapper around it to make it play nicely with python. Since our C++
> library is just templates, you'd end up with a free-standing library, with
> no requirement that the base datasketches library be involved.
>
>   jon
>
> On Thu, May 7, 2020 at 1:08 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> I would be happy to share whatever I come up with (if anything).  The lack
> of a Numpy/Scipy implementation is what led me to the DataSketches library,
> it would be very useful to myself and others if it were a part of
> Numpy/Scipy.
>
> For what it's worth, passing in a Numpy array and manipulating it from the
> C++ side is quite easy.  On the other hand, figuring out how to spawn m
> sketches and pass the values along to that looks like it'll be more
> challenging, there is a lot of code here and it'll take some time for me to
> familiarize myself with it.
>
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Thursday, May 7, 2020 12:00 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>
> *Cc:* Edo Liberty <ed...@gmail.com>; dev@datasketches.apache.org <
> dev@datasketches.apache.org>; edo@edoliberty.com <ed...@edoliberty.com>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> If you do figure out how to do this, it would be great if you could share
> it with us.  We would like to extend  it to other sketches and submit it as
> an added functionality to NumPy.  I have been looking at the NumPy and
> SciPy libraries and have not found anything close to what we have.
>
> Lee.
>
>
> On Thu, May 7, 2020 at 7:08 AM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Lee, Jon,
>
> Thanks for the information.  I tried to vectorize things this morning and
> ran into that exact problem -- since the offsets can differ, it leads to
> slices of different lengths, which wouldn't be possible to store as a
> single Numpy array.
>
> Lee, your understanding of my problem is spot on.  n vectors of size m,
> where all m elements of each vector are a float (no NaNs or missing
> values).  I am interested in quantiles at rank r for each of the m
> streams.  Only 1 sketch will operate simultaneously, saving/loading the
> sketch is not required (though it would be a nice feature), and sketches
> would not need to be merged (no serialization/deserialization).
>
> Not surprisingly, it looks like your original suggestion of handling this
> on the C++ side is the way to go.  Once I have time to dive into the code,
> my plan is to write something that implements what you described in the
> earlier email.
>
> Thanks,
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Wednesday, May 6, 2020 10:43 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>
> *Cc:* dev@datasketches.apache.org <de...@datasketches.apache.org>; Edo
> Liberty <ed...@gmail.com>; edo@edoliberty.com <ed...@edoliberty.com>
>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Michael,
>
> One of my colleagues, Jon Malkin, pointed out that the vector-KLL will not
> work for another reason and that is for each dimension, choosing whether to
> delete the odd or even values in the compactor must be random and
> independent of the other dimensions.  Otherwise you might get unwanted
> correlation effects between the dimensions.
>
> This is another argument that you should have independent compactors for
> each dimension.  So you might as well stick with individual sketches for
> each dimension.
>
> Lee.
>
> On Wed, May 6, 2020 at 4:39 PM Lee Rhodes <lr...@verizonmedia.com>
> wrote:
>
> Michael,
>
> Allow me to back up for a moment to make sure I understand your problem.
>
> You have a large number of large vectors of the form *V_n = {x_i}:*  *n*
> vectors of size *m, *where *x* is a *number* and *x_i* is the *i*th
> element, or equivalently, the *i*th dimension.
>
> Assumptions:
>
>    - All vectors, *V*, are of the same size *m.*
>    - All elements, *x_i*, are valid numbers of the same type. No missing
>    values, and if you are using *floats*, this means no *NaN*s.
>
> In aggregate, the *n* vectors represent *m* *independent* distributions
> of values.
>
> Your task is to be able to obtain *m* quantiles at rank *r* in a single
> query.
>
> ****
> To do this, using your idea, would require vectorization of the entire
> sketch and not just the compactors.  The inputs are vectors, the result of
> operations such as getQuantile(r), getQuantileUpperBound(r),
> getQuantileLowerBound(r), are also vectors.
>
> This sketch will be a large data structure, which leads to more questions
> ...
>
>    - Do you anticipate having many of these vectorized sketches operating
>    simultaneously?
>    - Is there any requirement to store and later retrieve this sketch?
>    - Or, the nearly equivalent question: Do you require merging of these
>    sketches (across clusters, for example)?  Which also means serialization
>    and deserialization.
>
> I am concerned that this vector-quantiles sketch would be limited in the
> sense that it may not be as widely applicable as it could be.
>
> Our experience with real data is that it is ugly with missing values, NaN,
> nulls, etc.  Which means we would not be able to vectorize the compactor.
> Each dimension *i* would need a separate independent compactor because
> the compaction times will vary depending on missing values or NaNs in the
> data.
>
> Spacewise, I don't think having separate independent sketches for each
> dimension would be much smaller than vectorizing the entire sketch, because
> the internals of the existing sketch are already quite space efficient
> leveraging compact arrays, etc.
>
> As a first step I would favor figuring out how to access the NumPy data
> structure on the C++ side, having individual sketches for each
> dimension, and doing the iterations updating the sketches in C++.   It also
> has the advantage of leveraging code that exists and it would automatically
> be able to leverage any improvements to the sketch code over time.  In
> addition, it could be a prototype of how to integrate other sketches into
> the NumPy ecosystem.
>
> A fully vectorized sketch would be a separate implementation and would not
> be able to take advantage of these points.
>
> Lee.
>
> On Wed, May 6, 2020 at 2:47 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Lee,
>
> I don't think there is a problem with the DataSketches library, just that
> it doesn't support what I am trying to do -- looking in the documentation,
> it only supports streams of ints or floats, and those situations work fine
> for me.  Here's what I did:
> - began with the KLL test .py file:
> https://github.com/apache/incubator-datasketches-cpp/blob/master/python/tests/kll_test.py
> - replaced line 30 with kll.update(np.ones(10) * randn())  to have a Numpy
> array of 10 identical values.
> - ran the code
>
> This leads to the following error, as expected:
> TypeError: update(): incompatible function arguments. The following
> argument types are supported:
>     1. (self: datasketches.kll_floats_sketch, item: float) -> None
>
> Invoked with: <datasketches.kll_floats_sketch object at 0x7f1e128989d0>,
> array([-1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424,
>        -1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424])
>
> It's not coded to support Numpy arrays, therefore it complains.  What I
> would ideally like to have happen in this scenario is it would treat each
> element in the array as a separate stream.  Then, later when getting a
> given quantile, it would give 10 values, one for each stream.  I don't see
> an easy approach to implementing this on the Python side besides a very
> slow iterative approach, and admittedly my C++ is quite rusty so I haven't
> looked into the codebase to see how I might modify things there to support
> this functionality.
>
> Re: the streaming-quantiles code being easily modified, I believe the only
> necessary changes would be changing the Compactor class to be a subclass of
> numpy.ndarray, rather than list, and implementing methods for the
> list-specific methods that are used, like .append().  Then, it isn't
> necessary to loop over the streams since we can make use of Numpy's
> broadcasting, which will handle the looping in its C++ code, as you
> mentioned.  I'll work on this and see if it really is as straight-forward
> as it seems.
>
> If you have any advice on how to use DataSketches for my problem, I'm
> certainly open to that.
>
> Thanks,
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Wednesday, May 6, 2020 4:37 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>; dev@datasketches.apache.org
> <de...@datasketches.apache.org>
> *Cc:* Edo Liberty <ed...@gmail.com>; edo@edoliberty.com <
> edo@edoliberty.com>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Michael,
>
> Thank you for considering the DataSketches library.   I am adding this
> thread to our dev@datasketches.apache.org so that our whole team can
> contribute to finding a solution for you.
>
> WRT the error you experienced, please help us help you by sharing with us
> what the exact error was.
>
> We are about to release a major upgrade to the DataSketches C++/Python
> product in the next few weeks.  We have fixed a number of stability issues
> and bugs, which may solve the problem.  Nonetheless, we want to work with
> you to get your problem solved.
>
> Updating 1e5 sketches in a system is not a problem in Java or C++.   We
> have real-time systems today that generate and process over 1e9 sketches
> every day.  Unfortunately our experience tells us that looping in Python
> code will be 10 to 100 times slower than Java or C++.  This is because the
> code would have to switch from Python to C++ for every vector element.
>
> By comparison, the streaming-quantiles code could be easily modified to
> use Numpy arrays and operate on vectors.
>
>
> I would like to understand more about what you have in mind that would be
> "easily modified".
>
> NumPy achieves its speed performance by doing all of the matrix operations
> in pre-compiled C++ code.  To achieve best performance, we would want to
> read and loop through the NumPy data structure on the C++ side leveraging
> the C++ DataSketches library directly.  I am not sure what would be
> involved to actually accomplish that.
>
> But first we need to get your Python + NumPy code working correctly with
> our library so we can find out what its actual performance is.
>
> Cheers,
>
> Lee.
>
>
>
>
>
> On Wed, May 6, 2020 at 12:10 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Edo, Lee,
>
> Thanks for the prompt response.  I looked at the datasketches library, and
> while it seems to have a lot more features, it looks like it'll be a lot
> more difficult to get it to work for my desired use case.
>
> My problem is that I need quantiles for each element of a vector (length
> on the order of 1e4 -- 1e5), for some finite stream of vectors (on the
> order of 1e6 -- 1e8).  I tried using datasketches's KLL with Numpy arrays,
> but it throws an error, so it doesn't seem like datasketches handles this
> situation currently.
>
> To use datasketches, I think I would need to instantiate 1 object per
> vector element, and I suspect this will slow things down considerably due
> to iterating over the objects when each vector is processed.  By
> comparison, the streaming-quantiles code could be easily modified to use
> Numpy arrays and operate on vectors.  I ran a few unit tests on both codes
> and found equivalent behavior, as expected.
>
> Do you have any recommendation(s) for this situation?  Are there known
> limitations of the streaming-quantiles code that would cause issues for my
> use case?  Are the other methods offered in datasketches 'better' than the
> KLL implemented in streaming-quantiles?  I'm quite out of my area of
> expertise, so I appreciate any advice you can offer, and I will of course
> acknowledge it in the publication.
>
> Best,
> Michael
>
> ------------------------------
> *From:* Edo Liberty <ed...@gmail.com>
> *Sent:* Tuesday, May 5, 2020 8:09 PM
> *To:* Lee Rhodes <lr...@verizonmedia.com>; Michael Himes <
> mhimes@knights.ucf.edu>
> *Cc:* edo@edoliberty.com <ed...@edoliberty.com>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> +Lee
>
> Hi Michael, Thanks for reaching out.
> While you can certainly do that, I recommend using the python-Binded
> datasketches library. It will be more robust, faster, and bug free than my
> code :)
>
> On Tue, May 5, 2020 at 14:11 Michael Himes <mh...@knights.ucf.edu> wrote:
>
> Hi Edo,
>
> I'm currently working on a Python package for machine-learning-accelerated
> exoplanet modeling.  It is free and open source (see here if you're curious
> https://github.com/exosports/HOMER
> ),
> and it's meant purely for reproducible academic research.
>
> I'm adding some new features to the software, and one of them requires
> computing quantiles for a data set that cannot fit into memory.  After
> searching around for different methods to do this, your KLL method seemed
> to be a good option in terms of speed and space requirements.
>
> Rather than reinvent the wheel and code my own implementation of the
> method from scratch, I was wondering if you'd be willing to allow me to use
> your code?  I don't see a license, so I wanted to make sure you're okay
> with this.  I could implement it as a submodule within my repo, or I could
> only include the kll.py file and add some additional comments pointing to
> your repo and such, whichever you prefer.
>
> Best,
> Michael
>
> --
> From my cell phone.
>
>

Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Posted by Michael Himes <mh...@knights.ucf.edu>.
Sounds good to me.

I've been thinking more about merging, and I think selectively merging individual dimensions would probably be unnecessary (if you have 1000 streams, why selectively merge 2 of those?).  But, one thing that I think might be a good idea to implement is to be able to merge all of the sketches into 1 sketch.  I don't need this for my work, but I can imagine an application where there are N streams of the same type of data and this would be useful.  Something to think about in the future.

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>
Sent: Tuesday, May 26, 2020 8:13 PM
To: dev@datasketches.apache.org <de...@datasketches.apache.org>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Oh, and for merging, I'll make sure that both objects have the same number of dimensions and then merge things in. Should be fairly straightforward. Not going to support selectively merging individual dimensions, at least for now.
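
As a rough Python-level illustration of that check-then-merge flow (the real logic would live in the C++ wrapper; the attribute and method names here are placeholders, and only kll_sketch's own merge() is assumed):

def merge_all_dimensions(dst, src):
    # dst and src each hold one kll_sketch per dimension (placeholder attribute name)
    if len(dst._sketches) != len(src._sketches):
        raise ValueError("both objects must have the same number of dimensions")
    for mine, theirs in zip(dst._sketches, src._sketches):
        mine.merge(theirs)   # per-dimension kll_sketch merge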


  jon

On Tue, May 26, 2020 at 4:52 PM Jon Malkin <jo...@gmail.com>> wrote:
Thanks for that!

We discussed things a bit on the ASF slack dev channel (datasketches-dev) and we'll go with vector_of_kll_sketches as the c++ object name. Probably something similar in python. So gotta do that, and then clean up unit test names. But it's in pretty good shape so far.

  jon

On Tue, May 26, 2020 at 7:57 AM Michael Himes <mh...@knights.ucf.edu>> wrote:
That's a great motto to code by!

I adapted the existing unit tests for kll_sketch to work for the new kll_sketches class, and everything seems to be working as intended.  Some things are not implemented -- merging, and the normalized_rank_error method (note that there is the get_normalized_rank_error static method) -- and are therefore not tested.  Once they are implemented into the class, then those tests can be added.
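
A minimal example of what one of these Python-side tests can look like (a sketch only: the class name follows this message and may change after the rename, and the shape returned by get_quantiles() is an assumption):

import unittest
import numpy as np
from datasketches import kll_sketches   # name used in this branch; may become vector_of_kll_sketches

class VectorKllMedianTest(unittest.TestCase):
    def test_median_of_uniform(self):
        k, d, n = 200, 10, 2**16
        kll = kll_sketches(k, d)
        for _ in range(n):
            kll.update(np.random.rand(d))               # one uniform(0, 1) sample per dimension
        medians = np.asarray(kll.get_quantiles([0.5])).flatten()
        self.assertEqual(len(medians), d)               # one median estimate per dimension
        for m in medians:
            self.assertAlmostEqual(m, 0.5, delta=0.05)  # loose tolerance; KLL is approximate

if __name__ == '__main__':
    unittest.main()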

I've submitted a pull request, let me know if there are any other tests you'd like before it's considered tested & working.

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Tuesday, May 26, 2020 1:53 AM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

I think it now works for quantiles, rank, pmf, and cdf.

This exercise is a good example of why my colleague operates by the motto that if it isn't tested, it's broken. In very much related news, we need unit tests for this thing, in either C++ or python (probably the latter unless we move it into the core C++ part of the repo).

  jon

On Mon, May 25, 2020 at 2:06 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Ah gosh, that was silly on my part.

So, I ran the previous code without that silly mistake, then called kll.get_quantiles(0.5) and it threw this error:

TypeError: get_quantiles(): incompatible function arguments. The following argument types are supported:
    1. (self: datasketches.kll_floatarray_sketches, fractions: List[float], isk: numpy.ndarray[int32] = -1) -> array

Invoked with: <datasketches.kll_floatarray_sketches object at 0x7f610ce7de30>, 0.5

I also tried kll.get_quantiles([0.5]) and the Numpy array equivalent, and it throws this error:

ValueError: array has incorrect number of dimensions: 0; expected 1

This error happens even when I do kll.get_quantiles([0.5, 0.7]) or the Numpy array equivalent, even though it has 1 dimension, not 0.

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Monday, May 25, 2020 4:53 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

That's the range() command complaining -- 1e6 is a float, but range wants an int. It worked if I instead changed the line to
for i in range(int(1e6)):
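
For reference, the full snippet with that one-line fix applied (otherwise unchanged from the code quoted below):

import numpy as np
from datasketches import kll_floatarray_sketches

k = 160   # sketch accuracy parameter
d = 3     # number of dimensions / parallel streams
kll = kll_floatarray_sketches(k, d)
for i in range(int(1e6)):          # range() needs an int, not the float 1e6
    kll.update(np.random.randn(d)) # one d-dimensional sample per update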

  jon

On Mon, May 25, 2020 at 1:36 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Jon,

Just got around to testing it out.  Maybe I am doing something wrong here, but I can't get the code to work correctly.  Here's the code:

import numpy as np
from datasketches import kll_floatarray_sketches
k = 160
d = 3
kll = kll_floatarray_sketches(k, d)
for i in range(1e6):
  kll.update(np.random.randn(d))

And here's the error:

TypeError: 'float' object cannot be interpreted as an integer

Seems like the inputs have changed, but the inputs in the code look pretty similar.  Can you point out what I'm doing wrong here?

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Friday, May 22, 2020 6:21 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Hi Michael,

My default is to treat an input vector x as a column vector -- the generic quadratic form x^T A x assumes that, for instance. But that might be an engineering thing. I'm following your approach for now, and eventually we can debate whether to transpose the matrix if one dimension matches the number of sketches in the object but not the expected one.

Anyway, I looked more at the docs and see them using unchecked references (after doing a bounds check) so I switched to that, and then I added in a check for c-style vs fortran-style indexing so that I believe it'll have the inner loop over the native dimension. In theory it'll walk linearly through the matrix. That or I got it exactly backwards and am thrashing some cache level, one of the two :D
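
To make the layout point concrete, a plain NumPy check (nothing datasketches-specific) of the c-style vs fortran-style distinction:

import numpy as np

a = np.random.randn(4, 3)
print(a.flags['C_CONTIGUOUS'])    # True: NumPy defaults to row-major (C-style) storage
print(a.T.flags['C_CONTIGUOUS'])  # False: the transpose is a strided view, not row-major
print(a.T.flags['F_CONTIGUOUS'])  # True: it is Fortran-ordered instead

Walking the transposed view with a raw pointer as if it were C-ordered would visit items in the wrong order, which is exactly the failure mode described above.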

If you have some time, please check out the branch and play with it for a bit to ensure it's still behaving as you expect. Then we can figure out some relevant unit tests,

  jon



On Fri, May 22, 2020 at 7:06 AM Michael Himes <mh...@knights.ucf.edu>> wrote:
Jon,

Those changes sound great, as long as the data is being accessed correctly. The pybind docs warn about accessing data through the array_t object since it's not guaranteed to be contiguous in memory.  Typically, they demonstrate accessing it through the buffer, which I followed.  But if this is an unnecessary step, then great.

As for the 2D case, here is my line of thinking.  For 1D, we have a single row with d values.  So for 2D, we'd have n rows with d values, (n x d).  I believe that is how I coded it, but it's possible I flipped the dimensions.

Michael

________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Thursday, May 21, 2020 7:17 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Michael,

I've restructured the object to be an actual C++ object with proper methods. And then I've gotten rid of all the casts to buffer in favor of just using the py::array_t<> that's passed in. That removes casting everything to double, and allows for range checks. Now an attempt to access sketch 7 in a 5-d array doesn't just segfault :)

Looking at pybind docs a bit more, it seems there are no hard guarantees on data layout in memory with numpy arrays -- if you transpose one, walking through with a pointer will return items in the wrong order. So update() ends up using items.at() instead (more on that in a moment). The whole thing is probably also copying values around more than necessary. Anyway, we can look at ways to optimize such things eventually, but for now I'm working on ensuring correctness and at least somewhat graceful failure.

Anyway, item input order. If we have 1-d input, we implicitly assume we want d updates, one for each dimension in the object. It seems like the default for numpy is row-major order, which makes sense given C beneath the hood. But for inputting n points at a time, do you expect the matrix to be (d x n) or (n x d)?

  jon

On Tue, May 19, 2020, 5:20 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Re: the template type A, I set that for the Python array data type.  A Python float is 64 bits, so that is a C++ double.  I thought it was necessary to set the py::array_t data type since I think it's a template, but I could be mistaken.

Michael

________________________________
From: leerho <le...@gmail.com>>
Sent: Tuesday, May 19, 2020 7:46 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Excellent work!

On Tue, May 19, 2020 at 4:04 PM Jon Malkin <jo...@gmail.com>> wrote:
I also used k=160, so in this case we matched nicely. And the batches of 2^5 or 2^7 values you were testing are exactly what I meant when referring to batched inputs. So that's good news.

I'll take a more careful look through the code -- there was something with update using arrays of templated type A which was always cast to double, for instance. But this is certainly promising.

  jon

On Tue, May 19, 2020 at 3:32 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Great tests (especially with the ordering), Jon!

I did some scaling tests for dimensionality (1, 10, and 100), and this is where I think the Numpy version shows its benefits.  I performed a test similar to your setup:
- each sketch has k = 160 (unsure what you used for this value, if it matters)
- 2^25 draws from a normalized Gaussian distribution (numpy.random.normal)
- get_quantiles(0.5)

d=1    -- 84 s (this is the 123 s case you ran)
d=10   -- 88 s
d=100  --  294 s
d=1000 -- 2298 s (did this one for fun, but there is a lot of variability in runtime)

Note that I did not use a single-value method, just the Numpy version.  Also, I checked the compute cost of the Python loop, and it's about 1 second, so most of that ~80 seconds is the communication between Python and C++.  The scaling relation looks to be better than linear, but a few more tests are needed here to really determine that.

But, as Lee pointed out, there is non-negligible overhead from crossing the bridge between Python and C++.  It's small, but when doing it 2^25 times it adds up.  The Numpy implementation allows you to cross that bridge much less often, albeit at the cost of some extra time programming that part.  If I set up a queue that holds 2^5 values and then updates it, it's quite a bit better.  Here are the results for the same dimensions as before:

d=1   -- 8 s
d=10  -- 31 s
d=100 -- 257 s

So, even with a small queue of 32 values, we see that a single sketch using kll_sketches is faster than a kll_sketch by a factor of 2-3.  And with the batch set to 2^7 values (this is how I use it in my project):
d=1   -- 4.2 s
d=10  -- 27 s
d=100 -- 251 s

The speed gain doesn't seem to scale with dimensionality, but I think that has more to do with the compute overhead of generating the data since Numpy tends to be faster when working in 1D vs multiple dimensions.  But we can see that it's possible to get runtimes much closer to C++ runtimes than would be expected.
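
A minimal sketch of the queue idea described above, assuming the vectorized update() accepts an (n x d) NumPy array as discussed earlier in the thread (class and method names follow the snippets in this thread and may differ from the final binding):

import numpy as np
from datasketches import kll_floatarray_sketches

k, d = 160, 100
batch = 2**7                          # samples buffered per update call
kll = kll_floatarray_sketches(k, d)

for _ in range(2**25 // batch):
    rows = np.random.randn(batch, d)  # (n x d) block of samples
    kll.update(rows)                  # one Python-to-C++ crossing per 2^7 samples

The per-call overhead is unchanged, but it is amortized over 128 samples instead of paid on every sample, which is where the runtimes above come from.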

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Tuesday, May 19, 2020 4:58 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Well, one thought was maybe we could always use the vectorized kll in python and make it (relatively) easy to have it work with only 1 dimension. It looks like there's still a non-trivial performance hit from that. But wow.. I realized I could try something simple like reversing the declaration order of single-update vs vector-update in the wrapper class. And that dropped it to 35s!

With that, it may be worth exploring a unified wrapper that handles single items or vectors.

  jon

On Tue, May 19, 2020 at 1:52 PM leerho <le...@gmail.com>> wrote:
We had a similar issue in Java trying to use JNI to access C code.  Every transition across the "boundary" between Java and C took from 10 to 100 microseconds.  This made the JNI option pretty useless from our standpoint.

I don't know python that well, but I could well imagine that there may be a similar issue here in moving data between Python and C++.

That being said, using even these (what we consider slow-performing) sketches in Python may still be a huge win compared to brute-force computation of these types of queries in Python.

Lee.



On Tue, May 19, 2020 at 1:28 PM Jon Malkin <jo...@gmail.com>> wrote:
I tried comparing the performance of the existing floats sketch vs the new thing with a single dimension. And then I made a second update method that handles a single item rather than creating an array of length 1 each time. Otherwise, the scripts were as identical as possible. I fed in 2^25 gaussian-distributed values and queried for the median to force some computation on the sketch. I think get_quantile(0.5) vs get_quantiles(0.5)[0][0] was the only difference.

Existing kll_floats_sketch: 31s
kll_floatarray_sketches: 123s
with single-item update: 80s

Same test in c++: 1.7s  (I can get it to 1.4s but that's using a worse RNG so this seemed more fair)

I didn't try anything with trying to batch updates, even though in theory the new object can support that. This was more a test to see the performance impact of using it for all kll sketches.

At some level, if you're already ok taking the speed hit for python vs C++ then maybe it doesn't matter. But >2x still seems significant to me.

  jon

On Thu, May 14, 2020 at 6:54 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Great, I'll be submitting the pull request shortly.  The codebase I'm working with doesn't have any of the changes made in the past week or so, hopefully that isn't too much of a hassle to merge.

As an aside, my employer encourages us to contribute code to libraries like this, so I'm happy to work on additional features for the Python interface as needed.

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Thursday, May 14, 2020 6:56 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

We've been polishing things up for a release, so that was one of several things that we fixed over the last several days. Thank you for finding it!

Anyway, if you're generally happy with the state of things (and are allowed to under any employment terms), I'd encourage you to create pull request to merge your changes into the main repo. It doesn't need to be perfect as we can always make changes as part of the PR review or post-merge.

Thanks,
  jon


On Mon, May 11, 2020 at 2:25 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Thanks for taking a look, Jon.

I pushed an update that address 2 & 4.

#3 is actually something I had a question about. I've tested passing numpy.nan into the update function, and it doesn't appear to break anything (min, max, etc all still work correctly).  However, the reported number of items per sketch counts the nan entries.  Is this the expected behavior, or should the get_n() method return a number that does not count the nans it has seen?  I expected the latter, so I'm worried that numpy's nan is being treated differently.

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Monday, May 11, 2020 4:32 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

I didn't look in super close detail, but the code overall looks pretty good. Comments are below.

Note that not all of these necessarily need changes or replies. I'm just trying to document things we'll want to think about for keeping the library general-purpose (and we can always make changes after merging, of course).

1. I worry the name kll_sketches is confusingly similar to kll_sketch. Maybe vector_kll_sketches? But if there's a way to extend KLL in the future to operate on an entire vector at a time (vs treating each dimension independently) that'd become confusing. I think an inherently vectorized version would be a very different beast, but I always worry I'm not being imaginative enough. If merging into the Apache codebase, I'd probably wait to see what the file looks like with the renaming before a final decision on moving to its own file.

2. What happens if the input to update() has >2 dimensions? If that'd be invalid, we should explicitly check and complain. If it'll Do The Right Thing by operating on the first 2 dimensions (meaning correct indices) that's fine, but otherwise should probably complain.

3. Can this handle sparse input vectors? Not sure how important that is in general, even if your project doesn't require it. kll_sketch will ignore NaNs, so those appearing would mean the number of items per sketch can already differ.

4. I'd probably eat the very slightly increased space and go with 32 bits for the number of dimensions (aka number of sketches). If trying to look at a distribution of values for some machine learning application, it'd be easy to overflow 65k dimensions for some tasks.

5. I imagine you've realized that it's easiest to do unit tests from python in this case. That's another advantage of having this live in the wrapper.

6. Finally, that assert issue is already obsolete :). Asserts were converted if/throw exceptions late last week. It'll be flagged as a conflict in merging, so no worries for now.

Looking good at this point. And as I said, not all of these need changes or comments from you.

  jon

On Mon, May 11, 2020 at 7:09 AM Michael Himes <mh...@knights.ucf.edu>> wrote:
Understood, I went ahead and moved the new class to the kll_wrapper.cpp file -- I'll leave it to you to decide if it's better as its own file.

Also, while gcc 7.4.0 compiles the code without issue, using gcc 7.5.0 throws errors regarding the assert calls in kll_sketch_impl.hpp.  I added an include of assert.h there and then it compiled without issue.  It's possible that other compilers will also complain about that, so maybe this is a good update to the main branch.

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Sunday, May 10, 2020 10:47 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

My only comment without having looked at actual code is that the new class would be more appropriate in the python wrapper. Maybe even drop it in as its own file, as that would decrease recompile time a bit when debugging (that's pybind's suggestion, anyway). Probably not a huge difference with how light these wrappers are.

If this is something that becomes widely used, to where we look at pushing it into the base library, we'd look at whether we could share any data across sketches. But we're far from that point currently. It'd be nice to get to the point where we need to consider that.

  jon

On Sun, May 10, 2020, 7:33 PM leerho <le...@gmail.com>> wrote:
Michael,  this has been a great interchange and certainly will allow us to move forward more quickly.

Thank you for working on this on a Mother's Day Sunday!

I'm sure Alex and Jon may have more questions, when they get a chance to look at it starting tomorrow.

Cheers, and be safe and well!

Lee.

On Sun, May 10, 2020 at 6:25 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Re: testing, so far I've just done glorified unit tests for uniform and normal distributions of varying sizes.  I plan to do some timing tests vs the existing single-sketch Python class to see how it compares for 1, 10, and 100 streams.

1. That makes sense.  One option to allow full Numpy compatibility but without requiring a Python user to use Numpy would be to return everything as lists, rather than Numpy arrays.  Numpy users could then convert those lists into arrays, and non-Numpy users would be unaffected (aside from needing the pybind11/numpy.h header).  Alternatively, some flag could be set when instantiating the object that would control whether things are returned as lists or arrays, though this still requires the numpy.h header file.

2. I didn't change the kll_sketch code; I only defined a new (wrapper) class called kll_sketches, which spawns a user-specified number of sketches.  Each of those sketches is a kll_sketch object and uses all of the existing code for that.  For fast execution in Python, the parallel sketches must be spawned in C++, but the existing Python object could only spawn a single sketch since it wraps the kll_sketch class.  Perhaps the kll_sketches class would be better placed in the python/src/kll_wrapper.cpp file?  I suppose you wouldn't need this class if you weren't using Python.

3. Yes, SerDe is very straight-forward here.  I've marked some stuff as todo's, and that is one of them -- the plan is to do like you described and call the relevant kll_sketch method on each of the sketches and return that to Python in a sensible format.  For deserialization, it would just iterate through them and load them into the kll_sketches object.  I don't require it for my project, so I didn't bother to wrap that yet -- I'll take a look sometime this week after I finish my work for the day, shouldn't take long to do.
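
As a sketch of that per-sketch iteration idea at the Python level (assuming the per-sketch serialize()/deserialize() calls are exposed in the binding the same way they exist in the C++ kll_sketch; this is not the final API):

from datasketches import kll_floats_sketch

def serialize_all(sketches):
    # one bytes blob per dimension, in dimension order
    return [sk.serialize() for sk in sketches]

def deserialize_all(blobs):
    # rebuild the per-dimension sketches from the stored blobs
    return [kll_floats_sketch.deserialize(b) for b in blobs]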

4. That makes sense.  Does using Numpy complicate that at all?  My thought is that since under the hood everything is using the existing kll_sketch class, it would have full compatibility with the rest of the library (once SerDe is added in).

Michael
________________________________
From: leerho <le...@gmail.com>>
Sent: Sunday, May 10, 2020 8:42 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Thanks for the link to your code.  My colleagues, Jon and Alex, will take a closer look this next week.  They wrote this code so they are much closer to it than I.

What you have done so far makes sense for you as you want to get this working in the NumPy environment as quickly as possible.  As soon as we start thinking about incorporating this into our library other concerns become important.

1. Adding API calls is the recommended way to add functionality (like NumPy) to a library.  We cannot change API calls in a way that is only useful with NumPy, because it would seriously impact other users of the library that don't need NumPy.  If both sets of calls cannot simultaneously exist in the same sketch API, then we need to consider other alternatives.

2.  Based on our previous discussions, I didn't envision that you would have to change the kll_sketch code itself other than perhaps a "wrapper" class that enables vectorized input to a vector of sketches and a vectorized get result that creates a vector result from a vector of sketches.  This would isolate the changes you need for NumPy from the sketch itself.  This is also much easier to support, maintain and debug.

3. If you don't change the internals of the sketch then SerDe becomes pretty straightforward. I don't know if you need a single serialization that represents a full vector of sketches,  but if you do, then I would just iterate over the individual serdes and figure out how to package it.  I really don't think you want to have to rewrite this low-level stuff.

4. Binary compatibility is critically important for us and I think will be important for you as well.  There are two dimensions of binary compatibility: history and language.  This means that a kll sketch serialized from Java can be successfully read by C++ and vice versa.  Similarly, a kll sketch serialized today will be able to be read many years from now.  Another aspect of this would mean being able to collect, say, 100 sketches that were not created using the NumPy version, and being able to put them together in a NumPy vector; and vice versa.

I hope all of this make sense to you.

Cheers,

Lee.



On Sun, May 10, 2020 at 4:21 PM leerho <le...@gmail.com>> wrote:
Michael,
This is great!  What testing have you been able to do so far?


On Sun, May 10, 2020 at 3:31 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Lee,

Thanks for all of that information, it's quite helpful to get a better understanding of things.

I've put the code on Github if you'd like to take a look: https://github.com/mdhimes/incubator-datasketches-cpp

Changes are
- new class in kll/include/kll_sketch.hpp, w/ associated constructor in kll/include/kll_sketch_impl.hpp.  This class spawns a specified number of sketches.
- new Python interface functions in python/src/kll_wrapper.cpp

The only new dependency introduced is the pybind11/numpy.h header file.  The new Numpy-compatible Python classes retain identical functionality to the existing classes (with minor changes to method names, e.g., get_min_value --> get_min_values), except that I have not yet implemented merging or (de)serialization.  These would be straight-forward to implement, if needed.

Re: characterization tests, I'll take a look at those tests you linked to and see about running them, time and compute resources permitting.

Michael
________________________________
From: leerho <le...@gmail.com>>
Sent: Sunday, May 10, 2020 5:32 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Michael,

Is there a place on GitHub somewhere where I could look at your code so far?  The reason I ask, is before you do a PR, we would like to determine where a contribution such as this should be placed.

Our library is split up among different repositories, determined by language and dependencies.  This keeps the user downloads smaller and more focused.   We have two library repos for the core sketch algorithms, one for Java and one for C++/Python, where the dependencies are very lean, which simplifies integration into other systems.  We have separate repos for adaptors, which depend on one of the core repos. On the Java side, we have separate repos for adaptors for Apache Hive and Apache Pig, as the dependencies for each of these are quite large.  For C++, we have a dedicated repo for the adaptors for PostgreSQL.

Some of our adaptors are hosted with the target system.  For example, our Druid adaptors were contributed directly into Apache Druid.

I assume your code has dependencies on Python, NumPy and DataSketches-cpp. It is not clear to me at the moment whether we should create a separate repo for this or have a separate group of directories in our cpp repo.

****
We have a separate repo for our characterization code, which is not formally "released" as an Apache release.  It exists because we want others to be able to reproduce (or challenge) our claims of speed performance or accuracy.  It is the one repo where we have all languages and many different dependencies.  The coding style is not as rigorous or as well documented as our repos that do have formal releases.

Characterization testing is distinctly different from Unit Tests, which basically checks all the main code paths and makes sure that the program works as it should.  The key metric is code coverage and Unit Tests should be fast as it is run on every check-in of new code.  Characterization is also different from Integration Testing, which is testing how well the code works when integrated into larger systems.

Characterization tests are unique to our kind of library. Because our algorithms are probabilistic in nature, in order to verify accuracy or speed performance we need to run many thousands of trials to eliminate statistical noise in the results.  And when the data is large, this can take a long time.  You can peruse our website for many examples as all the plots result from various characterization studies.  What appears on the website is but a small fraction of all the testing we have done.

There are no "standard" tests as every sketch is different so we have to decide what is important to measure for a particular sketch, but the basic groups are speed and accuracy.

For speed there are many possible measurements, but the basic ones are update speed, merge speed, Serialization / Deserialization speed, get estimate or get result speeds.

For accuracy we want to validate that the sketch is performing within the bounds of the theoretical error distribution.  We want to measure this accuracy in the context of a stand-alone, purely streaming sketch and also in the context of merging many sketches together.

We also try to do these same tests comparing the results against other alternatives users might have.  We have performed these same characterizations on other publically available sketches as well as against traditional, brute-force approaches to solving the same problem.

For the solution you have developed, we would depend on you to decide what properties would be most important to characterize for users of this solution.  It should be very similar to what you would write in a paper describing this solution;  you want to convince the reader that this is very useful and why.

Since the first sketch you have leveraged is the KLL quantiles sketch, I would think you would want some characterizations similar to what we did for our studies (https://datasketches.apache.org/docs/Quantiles/KLLSketch.html) comparing our older quantiles sketch and the KLL sketch.

****
For the Java characterization tests, we have "standardized" on having small configuration files which define the key parameters of the test.  These are simple text files (https://github.com/apache/incubator-datasketches-characterization/tree/master/src/main/resources) of key-value pairs.  We don't have any centralized definition of these pairs, just that they are human readable and intelligible.  They are different for each type of sketch.

For the C++ tests, we don't have a collection of config files yet (this is one of our TODOs), but the same kind of parameters are set in the code itself.

We will likely want to set up a separate directory for your characterization tests.

I hope you find this helpful.

Cheers,

Lee.

On Sun, May 10, 2020 at 10:05 AM Michael Himes <mh...@knights.ucf.edu>> wrote:
The code is in a good state now.  It can take individual values, lists, or Numpy arrays as input, and it returns back Numpy arrays.  There are some additional features, like being able to specify which sketches the user wants to, e.g., get quantiles for.

But, I have only done minor testing with uniform and normal distributions.  I'd like to put it through more extensive testing (and some documentation) before releasing it, and it sounds like your characterization tests are the way to go -- it's not science if it's not reproducible!  Is there a standard set of tests for this purpose?  If not, are there standard tests that have been used for the existing codebase?

Michael
________________________________
From: leerho <le...@gmail.com>>
Sent: Saturday, May 9, 2020 7:21 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

This is great.  The first step is to get your project working!  Once you think you are ready, it would be really useful if you could do some characterization testing in the NumPy environment. Characterization tests are what we run to fully understand how a sketch performs over a range of parameters and using thousands to millions of trials.  You can see some of the accuracy and speed performance plots of various sketches on our website.  Sometimes these can take hours to run.  We typically use synthetic data to drive our characterization tests to make them reproducible.

Real data can also be used and one comparison test I would recommend is comparing how long it takes to get approximate results using sketches versus how long it would take to get exact results using brute force methods.  The bigger the data set is the better :)
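
The skeleton of such a comparison might look like the following (synthetic data with a fixed seed for reproducibility; note that, as discussed earlier in this thread, a per-item Python loop is the slow path, so a batched or C++-side update would be used for any serious timing run):

import time
import numpy as np
from datasketches import kll_floats_sketch

np.random.seed(0)
data = np.random.randn(2**20).astype(np.float32)

t0 = time.perf_counter()
exact = np.quantile(data, 0.5)        # brute force: materialize and sort everything
t1 = time.perf_counter()

sk = kll_floats_sketch(200)
for x in data:                        # streaming: one pass, bounded memory
    sk.update(float(x))
approx = sk.get_quantile(0.5)
t2 = time.perf_counter()

print("exact %.4f in %.2fs, sketch %.4f in %.2fs" % (exact, t1 - t0, approx, t2 - t1))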

We don't have much experience with NumPy so this will be a new environment for us.  But before you get too deep into this please get us involved.  We have been characterizing these streaming algorithms for a number of years, and would like to help you.

Cheers,

Lee.

On Sat, May 9, 2020 at 2:18 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
I'm not quite sure what being a committer entails, but yeah I'm happy to contribute.  I can't commit a lot of time to working on it, but with how things went for KLL I don't think it will take a lot of time for the other sketches if they are formatted in a similar manner.  Getting this library integrated into numpy/scipy would be awesome, I'm sure I could get some others in my field to begin using it.

Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Saturday, May 9, 2020 5:06 PM
To: Michael Himes <mh...@knights.ucf.edu>>; dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

This is just awesome!   Would you be interested in becoming a committer on our project?  It is not automatic, but we could work with you to bring you up to speed on the other sketches in the library.  If you could help us integrate DataSketches into NumPy and possibly SciPy (not sure if this is necessary) it would be a very significant contribution and we would definitely want you to be part of our community!

Thanks,

Lee.

On Sat, May 9, 2020 at 1:41 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Lee,

Thanks for the notice, I went ahead and subscribed to the list.

As for Jon's email, this is actually what I have currently implemented!  Once I finish ironing out a couple improvements, I'm going to move some code around to follow the existing coding style, put it on Github, and submit a pull request.

Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Saturday, May 9, 2020 4:22 PM
To: Michael Himes <mh...@knights.ucf.edu>>
Subject: Fwd: Permission to use KLL streaming-quantiles code in free open-source academic software

Hi Michael,
I don't think you saw this email as I doubt you are subscribed to our dev@datasketches.apache.org<ma...@datasketches.apache.org> email list.

We would like to have you as part of our larger community, as others might also have suggestions on how to move your project forward.
You can subscribe by sending an empty email to dev-subscribe@datasketches.apache.org<ma...@datasketches.apache.org>.

Lee.

---------- Forwarded message ---------
From: Jon Malkin <jo...@gmail.com>>
Date: Thu, May 7, 2020 at 4:11 PM
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software
To: <de...@datasketches.apache.org>>
Cc: Lee Rhodes <lr...@verizonmedia.com>>, Edo Liberty <ed...@gmail.com>>, edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>


We're using pybind11 to get a C++ interface with python (vs raw C). The wrappers themselves are quite thin, but they do have examples of calling functions defined in the wrapper as opposed to only the sketch object.

I believe the easiest way to do this will be to define a pretty simple C++ object and create a pybind wrapper for it.  That object would contain a std::vector<kll_sketch>.  Then you'd define an update method for your custom object that iterates through a numpy array and calls update() on the appropriate sketch. You'd also want to define something similar for get_quantile() or whatever other methods you need that iterates through that vector of sketches and returns the result in a numpy array.

That's a pretty lightweight object. And then you'd use a similar thin pybind wrapper around it to make it play nicely with python. Since our C++ library is just templates, you'd end up with a free-standing library, with no requirement that the base datasketches library be involved.
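
For illustration only, here is the same idea expressed in pure Python on top of the existing kll_floats_sketch binding -- the real proposal is to run these loops in C++ behind the pybind11 wrapper, which is what makes it fast; the class and attribute names below are placeholders:

import numpy as np
from datasketches import kll_floats_sketch

class VectorOfKllSketches:
    def __init__(self, k, d):
        # one independent sketch per dimension
        self._sketches = [kll_floats_sketch(k) for _ in range(d)]

    def update(self, values):
        # values: length-d array-like, one entry per dimension
        for sk, v in zip(self._sketches, values):
            sk.update(float(v))

    def get_quantile(self, rank):
        # returns a length-d numpy array, one quantile estimate per dimension
        return np.array([sk.get_quantile(rank) for sk in self._sketches])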

  jon

On Thu, May 7, 2020 at 1:08 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
I would be happy to share whatever I come up with (if anything).  The lack of a Numpy/Scipy implementation is what led me to the DataSketches library; it would be very useful to me and others if it were a part of Numpy/Scipy.

For what it's worth, passing in a Numpy array and manipulating it from the C++ side is quite easy.  On the other hand, figuring out how to spawn m sketches and pass the values along to that looks like it'll be more challenging; there is a lot of code here and it'll take some time for me to familiarize myself with it.

Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Thursday, May 7, 2020 12:00 PM
To: Michael Himes <mh...@knights.ucf.edu>>
Cc: Edo Liberty <ed...@gmail.com>>; dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>; edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

If you do figure out how to do this, it would be great if you could share it with us.  We would like to extend it to other sketches and submit it as added functionality to NumPy.  I have been looking at the NumPy and SciPy libraries and have not found anything close to what we have.

Lee.


On Thu, May 7, 2020 at 7:08 AM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Lee, Jon,

Thanks for the information.  I tried to vectorize things this morning and ran into that exact problem -- since the offsets can differ, it leads to slices of different lengths, which wouldn't be possible to store as a single Numpy array.

Lee, your understanding of my problem is spot on.  n vectors of size m, where all m elements of each vector are a float (no NaNs or missing values).  I am interested in quantiles at rank r for each of the m streams.  Only 1 sketch will operate simultaneously, saving/loading the sketch is not required (though it would be a nice feature), and sketches would not need to be merged (no serialization/deserialization).

Not surprisingly, it looks like your original suggestion of handling this on the C++ side is the way to go.  Once I have time to dive into the code, my plan is to write something that implements what you described in the earlier email.

Thanks,
Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Wednesday, May 6, 2020 10:43 PM
To: Michael Himes <mh...@knights.ucf.edu>>
Cc: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>; Edo Liberty <ed...@gmail.com>>; edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>

Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Michael,

One of my colleagues, Jon Malkin, pointed out that the vector-KLL will not work for another reason and that is for each dimension, choosing whether to delete the odd or even values in the compactor must be random and independent of the other dimensions.  Otherwise you might get unwanted correlation effects between the dimensions.

This is another argument that you should have independent compactors for each dimension.  So you might as well stick with individual sketches for each dimension.

Lee.

On Wed, May 6, 2020 at 4:39 PM Lee Rhodes <lr...@verizonmedia.com>> wrote:
Michael,

Allow me to back up for a moment to make sure I understand your problem.

You have a large number of large vectors of the form V_n = {x_i}:  n vectors of size m, where x is a number and x_i is the ith element, or equivalently, the ith dimension.

Assumptions:

  *   All vectors, V, are of the same size m.
  *   All elements, x_i, are valid numbers of the same type. No missing values, and if you are using floats, this means no NaNs.

In aggregate, the n vectors represent m independent distributions of values.

Your task is to be able to obtain m quantiles at rank r in a single query.

****
To do this, using your idea, would require vectorization of the entire sketch and not just the compactors.  The inputs are vectors, the result of operations such as getQuantile(r), getQuantileUpperBound(r), getQuantileLowerBound(r), are also vectors.

This sketch will be a large data structure, which leads to more questions ...

  *   Do you anticipate having many of these vectorized sketches operating simultaneously?
  *   Is there any requirement to store and later retrieve this sketch?
  *   Or, the nearly equivalent question: Do you require merging of these sketches (across clusters, for example)?  Which also means serialization and deserialization.

I am concerned that this vector-quantiles sketch would be limited in the sense that it may not be as widely applicable as it could be.

Our experience with real data is that it is ugly with missing values, NaN, nulls, etc.  Which means we would not be able to vectorize the compactor.  Each dimension i would need a separate independent compactor because the compaction times will vary depending on missing values or NaNs in the data.

Spacewise, I don't think having separate independent sketches for each dimension would be much smaller than vectorizing the entire sketch, because the internals of the existing sketch are already quite space efficient leveraging compact arrays, etc.

As a first step I would favor figuring out how to access the NumPy data structure on the C++ side, having individual sketches for each dimension, and doing the iterations updating the sketches in C++.   It also has the advantage of leveraging code that exists and it would automatically be able to leverage any improvements to the sketch code over time.  In addition, it could be a prototype of how to integrate other sketches into the NumPy ecosystem.

A fully vectorized sketch would be a separate implementation and would not be able to take advantage of these points.

Lee.

On Wed, May 6, 2020 at 2:47 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Lee,

I don't think there is a problem with the DataSketches library, just that it doesn't support what I am trying to do -- looking in the documentation, it only supports streams of ints or floats, and those situations work fine for me.  Here's what I did:
- began with the KLL test .py file: https://github.com/apache/incubator-datasketches-cpp/blob/master/python/tests/kll_test.py
- replaced line 30 with kll.update(np.ones(10) * randn())  to have a Numpy array of 10 identical values.
- ran the code

This leads to the following error, as expected:
TypeError: update(): incompatible function arguments. The following argument types are supported:
    1. (self: datasketches.kll_floats_sketch, item: float) -> None

Invoked with: <datasketches.kll_floats_sketch object at 0x7f1e128989d0>, array([-1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424,
       -1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424])

It's not coded to support Numpy arrays, therefore it complains.  What I would ideally like to have happen in this scenario is it would treat each element in the array as a separate stream.  Then, later when getting a given quantile, it would give 10 values, one for each stream.  I don't see an easy approach to implementing this on the Python side besides a very slow iterative approach, and admittedly my C++ is quite rusty so I haven't looked into the codebase to see how I might modify things there to support this functionality.

Re: the streaming-quantiles code being easily modified, I believe the only necessary changes would be changing the Compactor class to be a subclass of numpy.ndarray, rather than list, and implementing methods for the list-specific methods that are used, like .append().  Then, it isn't necessary to loop over the streams since we can make use of Numpy's broadcasting, which will handle the looping in its C++ code, as you mentioned.  I'll work on this and see if it really is as straight-forward as it seems.

If you have any advice on how to use DataSketches for my problem, I'm certainly open to that.

Thanks,
Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Wednesday, May 6, 2020 4:37 PM
To: Michael Himes <mh...@knights.ucf.edu>>; dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Cc: Edo Liberty <ed...@gmail.com>>; edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Michael,

Thank you for considering the DataSketches library.   I am adding this thread to our dev@datasketches.apache.org<ma...@datasketches.apache.org> so that our whole team can contribute to finding a solution for you.

WRT the error you experienced, please help us help you by sharing with us what the exact error was.

We are about to release a major upgrade to the DataSketches C++/Python product in the next few weeks.  We have fixed a number of stability issues and bugs, which may solve the problem.  Nonetheless, we want to work with you to get your problem solved.

Updating 1e5 sketches in a system is not a problem in Java or C++.   We have real-time systems today that generate and process over 1e9 sketches every day.  Unfortunately our experience tells us that looping in Python code will be 10 to 100 times slower than Java or C++.  This is because the code would have to switch from Python to C++ for every vector element.

By comparison, the streaming-quantiles code could be easily modified to use Numpy arrays and operate on vectors.

I would like to understand more about what you have in mind that would be "easily modified".

NumPy achieves its speed performance by doing all of the matrix operations in pre-compiled C++ code.  To achieve best performance, we would want to read and loop through the NumPy data structure on the C++ side leveraging the C++ DataSketches library directly.  I am not sure what would be involved to actually accomplish that.

But first we need to get your Python + NumPy code working correctly with our library so we can find out what its actual performance is.

Cheers,

Lee.





On Wed, May 6, 2020 at 12:10 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Edo, Lee,

Thanks for the prompt response.  I looked at the datasketches library, and while it seems to have a lot more features, it looks like it'll be a lot more difficult to get it to work for my desired use case.

My problem is that I need quantiles for each element of a vector (length on the order of 1e4 -- 1e5), for some finite stream of vectors (on the order of 1e6 -- 1e8).  I tried using datasketches's KLL with Numpy arrays, but it throws an error, so it doesn't seem like datasketches handles this situation currently.

To use datasketches, I think I would need to instantiate 1 object per vector element, and I suspect this will slow things down considerably due to iterating over the objects when each vector is processed.  By comparison, the streaming-quantiles code could be easily modified to use Numpy arrays and operate on vectors.  I ran a few unit tests on both codes and found equivalent behavior, as expected.

Do you have any recommendation(s) for this situation?  Are there known limitations of the streaming-quantiles code that would cause issues for my use case?  Are the other methods offered in datasketches 'better' than the KLL implemented in streaming-quantiles?  I'm quite out of my area of expertise, so I appreciate any advice you can offer, and I will of course acknowledge it in the publication.

Best,
Michael

________________________________
From: Edo Liberty <ed...@gmail.com>
Sent: Tuesday, May 5, 2020 8:09 PM
To: Lee Rhodes <lr...@verizonmedia.com>; Michael Himes <mh...@knights.ucf.edu>
Cc: edo@edoliberty.com <ed...@edoliberty.com>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

+Lee

Hi Michael, Thanks for reaching out.
While you can certainly do that, I recommend using the Python-bound datasketches library. It will be more robust, faster, and bug-free than my code :)

On Tue, May 5, 2020 at 14:11 Michael Himes <mh...@knights.ucf.edu> wrote:
Hi Edo,

I'm currently working on a Python package for machine-learning-accelerated exoplanet modeling.  It is free and open source (see here if you're curious https://github.com/exosports/HOMER), and it's meant purely for reproducible academic research.

I'm adding some new features to the software, and one of them requires computing quantiles for a data set that cannot fit into memory.  After searching around for different methods to do this, your KLL method seemed to be a good option in terms of speed and space requirements.

Rather than reinvent the wheel and code my own implementation of the method from scratch, I was wondering if you'd be willing to allow me to use your code?  I don't see a license, so I wanted to make sure you're okay with this.  I could implement it as a submodule within my repo, or I could only include the kll.py file and add some additional comments pointing to your repo and such, whichever you prefer.

Best,
Michael
--
From my cell phone.

Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Posted by Jon Malkin <jo...@gmail.com>.
Oh, and for merging, I'll make sure that both objects have the same number
of dimensions and then merge things in. Should be fairly straightforward.
Not going to support selectively merging individual dimensions, at least
for now.
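
From the Python side, merging would then presumably look something like the sketch below -- hypothetical usage only, with merge() assumed to mirror kll_sketch.merge() and the kll_floatarray_sketches name taken from the branch being tested further down this thread:

import numpy as np
from datasketches import kll_floatarray_sketches  # Python class name from the current branch

k, d = 160, 3
a = kll_floatarray_sketches(k, d)
b = kll_floatarray_sketches(k, d)
for _ in range(1000):
    a.update(np.random.randn(d))
    b.update(np.random.randn(d))
a.merge(b)  # valid only because a and b have the same number of dimensions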


  jon

On Tue, May 26, 2020 at 4:52 PM Jon Malkin <jo...@gmail.com> wrote:

> Thanks for that!
>
> We discussed things a bit on the ASF slack dev channel (datasketches-dev)
> and we'll go with vector_of_kll_sketches as the c++ object name. Probably
> something similar in python. So gotta do that, and then clean up unit test
> names. But it's in pretty good shape so far.
>
>   jon
>
> On Tue, May 26, 2020 at 7:57 AM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
>> That's a great motto to code by!
>>
>> I adapted the existing unit tests for kll_sketch to work for the new
>> kll_sketches class, and everything seems to be working as intended.  Some
>> things are not implemented -- merging, and the normalized_rank_error method
>> (note that there is the get_normalized_rank_error static method) -- and are
>> therefore not tested.  Once they are implemented into the class, then those
>> tests can be added.
>>
>> I've submitted a pull request, let me know if there are any other tests
>> you'd like before it's considered tested & working.
>>
>> Michael
>> ------------------------------
>> *From:* Jon Malkin <jo...@gmail.com>
>> *Sent:* Tuesday, May 26, 2020 1:53 AM
>> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> I think it now works for quantiles, rank, pmf, and cdf.
>>
>> This exercise is a good example of why my colleague operates by the motto
>> that if it isn't tested, it's broken. In very much related news, we need
>> unit tests for this thing, in either C++ or python (probably the latter
>> unless we move it into the core C++ part of the repo).
>>
>>   jon
>>
>> On Mon, May 25, 2020 at 2:06 PM Michael Himes <mh...@knights.ucf.edu>
>> wrote:
>>
>> Ah gosh, that was silly on my part.
>>
>> So, I ran the previous code without that silly mistake, then called
>> kll.get_quantiles(0.5) and it threw this error:
>>
>> TypeError: get_quantiles(): incompatible function arguments. The
>> following argument types are supported:
>>     1. (self: datasketches.kll_floatarray_sketches, fractions:
>> List[float], isk: numpy.ndarray[int32] = -1) -> array
>>
>> Invoked with: <datasketches.kll_floatarray_sketches object at
>> 0x7f610ce7de30>, 0.5
>>
>> I also tried kll.get_quantiles([0.5]) and the Numpy array equivalent, and
>> it throws this error:
>>
>> ValueError: array has incorrect number of dimensions: 0; expected 1
>>
>> This error happens even when I do kll.get_quantiles([0.5, 0.7]) or the
>> Numpy array equivalent, even though it has 1 dimension, not 0.
>>
>> Michael
>> ------------------------------
>> *From:* Jon Malkin <jo...@gmail.com>
>> *Sent:* Monday, May 25, 2020 4:53 PM
>> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> That's the range() command complaining -- 1e6 is a float, but range wants
>> an int. It worked if I instead changed the line to
>> for i in range(int(1e6)):
>>
>>   jon
>>
>> On Mon, May 25, 2020 at 1:36 PM Michael Himes <mh...@knights.ucf.edu>
>> wrote:
>>
>> Hi Jon,
>>
>> Just got around to testing it out.  Maybe I am doing something wrong
>> here, but I can't get the code to work correctly.  Here's the code:
>>
>> import numpy as np
>> from datasketches import kll_floatarray_sketches
>> k = 160
>> d = 3
>> kll = kll_floatarray_sketches(k, d)
>> for i in range(1e6):
>>   kll.update(np.random.randn(d))
>>
>> And here's the error:
>>
>> TypeError: 'float' object cannot be interpreted as an integer
>>
>> Seems like the inputs have changed, but the inputs in the code look
>> pretty similar.  Can you point out what I'm doing wrong here?
>>
>> Michael
>> ------------------------------
>> *From:* Jon Malkin <jo...@gmail.com>
>> *Sent:* Friday, May 22, 2020 6:21 PM
>> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> Hi Michael,
>>
>> My default is to treat an input vector x as a column vector
>> -- the generic quadratic form x^T A x assumes that, for instance. But might
>> be an engineering thing. Following your approach for now and eventually we
>> can debate whether to transpose the matrix if one dimension matches the
>> number of sketches in the object but not the expected one.
>>
>> Anyway, I looked more at the docs and see them using unchecked references
>> (after doing a bounds check) so I switched to that, and then I added in a
>> check for c-style vs fortran-style indexing so that I believe it'll have
>> the inner loop over the native dimension. In theory it'll walk linearly
>> through the matrix. That or I got it exactly backwards and am thrashing
>> some cache level, one of the two :D
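
For anyone following along from the Python side, the layout difference being handled here is plain NumPy behavior and can be inspected (or forced) before calling update(); this snippet is not part of the wrapper:

import numpy as np

batch = np.random.randn(128, 3)           # row-major (C-order) by default
batch_f = np.asfortranarray(batch)        # same values, column-major layout

print(batch.flags['C_CONTIGUOUS'])        # True
print(batch_f.flags['C_CONTIGUOUS'])      # False
batch_c = np.ascontiguousarray(batch_f)   # copies back to row-major if needed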
>>
>> If you have some time, please check out the branch and play with it for a
>> bit to ensure it's still behaving as you expect. Then we can figure out
>> some relevant unit tests.
>>
>>   jon
>>
>>
>>
>> On Fri, May 22, 2020 at 7:06 AM Michael Himes <mh...@knights.ucf.edu>
>> wrote:
>>
>> Jon,
>>
>> Those changes sound great, as long as the data is being accessed
>> correctly. The pybind docs warn about accessing data through the array_t
>> object since it's not guaranteed to be contiguous in memory.  Typically,
>> they demonstrate accessing it through the buffer, which I followed.  But if
>> this is an unnecessary step, then great.
>>
>> As for the 2D case, here is my line of thinking.  For 1D, we have a
>> single row with d values.  So for 2D, we'd have n rows with d values, (n x
>> d).  I believe that is how I coded it, but it's possible I flipped the
>> dimensions.
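
In other words, assuming the wrapper accepts a 2-D batch as discussed here, feeding n points at once would look roughly like:

import numpy as np
from datasketches import kll_floatarray_sketches  # class name from this thread's branch

n, d = 128, 3
kll = kll_floatarray_sketches(160, d)
kll.update(np.random.randn(n, d))  # n rows, each holding one value per stream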
>>
>> Michael
>>
>> ------------------------------
>> *From:* Jon Malkin <jo...@gmail.com>
>> *Sent:* Thursday, May 21, 2020 7:17 PM
>> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> Michael,
>>
>> I've restructured the object to be an actual C++ object with proper
>> methods. And then I've gotten rid of all the casts to buffer in favor of
>> just using the py::array_t<> that's passed in. That removes casting
>> everything to double, and allows for range checks. Now an attempt to access
>> sketch 7 in a 5-d array doesn't just segfault :)
>>
>> Looking at pybind docs a bit more, it seems there are no hard guarantees
>> on data layout in memory with numpy arrays -- if you transpose one, walking
>> through with a pointer will return items in the wrong order. So update()
>> ends up using items.at() instead (more on that in a moment). The whole thing is probably also
>> copying values around more than necessary. Anyway, we can look at ways to
>> optimize such things eventually, but for now I'm working on ensuring
>> correctness and at least somewhat graceful failure.
>>
>> Anyway, item input order. If we have 1-d input, we implicitly assume we
>> want d updates, one for each dimension in the object. It seems like the
>> default for numpy is row-major order, which makes sense given C beneath the
>> hood. But for inputting n points at a time, do you expect the matrix to be
>> (d x n) or (n x d)?
>>
>>   jon
>>
>> On Tue, May 19, 2020, 5:20 PM Michael Himes <mh...@knights.ucf.edu>
>> wrote:
>>
>> Re: the template type A, I set that for the Python array data type.  A
>> Python float is 64 bits, so that is a C++ double.  I thought it was
>> necessary to set the py::array_t data type since I think it's a template,
>> but I could be mistaken.
>>
>> Michael
>>
>> ------------------------------
>> *From:* leerho <le...@gmail.com>
>> *Sent:* Tuesday, May 19, 2020 7:46 PM
>> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> Excellent work!
>>
>> On Tue, May 19, 2020 at 4:04 PM Jon Malkin <jo...@gmail.com> wrote:
>>
>> I also used k=160, so in this case we matched nicely. And the bunches of
>> 2^5 or 2^7 you were testing is exactly what I meant when referring to
>> batched inputs. So that's good news.
>>
>> I'll take a more careful look through the code -- there was something
>> with update using arrays of templated type A which was always cast to
>> double, for instance. But this is certainly promising.
>>
>>   jon
>>
>> On Tue, May 19, 2020 at 3:32 PM Michael Himes <mh...@knights.ucf.edu>
>> wrote:
>>
>> Great tests (especially with the ordering), Jon!
>>
>> I did some scaling tests for dimensionality (1, 10, and 100), and this is
>> where I think the Numpy version shows its benefits.  I performed a test
>> similar to your setup:
>> - each sketch has k = 160 (unsure what you used for this value, if it
>> matters)
>> - 2^25 draws from a normalized Gaussian distribution (numpy.random.normal)
>> - get_quantiles(0.5)
>>
>> d=1    -- 84 s (this is the 123 s case you ran)
>> d=10   -- 88 s
>> d=100  --  294 s
>> d=1000 -- 2298 s (did this one for fun, but there is a lot of variability
>> in runtime)
>>
>> Note that I did not use a single-value method, just the Numpy version.
>> Also, I checked the compute cost of the Python loop, and it's about 1
>> second, so most of that ~80 seconds is the communication between Python and
>> C++.  The scaling relation looks to be better than linear, but there needs
>> to be a few more tests here to really determine that.
>>
>> But, as Lee pointed out, there is non-negligible overhead from crossing
>> the bridge between Python and C++.  It's small, but when doing it 2^25
>> times it adds up.  The Numpy implementation allows you to cross that bridge
>> much less often, albeit at the cost of some extra time programming that
>> part.  If I set up a queue that holds 2^5 values and then updates it, it's
>> quite a bit better.  Here are the results for the same dimensions as before:
>>
>> d=1   -- 8 s
>> d=10  -- 31 s
>> d=100 -- 257 s
>>
>> So, even with a small queue of 32 values, we see that a single sketch
>> using kll_sketches is faster than a kll_sketch by a factor of 2-3.  And
>> with the batch set to 2^7 values (this is how I use it in my project):
>> d=1   -- 4.2 s
>> d=10  -- 27 s
>> d=100 -- 251 s
>>
>> The speed gain doesn't seem to scale with dimensionality, but I think
>> that has more to do with the compute overhead of generating the data since
>> Numpy tends to be faster when working in 1D vs multiple dimensions.  But we
>> can see that it's possible to get runtimes much closer to C++ runtimes than
>> would be expected.
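
For reference, the queueing approach described above amounts to something like the following sketch (batch size 2^7 as in the last test; class name from this thread's branch):

import numpy as np
from datasketches import kll_floatarray_sketches

k, d, batch = 160, 100, 2**7
kll = kll_floatarray_sketches(k, d)
queue = np.empty((batch, d))
for i in range(2**25):                # 2^25 is a multiple of the batch size, so no leftover flush is needed
    queue[i % batch] = np.random.randn(d)
    if (i + 1) % batch == 0:
        kll.update(queue)             # one Python/C++ crossing per 128 vectors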
>>
>> Michael
>> ------------------------------
>> *From:* Jon Malkin <jo...@gmail.com>
>> *Sent:* Tuesday, May 19, 2020 4:58 PM
>> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> Well, one thought was maybe we could always use the vectorized kll in
>> python and make it (relatively) easy to have it work with only 1 dimension.
>> It looks like there's still a non-trivial performance hit from that. But
>> wow.. I realized I could try something simple like reversing the
>> declaration order of single-update vs vector-update in the wrapper class.
>> And that dropped it to 35s!
>>
>> With that, it may be worth exploring a unified wrapper that handles
>> single items or vectors.
>>
>>   jon
>>
>> On Tue, May 19, 2020 at 1:52 PM leerho <le...@gmail.com> wrote:
>>
>> We had a similar issue in Java trying to use JNI to access C code.  Every
>> transition across the "boundary" between Java and C took from 10 to 100
>> microseconds.  This made the JNI option pretty useless from our
>> standpoint.
>>
>> I don't know python that well, but I could well imagine that there may be
>> a similar issue here in moving data between Python and C++.
>>
>> That being said, compared to brute-force computation of these types of
>> queries in Python vs using even these (what we consider slow performing)
>> sketches in Python still may be a huge win.
>>
>> Lee.
>>
>>
>>
>> On Tue, May 19, 2020 at 1:28 PM Jon Malkin <jo...@gmail.com> wrote:
>>
>> I tried comparing the performance of the existing floats sketch vs the
>> new thing with a single dimension. And then I made a second update method
>> that handles a single item rather than creating an array of length 1 each
>> time. Otherwise, the scripts were as identical as possible. I fed in 2^25
>> gaussian-distributed values and queried for the mean to force some
>> computation on the sketch. I think get_quantile(0.5) vs
>> get_quantiles(0.5)[0][0] was the only difference.
>>
>> Existing kll_floats_sketch: 31s
>> kll_floatarray_sketches: 123s
>> with single-item update: 80s
>>
>> Same test in c++: 1.7s  (I can get it to 1.4s but that's using a worse
>> RNG so this seemed more fair)
>>
>> I didn't try anything with trying to batch updates, even though in theory
>> the new object can support that. This was more a test to see the
>> performance impact of using it for all kll sketches.
>>
>> At some level, if you're already ok taking the speed hit for python vs
>> C++ then maybe it doesn't matter. But >2x still seems significant to me.
>>
>>   jon
>>
>> On Thu, May 14, 2020 at 6:54 PM Michael Himes <mh...@knights.ucf.edu>
>> wrote:
>>
>> Great, I'll be submitting the pull request shortly.  The codebase I'm
>> working with doesn't have any of the changes made in the past week or so,
>> hopefully that isn't too much of a hassle to merge.
>>
>> As an aside, my employer encourages us to contribute code to libraries
>> like this, so I'm happy to work on additional features for the Python
>> interface as needed.
>>
>> Michael
>> ------------------------------
>> *From:* Jon Malkin <jo...@gmail.com>
>> *Sent:* Thursday, May 14, 2020 6:56 PM
>> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> We've been polishing things up for a release, so that was one of several
>> things that we fixed over the last several days. Thank you for finding it!
>>
>> Anyway, if you're generally happy with the state of things (and are
>> allowed to under any employment terms), I'd encourage you to create pull
>> request to merge your changes into the main repo. It doesn't need to be
>> perfect as we can always make changes as part of the PR review or
>> post-merge.
>>
>> Thanks,
>>   jon
>>
>>
>> On Mon, May 11, 2020 at 2:25 PM Michael Himes <mh...@knights.ucf.edu>
>> wrote:
>>
>> Thanks for taking a look, Jon.
>>
>> I pushed an update that address 2 & 4.
>>
>> #3 is actually something I had a question about. I've tested passing
>> numpy.nan into the update function, and it doesn't appear to break anything
>> (min, max, etc all still work correctly).  However, the reported number of
>> items per sketch counts the nan entries.  Is this the expected behavior, or
>> should the get_n() method return a number that does not count the nans it
>> has seen?  I expected the latter, so I'm worried that numpy's nan is being
>> treated differently.
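
For reference, the check in question is essentially the following; the printed value is exactly what is being asked about, so no expected output is asserted here:

import numpy as np
from datasketches import kll_floatarray_sketches  # class name from this thread's branch

kll = kll_floatarray_sketches(160, 3)
kll.update(np.array([1.0, np.nan, 2.0]))
print(kll.get_n())  # does the count include the NaN fed to the second stream, or skip it?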
>>
>> Michael
>> ------------------------------
>> *From:* Jon Malkin <jo...@gmail.com>
>> *Sent:* Monday, May 11, 2020 4:32 PM
>> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> I didn't look in super close detail, but the code overall looks pretty
>> good. Comments are below.
>>
>> Note that not all of these necessarily need changes or replies. I'm just
>> trying to document things we'll want to think about for keeping the library
>> general-purpose (and we can always make changes after merging, of course).
>>
>> 1. I worry the name kll_sketches is confusingly similar to kll_sketch.
>> Maybe vector_kll_sketches? But if there's a way to extend KLL in the future
>> to operate on an entire vector at a time (vs treating each dimension
>> independently) that'd become confusing. I think an inherently vectorized
>> version would be a very different beast, but I always worry I'm not being
>> imaginative enough. If merging into the Apache codebase, I'd probably wait
>> to see what the file looks like with the renaming before a final decision
>> on moving to its own file.
>>
>> 2. What happens if the input to update() has >2 dimensions? If that'd be
>> invalid, we should explicitly check and complain. If it'll Do The Right
>> Thing by operating on the first 2 dimensions (meaning correct indices)
>> that's fine, but otherwise should probably complain.
>>
>> 3. Can this handle sparse input vectors? Not sure how important that is
>> in general, even if your project doesn't require it. kll_sketch will ignore
>> NaNs, so those appearing would mean the number of items per sketch can
>> already differ.
>>
>> 4. I'd probably eat the very slightly increased space and go with 32 bits
>> for the number of dimensions (aka number of sketches). If trying to look at
>> a distribution of values for some machine learning application, it'd be
>> easy to overflow 65k dimensions for some tasks.
>>
>> 5. I imagine you've realized that it's easiest to do unit tests from
>> python in this case. That's another advantage of having this live in the
>> wrapper.
>>
>> 6. Finally, that assert issue is already obsolete :). Asserts were
>> converted if/throw exceptions late last week. It'll be flagged as a
>> conflict in merging, so no worries for now.
>>
>> Looking good at this point. And as I said, not all of these need changes
>> or comments from you.
>>
>>   jon
>>
>> On Mon, May 11, 2020 at 7:09 AM Michael Himes <mh...@knights.ucf.edu>
>> wrote:
>>
>> Understood, I went ahead and moved the new class to the kll_wrapper.cpp
>> file -- I'll leave it to you to decide if it's better as its own file.
>>
>> Also, while gcc 7.4.0 compiles the code without issue, using gcc 7.5.0
>> throws errors regarding the assert calls in kll_sketch_impl.hpp.  I added
>> an include of assert.h there and then it compiled without issue.  It's
>> possible that other compilers will also complain about that, so maybe this
>> is a good update to the main branch.
>>
>> Michael
>> ------------------------------
>> *From:* Jon Malkin <jo...@gmail.com>
>> *Sent:* Sunday, May 10, 2020 10:47 PM
>> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> My only comment without having looked at actual code is that the new
>> class would be more appropriate in the python wrapper. Maybe even drop it
>> in as its own file, as that would decrease recompile time a bit when
>> debugging (that's pybind's suggestion, anyway). Probably not a huge
>> difference with how light these wrappers are.
>>
>> If this is something that becomes widely used, to where we look at
>> pushing it into the base library, we'd look at whether we could share any
>> data across sketches. But we're far from that point currently. It'd be nice
>> to get to the point where we need to consider that.
>>
>>   jon
>>
>> On Sun, May 10, 2020, 7:33 PM leerho <le...@gmail.com> wrote:
>>
>> Michael,  this has been a great interchange and certainly will allow us
>> to move forward more quickly.
>>
>> Thank you for working on this on a Mother's Day Sunday!
>>
>> I'm sure Alex and Jon may have more questions, when they get a chance to
>> look at it starting tomorrow.
>>
>> Cheers, and be safe and well!
>>
>> Lee.
>>
>> On Sun, May 10, 2020 at 6:25 PM Michael Himes <mh...@knights.ucf.edu>
>> wrote:
>>
>> Re: testing, so far I've just done glorified unit tests for uniform and
>> normal distributions of varying sizes.  I plan to do some timing tests vs
>> the existing single-sketch Python class to see how it compares for 1, 10,
>> and 100 streams.
>>
>> 1. That makes sense.  One option to allow full Numpy compatibility but
>> without requiring a Python user to use Numpy would be to return everything
>> as lists, rather than Numpy arrays.  Numpy users could then convert those
>> lists into arrays, and non-Numpy users would be unaffected (aside from
>> needing the pybind11/numpy.h header).  Alternatively, some flag could be
>> set when instantiating the object that would control whether things are
>> returned as lists or arrays, though this still requires the numpy.h header
>> file.
>>
>> 2. I didn't change the kll_sketch code, I only defined a new (wrapper)
>> class called kll_sketches, which spawns a user-specified number of
>> sketches.  Each of those sketches are kll_sketch objects and uses all of
>> the existing code for that.  For fast execution in Python, the parallel
>> sketches must be spawned in C++, but the existing Python object could only
>> spawn a single sketch since it wraps the kll_sketch class.  Perhaps the
>> kll_sketches class would be better placed in the python/src/kll_wrapper.cpp
>> file?  I suppose you wouldn't need this class if you weren't using Python.
>>
>> 3. Yes, SerDe is very straight-forward here.  I've marked some stuff as
>> todo's, and that is one of them -- the plan is to do like you described and
>> call the relevant kll_sketch method on each of the sketches and return that
>> to Python in a sensible format.  For deserialization, it would just iterate
>> through them and load them into the kll_sketches object.  I don't require
>> it for my project, so I didn't bother to wrap that yet -- I'll take a look
>> sometime this week after I finish my work for the day, shouldn't take long
>> to do.
>>
>> 4. That makes sense.  Does using Numpy complicate that at all?  My
>> thought is that since under the hood everything is using the existing
>> kll_sketch class, it would have full compatibility with the rest of the
>> library (once SerDe is added in).
>>
>> Michael
>> ------------------------------
>> *From:* leerho <le...@gmail.com>
>> *Sent:* Sunday, May 10, 2020 8:42 PM
>> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> Thanks for the link to your code.  My colleagues, Jon and Alex, will take
>> a closer look this next week.  They wrote this code so they are much closer
>> to it than I.
>>
>> What you have done so far makes sense for you as you want to get this
>> working in the NumPy environment as quickly as possible.  As soon as we
>> start thinking about incorporating this into our library other concerns
>> become important.
>>
>> 1. Adding API calls is the recommended way to add functionality (like
>> NumPy) to a library.  We cannot change API calls in a way that is only
>> useful with NumPy, because it would seriously impact other users of the
>> library that don't need NumPy.  If both sets of calls cannot simultaneously
>> exist in the same sketch API, then we need to consider other alternatives.
>>
>> 2.  Based on our previous discussions, I didn't envision that you would
>> have to change the kll_sketch code itself other than perhaps a "wrapper"
>> class that enables vectorized input to a vector of sketches and a
>> vectorized get result that creates a vector result from a vector of
>> sketches.  This would isolate the changes you need for NumPy from the
>> sketch itself.  This is also much easier to support, maintain and debug.
>>
>> 3. If you don't change the internals of the sketch then SerDe becomes
>> pretty straightforward. I don't know if you need a single serialization
>> that represents a full vector of sketches,  but if you do, then I would
>> just iterate over the individual serdes and figure out how to package it.
>> I really don't think you want to have to rewrite this low-level stuff.
>>
>> 4. Binary compatibility is critically important for us and I think will
>> be important for you as well.  There are two dimensions of binary
>> compatibility: history and language.  This means that a kll sketch
>> serialized from Java can be successfully read by C++ and vice versa.
>> Similarly, a kll sketch serialized today will be able to be read many years
>> from now.     Another aspect of this would mean being able to collect, say,
>> 100 sketches that were not created using the NumPy version, and being able
>> to put them together in a NumPy vector; and vice versa.
>>
>> I hope all of this make sense to you.
>>
>> Cheers,
>>
>> Lee.
>>
>>
>>
>> On Sun, May 10, 2020 at 4:21 PM leerho <le...@gmail.com> wrote:
>>
>> Michael,
>> This is great!  What testing have you been able to do so far?
>>
>>
>> On Sun, May 10, 2020 at 3:31 PM Michael Himes <mh...@knights.ucf.edu>
>> wrote:
>>
>> Lee,
>>
>> Thanks for all of that information, it's quite helpful to get a better
>> understanding of things.
>>
>> I've put the code on Github if you'd like to take a look:
>> https://github.com/mdhimes/incubator-datasketches-cpp
>>
>> Changes are
>> - new class in kll/include/kll_sketch.hpp, w/ associated constructor in
>> kll/include/kll_sketch_impl.hpp.  This class spawns a specified number of
>> sketches.
>> - new Python interface functions in python/src/kll_wrapper.cpp
>>
>> The only new dependency introduced is the pybind11/numpy.h header file.
>> The new Numpy-compatible Python classes retain identical functionality to
>> the existing classes (with minor changes to method names, e.g.,
>> get_min_value --> get_min_values), except that I have not yet implemented
>> merging or (de)serialization.  These would be straight-forward to
>> implement, if needed.
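
As a usage illustration of the renamed methods (get_min_values is named above; get_max_values is assumed by analogy, and the kll_floatarray_sketches class name is the one that appears later in this thread):

import numpy as np
from datasketches import kll_floatarray_sketches

d = 4
kll = kll_floatarray_sketches(160, d)
for _ in range(1000):
    kll.update(np.random.randn(d))
print(kll.get_min_values())  # one minimum per stream
print(kll.get_max_values())  # one maximum per stream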
>>
>> Re: characterization tests, I'll take a look at those tests you linked to
>> and see about running them, time and compute resources permitting.
>>
>> Michael
>> ------------------------------
>> *From:* leerho <le...@gmail.com>
>> *Sent:* Sunday, May 10, 2020 5:32 PM
>> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> Michael,
>>
>> Is there a place on GitHub somewhere where I could look at your code so
>> far?  The reason I ask, is before you do a PR, we would like to determine
>> where a contribution such as this should be placed.
>>
>> Our library is split up among different repositories, determined by
>> language and dependencies.  This keeps the user downloads smaller and more
>> focused.   We have two library repos for the core sketch algorithms, one
>> for Java and one for C++/Python, where the dependencies are very lean,
>> which simplifies integration into other systems.  We have separate repos
>> for adaptors, which depend on one of the core repos. On the Java side, we
>> have separate repos for adaptors for Apache Hive and Apache Pig, as the
>> dependencies for each of these are quite large.  For C++, we have a
>> dedicated repo for the adaptors for PostgreSQL.
>>
>> Some of our adaptors are hosted with the target system.  For example, our
>> Druid adaptors were contributed directly into Apache Druid.
>>
>> I assume your code has dependencies on Python, NumPy and
>> DataSketches-cpp. It is not clear to me at the moment whether we should
>> create a separate repo for this or have a separate group of directories in
>> our cpp repo.
>>
>> ****
>> We have a separate repo for our characterization code, which is not
>> formally "released" as an Apache release.  It exists because we want others
>> to be able to reproduce (or challenge) our claims of speed performance or
>> accuracy.  It is the one repo where we have all languages and many
>> different dependencies.  The coding style is not as rigorous or as well
>> documented as our repos that do have formal releases.
>>
>> Characterization testing is distinctly different from Unit Tests, which
>> basically checks all the main code paths and makes sure that the program
>> works as it should.  The key metric is code coverage and Unit Tests should
>> be fast as it is run on every check-in of new code.  Characterization is
>> also different from Integration Testing, which is testing how well the code
>> works when integrated into larger systems.
>>
>> Characterization tests are unique to our kind of library. Because our
>> algorithms are probabilistic in nature, in order to verify accuracy or
>> speed performance we need to run many thousands of trials to eliminate
>> statistical noise in the results.  And when the data is large, this can
>> take a long time.  You can peruse our website for many examples as all the
>> plots result from various characterization studies.  What appears on the
>> website is but a small fraction of all the testing we have done.
>>
>> There are no "standard" tests as every sketch is different so we have to
>> decide what is important to measure for a particular sketch, but the basic
>> groups are *speed* and *accuracy*.
>>
>> For speed there are many possible measurements, but the basic ones are
>> update speed, merge speed, Serialization / Deserialization speed, get
>> estimate or get result speeds.
>>
>> For accuracy we want to validate that the sketch is performing within the
>> bounds of the theoretical error distribution.  We want to measure this
>> accuracy in the context of a stand-alone, purely streaming sketch and also
>> in the context of merging many sketches together.
>>
>> We also try to do these same tests comparing the results against other
>> alternatives users might have.  We have performed these same
>> characterizations on other publically available sketches as well as against
>> traditional, brute-force approaches to solving the same problem.
>>
>> For the solution you have developed, we would depend on you to decide
>> what properties would be most important to characterize for users of this
>> solution.  It should be very similar to what you would write in a paper
>> describing this solution;  you want to convince the reader that this is
>> very useful and why.
>>
>> Since the first sketch you have leveraged is the KLL quantiles sketch, I
>> would think you would want some characterizations similar to what we did
>> for our studies
>> <https://datasketches.apache.org/docs/Quantiles/KLLSketch.html>
>> comparing our older quantiles sketch and the KLL sketch.
>>
>> ****
>> For the Java characterization tests, we have "standardized" on having
>> small configuration files which define the key parameters of the test.
>> These are simple text files
>> <https://github.com/apache/incubator-datasketches-characterization/tree/master/src/main/resources>
>> of key-value pairs.  We don't have any centralized definition of these
>> pairs, just that they are human readable and intelligible.  They are
>> different for each type of sketch.
>>
>> For the C++ tests, we don't have a collection of config files yet (this
>> is one of our TODOs), but the same kind of parameters are set in the code
>> itself.
>>
>> We will likely want to set up a separate directory for your
>> characterization tests.
>>
>> I hope you find this helpful.
>>
>> Cheers,
>>
>> Lee.
>>
>> On Sun, May 10, 2020 at 10:05 AM Michael Himes <mh...@knights.ucf.edu>
>> wrote:
>>
>> The code is in a good state now.  It can take individual values, lists,
>> or Numpy arrays as input, and it returns back Numpy arrays.  There are some
>> additional features, like being able to specify which sketches the user
>> wants to, e.g., get quantiles for.
>>
>> But, I have only done minor testing with uniform and normal
>> distributions.  I'd like to put it through more extensive testing (and some
>> documentation) before releasing it, and it sounds like your
>> characterization tests are the way to go -- it's not science if it's not
>> reproducible!  Is there a standard set of tests for this purpose?  If not,
>> are there standard tests that have been used for the existing codebase?
>>
>> Michael
>> ------------------------------
>> *From:* leerho <le...@gmail.com>
>> *Sent:* Saturday, May 9, 2020 7:21 PM
>> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> This is great.  The first step is to get your project working!  Once you
>> think you are ready, it would be really useful if you could do some
>> characterization testing in the NumPy environment. Characterization tests
>> are what we run to fully understand how a sketch performs over a range of
>> parameters and using thousands to millions of trials.  You can see some of
>> the accuracy and speed performance plots of various sketches on our
>> website.  Sometimes these can take hours to run.  We typically use
>> synthetic data to drive our characterization tests to make them
>> reproducible.
>>
>> Real data can also be used and one comparison test I would recommend is
>> comparing how long it takes to get approximate results using sketches
>> versus how long it would take to get exact results using brute force
>> methods.  The bigger the data set is the better :)
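
A toy version of that comparison in Python might look like the following (illustrative only; the per-item Python loop dominates the sketch timing, as discussed elsewhere in this thread):

import time
import numpy as np
from datasketches import kll_floats_sketch

data = np.random.randn(2**22)

t0 = time.time()
exact = np.quantile(data, 0.5)      # brute force: needs the whole array in memory
t1 = time.time()

sk = kll_floats_sketch(160)
for x in data:                      # streaming: bounded memory, one update per item
    sk.update(float(x))
approx = sk.get_quantile(0.5)
t2 = time.time()

print(exact, approx, t1 - t0, t2 - t1)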
>>
>> We don't have much experience with NumPy so this will be a new
>> environment for us.  But before you get too deep into this please get us
>> involved.  We have been characterizing these streaming algorithms for a
>> number of years, and would like to help you.
>>
>> Cheers,
>>
>> Lee.
>>
>> On Sat, May 9, 2020 at 2:18 PM Michael Himes <mh...@knights.ucf.edu>
>> wrote:
>>
>> I'm not quite sure what being a committer entails, but yeah I'm happy to
>> contribute.  I can't commit a lot of time to working on it, but with how
>> things went for KLL I don't think it will take a lot of time for the other
>> sketches if they are formatted in a similar manner.  Getting this library
>> integrated into numpy/scipy would be awesome, I'm sure I could get some
>> others in my field to begin using it.
>>
>> Michael
>> ------------------------------
>> *From:* Lee Rhodes <lr...@verizonmedia.com>
>> *Sent:* Saturday, May 9, 2020 5:06 PM
>> *To:* Michael Himes <mh...@knights.ucf.edu>; dev@datasketches.apache.org
>> <de...@datasketches.apache.org>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> This is just awesome!   Would you be interested in becoming a committer
>> on our project?  It is not automatic, but we could work with you to bring
>> you up to speed on the other sketches in the library.  If you could help us
>> integrate DataSketches into NumPy and possibly SciPy (not sure if this is
>> necessary) it would be a very significant contribution and we would
>> definitely want you to be part of our community!
>>
>> Thanks,
>>
>> Lee.
>>
>> On Sat, May 9, 2020 at 1:41 PM Michael Himes <mh...@knights.ucf.edu>
>> wrote:
>>
>> Hi Lee,
>>
>> Thanks for the notice, I went ahead and subscribed to the list.
>>
>> As for Jon's email, this is actually what I have currently implemented!
>> Once I finish ironing out a couple improvements, I'm going to move some
>> code around to follow the existing coding style, put it on Github, and
>> submit a pull request.
>>
>> Michael
>> ------------------------------
>> *From:* Lee Rhodes <lr...@verizonmedia.com>
>> *Sent:* Saturday, May 9, 2020 4:22 PM
>> *To:* Michael Himes <mh...@knights.ucf.edu>
>> *Subject:* Fwd: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> Hi Michael,
>> I don't think you saw this email as I doubt you are subscribed to our
>> dev@datasketches.apache.org email list.
>>
>> We would like to have you as part of our larger community, as others
>> might also have suggestions on how to move your project forward.
>> You can subscribe by sending an empty email to
>> dev-subscribe@datasketches.apache.org.
>>
>> Lee.
>>
>> ---------- Forwarded message ---------
>> From: *Jon Malkin* <jo...@gmail.com>
>> Date: Thu, May 7, 2020 at 4:11 PM
>> Subject: Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>> To: <de...@datasketches.apache.org>
>> Cc: Lee Rhodes <lr...@verizonmedia.com>, Edo Liberty <
>> edo.liberty@gmail.com>, edo@edoliberty.com <ed...@edoliberty.com>
>>
>>
>> We're using pybind11 to get a C++ interface with python (vs raw C). The
>> wrappers themselves are quite thin, but they do have examples of calling
>> functions defined in the wrapper as opposed to only the sketch object.
>>
>> I believe the easiest way to do this will be to define a pretty simple
>> C++ object and create a pybind wrapper for it.  That object would contain a
>> std::vector<kll_sketch>.  Then you'd define an update method for your
>> custom object that iterates through a numpy array and calls update() on the
>> appropriate sketch. You'd also want to define something similar for
>> get_quantile() or whatever other methods you need that iterates through
>> that vector of sketches and returns the result in a numpy array.
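
A pure-Python analogue of that object is sketched below. It is deliberately naive -- the point of the C++ version is to move this per-sketch loop across the pybind boundary -- and the class and method names here are made up for illustration:

import numpy as np
from datasketches import kll_floats_sketch

class VectorOfKllSketches:
    """d independent KLL sketches, updated one input vector at a time."""
    def __init__(self, k, d):
        self.sketches = [kll_floats_sketch(k) for _ in range(d)]

    def update(self, values):
        # values: 1-D array-like with one entry per sketch
        for sk, v in zip(self.sketches, np.asarray(values, dtype=float)):
            sk.update(float(v))

    def get_quantile(self, fraction):
        # one result per sketch, packed into a NumPy array
        return np.array([sk.get_quantile(fraction) for sk in self.sketches])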
>>
>> That's a pretty lightweight object. And then you'd use a similar thin
>> pybind wrapper around it to make it play nicely with python. Since our C++
>> library is just templates, you'd end up with a free-standing library, with
>> no requirement that the base datasketches library be involved.
>>
>>   jon
>>
>> On Thu, May 7, 2020 at 1:08 PM Michael Himes <mh...@knights.ucf.edu>
>> wrote:
>>
>> I would be happy to share whatever I come up with (if anything).  The
>> lack of a Numpy/Scipy implementation is what led me to the DataSketches
>> library, it would be very useful to myself and others if it were a part of
>> Numpy/Scipy.
>>
>> For what it's worth, passing in a Numpy array and manipulating it from
>> the C++ side is quite easy.  On the other hand, figuring out how to spawn m
>> sketches and pass the values along to that looks like it'll be more
>> challenging, there is a lot of code here and it'll take some time for me to
>> familiarize myself with it.
>>
>> Michael
>> ------------------------------
>> *From:* Lee Rhodes <lr...@verizonmedia.com>
>> *Sent:* Thursday, May 7, 2020 12:00 PM
>> *To:* Michael Himes <mh...@knights.ucf.edu>
>> *Cc:* Edo Liberty <ed...@gmail.com>; dev@datasketches.apache.org <
>> dev@datasketches.apache.org>; edo@edoliberty.com <ed...@edoliberty.com>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> If you do figure out how to do this, it would be great if you could share
>> it with us.  We would like to extend  it to other sketches and submit it as
>> an added functionality to NumPy.  I have been looking at the NumPy and
>> SciPy libraries and have not found anything close to what we have.
>>
>> Lee.
>>
>>
>> On Thu, May 7, 2020 at 7:08 AM Michael Himes <mh...@knights.ucf.edu>
>> wrote:
>>
>> Hi Lee, Jon,
>>
>> Thanks for the information.  I tried to vectorize things this morning and
>> ran into that exact problem -- since the offsets can differ, it leads to
>> slices of different lengths, which wouldn't be possible to store as a
>> single Numpy array.
>>
>> Lee, your understanding of my problem is spot on.  n vectors of size m,
>> where all m elements of each vector are a float (no NaNs or missing
>> values).  I am interested in quantiles at rank r for each of the m
>> streams.  Only 1 sketch will operate simultaneously, saving/loading the
>> sketch is not required (though it would be a nice feature), and sketches
>> would not need to be merged (no serialization/deserialization).
>>
>> Not surprisingly, it looks like your original suggestion of handling this
>> on the C++ side is the way to go.  Once I have time to dive into the code,
>> my plan is to write something that implements what you described in the
>> earlier email.
>>
>> Thanks,
>> Michael
>> ------------------------------
>> *From:* Lee Rhodes <lr...@verizonmedia.com>
>> *Sent:* Wednesday, May 6, 2020 10:43 PM
>> *To:* Michael Himes <mh...@knights.ucf.edu>
>> *Cc:* dev@datasketches.apache.org <de...@datasketches.apache.org>; Edo
>> Liberty <ed...@gmail.com>; edo@edoliberty.com <ed...@edoliberty.com>
>>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> Michael,
>>
>> One of my colleagues, Jon Malkin, pointed out that the vector-KLL will
>> not work for another reason and that is for each dimension, choosing
>> whether to delete the odd or even values in the compactor must be random
>> and independent of the other dimensions.  Otherwise you might get unwanted
>> correlation effects between the dimensions.
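
In toy NumPy terms, a vectorized compactor would therefore need one coin flip per dimension rather than a single shared one, e.g. (illustrative only, not library code; even buffer length assumed):

import numpy as np

def compact_independent(buffer, rng=np.random.default_rng()):
    # buffer: (n_items, d); n_items assumed even
    buffer = np.sort(buffer, axis=0)
    offsets = rng.integers(0, 2, size=buffer.shape[1])        # independent odd/even choice per dimension
    rows = 2 * np.arange(buffer.shape[0] // 2)[:, None] + offsets
    return np.take_along_axis(buffer, rows, axis=0)           # keep odd or even rows, per column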
>>
>> This is another argument that you should have independent compactors for
>> each dimension.  So you might as well stick with individual sketches for
>> each dimension.
>>
>> Lee.
>>
>> On Wed, May 6, 2020 at 4:39 PM Lee Rhodes <lr...@verizonmedia.com>
>> wrote:
>>
>> Michael,
>>
>> Allow me to back up for a moment to make sure I understand your problem.
>>
>> You have a large number of large vectors of the form *V_n = {x_i}:*  *n*
>> vectors of size *m, *where *x* is a *number* and *x_i* is the *i*th
>> element, or equivalently, the *i*th dimension.
>>
>> Assumptions:
>>
>>    - All vectors, *V*, are of the same size *m.*
>>    - All elements, *x_i*, are valid numbers of the same type. No missing
>>    values, and if you are using *floats*, this means no *NaN*s.
>>
>> In aggregate, the *n* vectors represent *m* *independent* distributions
>> of values.
>>
>> Your task is to be able to obtain *m* quantiles at rank *r* in a single
>> query.
>>
>> ****
>> To do this, using your idea, would require vectorization of the entire
>> sketch and not just the compactors.  The inputs are vectors, the result of
>> operations such as getQuantile(r), getQuantileUpperBound(r),
>> getQuantileLowerBound(r), are also vectors.
>>
>> This sketch will be a large data structure, which leads to more questions
>> ...
>>
>>    - Do you anticipate having many of these vectorized sketches
>>    operating simultaneously?
>>    - Is there any requirement to store and later retrieve this sketch?
>>    - Or, the nearly equivalent question: Do you require merging of these
>>    sketches (across clusters, for example)?  Which also means serialization
>>    and deserialization.
>>
>> I am concerned that this vector-quantiles sketch would be limited in the
>> sense that it may not be as widely applicable as it could be.
>>
>> Our experience with real data is that it is ugly with missing values,
>> NaN, nulls, etc.  Which means we would not be able to vectorize the
>> compactor.  Each dimension *i* would need a separate independent
>> compactor because the compaction times will vary depending on missing
>> values or NaNs in the data.
>>
>> Spacewise, I don't think having separate independent sketches for each
>> dimension would be much smaller than vectorizing the entire sketch, because
>> the internals of the existing sketch are already quite space efficient
>> leveraging compact arrays, etc.
>>
>> As a first step I would favor figuring out how to access the NumPy data
>> structure on the C++ side, having individual sketches for each
>> dimension, and doing the iterations updating the sketches in C++.   It also
>> has the advantage of leveraging code that exists and it would automatically
>> be able to leverage any improvements to the sketch code over time.  In
>> addition, it could be a prototype of how to integrate other sketches into
>> the NumPy ecosystem.
>>
>> A fully vectorized sketch would be a separate implementation and would
>> not be able to take advantage of these points.
>>
>> Lee.
>>
>>
>> On Wed, May 6, 2020 at 2:47 PM Michael Himes <mh...@knights.ucf.edu>
>> wrote:
>>
>> Hi Lee,
>>
>> I don't think there is a problem with the DataSketches library, just that
>> it doesn't support what I am trying to do -- looking in the documentation,
>> it only supports streams of ints or floats, and those situations work fine
>> for me.  Here's what I did:
>> - began with the KLL test .py file:
>> https://github.com/apache/incubator-datasketches-cpp/blob/master/python/tests/kll_test.py
>> - replaced line 30 with kll.update(np.ones(10) * randn())  to have a
>> Numpy array of 10 identical values.
>> - ran the code
>>
>> This leads to the following error, as expected:
>> TypeError: update(): incompatible function arguments. The following
>> argument types are supported:
>>     1. (self: datasketches.kll_floats_sketch, item: float) -> None
>>
>> Invoked with: <datasketches.kll_floats_sketch object at 0x7f1e128989d0>,
>> array([-1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424,
>>        -1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424])
>>
>> It's not coded to support Numpy arrays, therefore it complains.  What I
>> would ideally like to have happen in this scenario is it would treat each
>> element in the array as a separate stream.  Then, later when getting a
>> given quantile, it would give 10 values, one for each stream.  I don't see
>> an easy approach to implementing this on the Python side besides a very
>> slow iterative approach, and admittedly my C++ is quite rusty so I haven't
>> looked into the codebase to see how I might modify things there to support
>> this functionality.
>>
>> Re: the streaming-quantiles code being easily modified, I believe the
>> only necessary changes would be changing the Compactor class to be a
>> subclass of numpy.ndarray, rather than list, and implementing methods for
>> the list-specific methods that are used, like .append().  Then, it isn't
>> necessary to loop over the streams since we can make use of Numpy's
>> broadcasting, which will handle the looping in its C++ code, as you
>> mentioned.  I'll work on this and see if it really is as straight-forward
>> as it seems.
>>
>> If you have any advice on how to use DataSketches for my problem, I'm
>> certainly open to that.
>>
>> Thanks,
>> Michael
>> ------------------------------
>> *From:* Lee Rhodes <lr...@verizonmedia.com>
>> *Sent:* Wednesday, May 6, 2020 4:37 PM
>> *To:* Michael Himes <mh...@knights.ucf.edu>; dev@datasketches.apache.org
>> <de...@datasketches.apache.org>
>> *Cc:* Edo Liberty <ed...@gmail.com>; edo@edoliberty.com <
>> edo@edoliberty.com>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> Michael,
>>
>> Thank you for considering the DataSketches library.   I am adding this
>> thread to our dev@datasketches.apache.org so that our whole team can
>> contribute to finding a solution for you.
>>
>> WRT the error you experienced, please help us help you by sharing with us
>> what the exact error was.
>>
>> We are about to release a major upgrade to the DataSketches C++/Python
>> product in the next few weeks.  We have fixed a number of stability issues
>> and bugs, which may solve the problem.  Nonetheless, we want to work with
>> you to get your problem solved.
>>
>> Updating 1e5 sketches in a system is not a problem in Java or C++.   We
>> have real-time systems today that generate and process over 1e9 sketches
>> every day.  Unfortunately our experience tells us that looping in Python
>> code will be 10 to 100 times slower than Java or C++.  This is because the
>> code would have to switch from Python to C++ for every vector element.
>>
>> By comparison, the streaming-quantiles code could be easily modified to
>> use Numpy arrays and operate on vectors.
>>
>>
>> I would like to understand more about what you have in mind that would be
>> "easily modified".
>>
>> NumPy achieves its speed performance by doing all of the matrix
>> operations in pre-compiled C++ code.  To achieve best performance, we would
>> want to read and loop through the NumPy data structure on the C++ side
>> leveraging the C++ DataSketches library directly.  I am not sure what would
>> be involved to actually accomplish that.
>>
>> But first we need to get your Python + NumPy code working correctly with
>> our library so we can find out what its actual performance is.
>>
>> Cheers,
>>
>> Lee.
>>
>>
>>
>>
>>
>> On Wed, May 6, 2020 at 12:10 PM Michael Himes <mh...@knights.ucf.edu>
>> wrote:
>>
>> Hi Edo, Lee,
>>
>> Thanks for the prompt response.  I looked at the datasketches library,
>> and while it seems to have a lot more features, it looks like it'll be a
>> lot more difficult to get it to work for my desired use case.
>>
>> My problem is that I need quantiles for each element of a vector (length
>> on the order of 1e4 -- 1e5), for some finite stream of vectors (on the
>> order of 1e6 -- 1e8).  I tried using datasketches's KLL with Numpy arrays,
>> but it throws an error, so it doesn't seem like datasketches handles this
>> situation currently.
>>
>> To use datasketches, I think I would need to instantiate 1 object per
>> vector element, and I suspect this will slow things down considerably due
>> to iterating over the objects when each vector is processed.  By
>> comparison, the streaming-quantiles code could be easily modified to use
>> Numpy arrays and operate on vectors.  I ran a few unit tests on both codes
>> and found equivalent behavior, as expected.
>>
>> Do you have any recommendation(s) for this situation?  Are there known
>> limitations of the streaming-quantiles code that would cause issues for my
>> use case?  Are the other methods offered in datasketches 'better' than the
>> KLL implemented in streaming-quantiles?  I'm quite out of my area of
>> expertise, so I appreciate any advice you can offer, and I will of course
>> acknowledge it in the publication.
>>
>> Best,
>> Michael
>>
>> ------------------------------
>> *From:* Edo Liberty <ed...@gmail.com>
>> *Sent:* Tuesday, May 5, 2020 8:09 PM
>> *To:* Lee Rhodes <lr...@verizonmedia.com>; Michael Himes <
>> mhimes@knights.ucf.edu>
>> *Cc:* edo@edoliberty.com <ed...@edoliberty.com>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> +Lee
>>
>> Hi Michael, Thanks for reaching out.
>> While you can certainly do that, I recommend using the python-Binded
>> datasketches library. It will be more robust, faster, and bug free than my
>> code :)
>>
>> On Tue, May 5, 2020 at 14:11 Michael Himes <mh...@knights.ucf.edu>
>> wrote:
>>
>> Hi Edo,
>>
>> I'm currently working on a Python package for
>> machine-learning-accelerated exoplanet modeling.  It is free and open
>> source (see here if you're curious https://github.com/exosports/HOMER),
>> and it's meant purely for reproducible academic research.
>>
>> I'm adding some new features to the software, and one of them requires
>> computing quantiles for a data set that cannot fit into memory.  After
>> searching around for different methods to do this, your KLL method seemed
>> to be a good option in terms of speed and space requirements.
>>
>> Rather than reinvent the wheel and code my own implementation of the
>> method from scratch, I was wondering if you'd be willing to allow me to use
>> your code?  I don't see a license, so I wanted to make sure you're okay
>> with this.  I could implement it as a submodule within my repo, or I could
>> only include the kll.py file and add some additional comments pointing to
>> your repo and such, whichever you prefer.
>>
>> Best,
>> Michael
>>
>> --
>> From my cell phone.
>>
>>

Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Posted by Jon Malkin <jo...@gmail.com>.
Thanks for that!

We discussed things a bit on the ASF slack dev channel (datasketches-dev)
and we'll go with vector_of_kll_sketches as the c++ object name. Probably
something similar in python. So gotta do that, and then clean up unit test
names. But it's in pretty good shape so far.
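
Assuming the Python class ends up mirroring the C++ rename (only the C++ name is settled here, so the import below is hypothetical), usage would stay essentially the same as with the current branch:

import numpy as np
from datasketches import vector_of_kll_sketches  # assumed Python name, not yet final

kll = vector_of_kll_sketches(160, 3)
for _ in range(1000):
    kll.update(np.random.randn(3))
print(kll.get_quantiles([0.5]))  # quantiles/rank/pmf/cdf queries per the earlier message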

  jon

On Tue, May 26, 2020 at 7:57 AM Michael Himes <mh...@knights.ucf.edu>
wrote:

> That's a great motto to code by!
>
> I adapted the existing unit tests for kll_sketch to work for the new
> kll_sketches class, and everything seems to be working as intended.  Some
> things are not implemented -- merging, and the normalized_rank_error method
> (note that there is the get_normalized_rank_error static method) -- and are
> therefore not tested.  Once they are implemented into the class, then those
> tests can be added.
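>
> For a flavor, the adapted tests look roughly like this (a minimal sketch;
> the real tests mirror the existing kll_sketch suite, and the class name is
> still as in my branch):
>
> import unittest
> import numpy as np
> from datasketches import kll_sketches   # class name as in the current branch
>
> class VectorKllTest(unittest.TestCase):
>     def test_quantile_shape(self):
>         k, d, n = 200, 3, 1000
>         kll = kll_sketches(k, d)
>         for _ in range(n):
>             kll.update(np.random.randn(d))
>         q = kll.get_quantiles([0.5])     # one fraction, evaluated per sketch
>         self.assertEqual(q.size, d)
>
> if __name__ == '__main__':
>     unittest.main()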
>
> I've submitted a pull request, let me know if there are any other tests
> you'd like before it's considered tested & working.
>
> Michael
> ------------------------------
> *From:* Jon Malkin <jo...@gmail.com>
> *Sent:* Tuesday, May 26, 2020 1:53 AM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> I think it now works for quantiles, rank, pmf, and cdf.
>
> This exercise is a good example of why my colleague operates by the motto
> that if it isn't tested, it's broken. In very much related news, we need
> unit tests for this thing, in either C++ or python (probably the latter
> unless we move it into the core C++ part of the repo).
>
>   jon
>
> On Mon, May 25, 2020 at 2:06 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Ah gosh, that was silly on my part.
>
> So, I ran the previous code without that silly mistake, then called
> kll.get_quantiles(0.5) and it threw this error:
>
> TypeError: get_quantiles(): incompatible function arguments. The following
> argument types are supported:
>     1. (self: datasketches.kll_floatarray_sketches, fractions:
> List[float], isk: numpy.ndarray[int32] = -1) -> array
>
> Invoked with: <datasketches.kll_floatarray_sketches object at
> 0x7f610ce7de30>, 0.5
>
> I also tried kll.get_quantiles([0.5]) and the Numpy array equivalent, and
> it throws this error:
>
> ValueError: array has incorrect number of dimensions: 0; expected 1
>
> This error happens even when I do kll.get_quantiles([0.5, 0.7]) or the
> Numpy array equivalent, even though it has 1 dimension, not 0.
>
> Michael
> ------------------------------
> *From:* Jon Malkin <jo...@gmail.com>
> *Sent:* Monday, May 25, 2020 4:53 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> That's the range() command complaining -- 1e6 is a float, but range wants
> an int. It worked if I instead changed the line to
> for i in range(int(1e6)):
>
>   jon
>
> On Mon, May 25, 2020 at 1:36 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Jon,
>
> Just got around to testing it out.  Maybe I am doing something wrong here,
> but I can't get the code to work correctly.  Here's the code:
>
> import numpy as np
> from datasketches import kll_floatarray_sketches
> k = 160
> d = 3
> kll = kll_floatarray_sketches(k, d)
> for i in range(1e6):
>   kll.update(np.random.randn(d))
>
> And here's the error:
>
> TypeError: 'float' object cannot be interpreted as an integer
>
> Seems like the inputs have changed, but the inputs in the code look pretty
> similar.  Can you point out what I'm doing wrong here?
>
> Michael
> ------------------------------
> *From:* Jon Malkin <jo...@gmail.com>
> *Sent:* Friday, May 22, 2020 6:21 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Hi Michael,
>
> My default is to treat an input vector x as a column vector
> -- the generic quadratic form x^T A x assumes that, for instance. But might
> be an engineering thing. Following your approach for now and eventually we
> can debate whether to transpose the matrix if one dimension matches the
> number of sketches in the object but not the expected one.
>
> Anyway, I looked more at the docs and see them using unchecked references
> (after doing a bounds check) so I switched to that, and then I added in a
> check for c-style vs fortran-style indexing so that I believe it'll have
> the inner loop over the native dimension. In theory it'll walk linearly
> through the matrix. That or I got it exactly backwards and am thrashing
> some cache level, one of the two :D
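>
> For anyone poking at this from the Python side, the layout issue is easy to
> see (and to force) with plain numpy, independent of the sketch code:
>
> import numpy as np
>
> a = np.random.randn(1000, 50)      # C-ordered (row-major) by default
> print(a.flags['C_CONTIGUOUS'])     # True
> b = a.T                            # transpose is just a view: now F-ordered
> print(b.flags['C_CONTIGUOUS'])     # False -- a raw pointer walk hits the "wrong" order
> c = np.ascontiguousarray(b)        # explicit copy back into C order if needed
> print(c.flags['C_CONTIGUOUS'])     # True
>
> so np.ascontiguousarray() is always available as a workaround if the
> C++-side handling turns out to be backwards for some layout.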
>
> If you have some time, please check out the branch and play with it for a
> bit to ensure it's still behaving as you expect. Then we can figure out
> some relevant unit tests,
>
>   jon
>
>
>
> On Fri, May 22, 2020 at 7:06 AM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Jon,
>
> Those changes sound great, as long as the data is being accessed
> correctly. The pybind docs warn about accessing data through the array_t
> object since it's not guaranteed to be contiguous in memory.  Typically,
> they demonstrate accessing it through the buffer, which I followed.  But if
> this is an unnecessary step, then great.
>
> As for the 2D case, here is my line of thinking.  For 1D, we have a single
> row with d values.  So for 2D, we'd have n rows with d values, (n x d).  I
> believe that is how I coded it, but it's possible I flipped the dimensions.
>
> Michael
>
> ------------------------------
> *From:* Jon Malkin <jo...@gmail.com>
> *Sent:* Thursday, May 21, 2020 7:17 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Michael,
>
> I've restructured the object to be an actual C++ object with proper
> methods. And then I've gotten rid of all the casts to buffer in favor of
> just using the py::array_t<> that's passed in. That removes casting
> everything to double, and allows for range checks. Now an attempt to access
> sketch 7 in a 5-d array doesn't just segfault :)
>
> Looking at pybind docs a bit more, it seems there are no hard guarantees
> on data layout in memory with numpy arrays -- if you transpose one, walking
> through with a pointer will return items in the wrong order. So update()
> ends up using items.at() instead (more on that in a moment). The whole thing is probably also
> copying values around more than necessary. Anyway, we can look at ways to
> optimize such things eventually, but for now I'm working on ensuring
> correctness and at least somewhat graceful failure.
>
> Anyway, item input order. If we have 1-d input, we implicitly assume we
> want d updates, one for each dimension in the object. It seems like the
> default for numpy is row-major order, which makes sense given C beneath the
> hood. But for inputting n points at a time, do you expect the matrix to be
> (d x n) or (n x d)?
>
>   jon
>
> On Tue, May 19, 2020, 5:20 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Re: the template type A, I set that for the Python array data type.  A
> Python float is 64 bits, so that is a C++ double.  I thought it was
> necessary to set the py::array_t data type since I think it's a template,
> but I could be mistaken.
>
> Michael
>
> ------------------------------
> *From:* leerho <le...@gmail.com>
> *Sent:* Tuesday, May 19, 2020 7:46 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Excellent work!
>
> On Tue, May 19, 2020 at 4:04 PM Jon Malkin <jo...@gmail.com> wrote:
>
> I also used k=160, so in this case we matched nicely. And the bunches of
> 2^5 or 2^7 you were testing are exactly what I meant when referring to
> batched inputs. So that's good news.
>
> I'll take a more careful look through the code -- there was something with
> update using arrays of templated type A which was always cast to double,
> for instance. But this is certainly promising.
>
>   jon
>
> On Tue, May 19, 2020 at 3:32 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Great tests (especially with the ordering), Jon!
>
> I did some scaling tests for dimensionality (1, 10, and 100), and this is
> where I think the Numpy version shows its benefits.  I performed a test
> similar to your setup:
> - each sketch has k = 160 (unsure what you used for this value, if it
> matters)
> - 2^25 draws from a normalized Gaussian distribution (numpy.random.normal)
> - get_quantiles(0.5)
>
> d=1    -- 84 s (this is the 123 s case you ran)
> d=10   -- 88 s
> d=100  --  294 s
> d=1000 -- 2298 s (did this one for fun, but there is a lot of variability
> in runtime)
>
> Note that I did not use a single-value method, just the Numpy version.
> Also, I checked the compute cost of the Python loop, and it's about 1
> second, so most of that ~80 seconds is the communication between Python and
> C++.  The scaling relation looks to be better than linear, but there need
> to be a few more tests here to really determine that.
>
> But, as Lee pointed out, there is non-negligible overhead from crossing
> the bridge between Python and C++.  It's small, but when doing it 2^25
> times it adds up.  The Numpy implementation allows you to cross that bridge
> much less often, albeit at the cost of some extra time programming that
> part.  If I set up a queue that holds 2^5 values and then updates it, it's
> quite a bit better.  Here are the results for the same dimensions as before:
>
> d=1   -- 8 s
> d=10  -- 31 s
> d=100 -- 257 s
>
> So, even with a small queue of 32 values, we see that a single sketch
> using kll_sketches is faster than a kll_sketch by a factor of 2-3.  And
> with the batch set to 2^7 values (this is how I use it in my project):
> d=1   -- 4.2 s
> d=10  -- 27 s
> d=100 -- 251 s
>
> The speed gain doesn't seem to scale with dimensionality, but I think that
> has more to do with the compute overhead of generating the data since Numpy
> tends to be faster when working in 1D vs multiple dimensions.  But we can
> see that it's possible to get runtimes much closer to C++ runtimes than
> would be expected.
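>
> For reference, the queue/batching I'm describing is roughly the following
> (sketch only; the class and method names are as in my branch, and the batch
> size is whatever we pick):
>
> import numpy as np
> from datasketches import kll_sketches
>
> k, d, batch = 160, 100, 2**7
> kll = kll_sketches(k, d)
>
> buf = np.empty((batch, d))
> filled = 0
> for _ in range(int(1e6)):
>     buf[filled] = np.random.randn(d)   # one incoming vector
>     filled += 1
>     if filled == batch:                # cross the Python/C++ boundary once per batch
>         kll.update(buf)
>         filled = 0
> if filled:                             # flush any partial batch at the end
>     kll.update(buf[:filled])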
>
> Michael
> ------------------------------
> *From:* Jon Malkin <jo...@gmail.com>
> *Sent:* Tuesday, May 19, 2020 4:58 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Well, one thought was maybe we could always use the vectorized kll in
> python and make it (relatively) easy to have it work with only 1 dimension.
> It looks like there's still a non-trivial performance hit from that. But
> wow.. I realized I could try something simple like reversing the
> declaration order of single-update vs vector-update in the wrapper class.
> And that dropped it to 35s!
>
> With that, it may be worth exploring a unified wrapper that handles single
> items or vectors.
>
>   jon
>
> On Tue, May 19, 2020 at 1:52 PM leerho <le...@gmail.com> wrote:
>
> We had a similar issue in Java trying to use JNI to access C code.  Every
> transition across the "boundary" between Java and C took from 10 to 100
> microseconds.  This made the JNI option pretty useless from our
> standpoint.
>
> I don't know python that well, but I could well imagine that there may be
> a similar issue here in moving data between Python and C++.
>
> That being said, compared to brute-force computation of these types of
> queries in Python, using even these (what we consider slow-performing)
> sketches in Python may still be a huge win.
>
> Lee.
>
>
>
> On Tue, May 19, 2020 at 1:28 PM Jon Malkin <jo...@gmail.com> wrote:
>
> I tried comparing the performance of the existing floats sketch vs the new
> thing with a single dimension. And then I made a second update method that
> handles a single item rather than creating an array of length 1 each time.
> Otherwise, the scripts were as identical as possible. I fed in 2^25
> gaussian-distributed values and queried for the median to force some
> computation on the sketch. I think get_quantile(0.5) vs
> get_quantiles(0.5)[0][0] was the only difference.
>
> Existing kll_floats_sketch: 31s
> kll_floatarray_sketches: 123s
> with single-item update: 80s
>
> Same test in C++: 1.7s  (I can get it to 1.4s but that's using a worse RNG
> so this seemed more fair)
>
> I didn't try anything with trying to batch updates, even though in theory
> the new object can support that. This was more a test to see the
> performance impact of using it for all kll sketches.
>
> At some level, if you're already ok taking the speed hit for python vs C++
> then maybe it doesn't matter. But >2x still seems significant to me.
>
>   jon
>
> On Thu, May 14, 2020 at 6:54 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Great, I'll be submitting the pull request shortly.  The codebase I'm
> working with doesn't have any of the changes made in the past week or so;
> hopefully that isn't too much of a hassle to merge.
>
> As an aside, my employer encourages us to contribute code to libraries
> like this, so I'm happy to work on additional features for the Python
> interface as needed.
>
> Michael
> ------------------------------
> *From:* Jon Malkin <jo...@gmail.com>
> *Sent:* Thursday, May 14, 2020 6:56 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> We've been polishing things up for a release, so that was one of several
> things that we fixed over the last several days. Thank you for finding it!
>
> Anyway, if you're generally happy with the state of things (and are
> allowed to under any employment terms), I'd encourage you to create a pull
> request to merge your changes into the main repo. It doesn't need to be
> perfect as we can always make changes as part of the PR review or
> post-merge.
>
> Thanks,
>   jon
>
>
> On Mon, May 11, 2020 at 2:25 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Thanks for taking a look, Jon.
>
> I pushed an update that address 2 & 4.
>
> #3 is actually something I had a question about. I've tested passing
> numpy.nan into the update function, and it doesn't appear to break anything
> (min, max, etc all still work correctly).  However, the reported number of
> items per sketch counts the nan entries.  Is this the expected behavior, or
> should the get_n() method return a number that does not count the nans it
> has seen?  I expected the latter, so I'm worried that numpy's nan is being
> treated differently.
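>
> For reference, this is essentially what I ran (class and method names as in
> my branch):
>
> import numpy as np
> from datasketches import kll_sketches
>
> kll = kll_sketches(160, 3)
> kll.update(np.array([1.0, np.nan, 2.0]))
> kll.update(np.array([3.0, 4.0, 5.0]))
> print(kll.get_min_values(), kll.get_max_values())  # still look correct
> print(kll.get_n())                                 # counts the nan update as well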
>
> Michael
> ------------------------------
> *From:* Jon Malkin <jo...@gmail.com>
> *Sent:* Monday, May 11, 2020 4:32 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> I didn't look in super close detail, but the code overall looks pretty
> good. Comments are below.
>
> Note that not all of these necessarily need changes or replies. I'm just
> trying to document things we'll want to think about for keeping the library
> general-purpose (and we can always make changes after merging, of course).
>
> 1. I worry the name kll_sketches is confusingly similar to kll_sketch.
> Maybe vector_kll_sketches? But if there's a way to extend KLL in the future
> to operate on an entire vector at a time (vs treating each dimension
> independently) that'd become confusing. I think an inherently vectorized
> version would be a very different beast, but I always worry I'm not being
> imaginative enough. If merging into the Apache codebase, I'd probably wait
> to see what the file looks like with the renaming before a final decision
> on moving to its own file.
>
> 2. What happens if the input to update() has >2 dimensions? If that'd be
> invalid, we should explicitly check and complain. If it'll Do The Right
> Thing by operating on the first 2 dimensions (meaning correct indices)
> that's fine, but otherwise should probably complain.
>
> 3. Can this handle sparse input vectors? Not sure how important that is in
> general, even if your project doesn't require it. kll_sketch will ignore
> NaNs, so those appearing would mean the number of items per sketch can
> already differ.
>
> 4. I'd probably eat the very slightly increased space and go with 32 bits
> for the number of dimensions (aka number of sketches). If trying to look at
> a distribution of values for some machine learning application, it'd be
> easy to overflow 65k dimensions for some tasks.
>
> 5. I imagine you've realized that it's easiest to do unit tests from
> python in this case. That's another advantage of having this live in the
> wrapper.
>
> 6. Finally, that assert issue is already obsolete :). Asserts were
> converted to if/throw exceptions late last week. It'll be flagged as a
> conflict in merging, so no worries for now.
>
> Looking good at this point. And as I said, not all of these need changes
> or comments from you.
>
>   jon
>
> On Mon, May 11, 2020 at 7:09 AM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Understood, I went ahead and moved the new class to the kll_wrapper.cpp
> file -- I'll leave it to you to decide if it's better as its own file.
>
> Also, while gcc 7.4.0 compiles the code without issue, using gcc 7.5.0
> throws errors regarding the assert calls in kll_sketch_impl.hpp.  I added
> an include of assert.h there and then it compiled without issue.  It's
> possible that other compilers will also complain about that, so maybe this
> is a good update to the main branch.
>
> Michael
> ------------------------------
> *From:* Jon Malkin <jo...@gmail.com>
> *Sent:* Sunday, May 10, 2020 10:47 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> My only comment without having looked at actual code is that the new class
> would be more appropriate in the python wrapper. Maybe even drop it in as
> its own file, as that would decrease recompile time a bit when debugging
> (that's pybind's suggestion, anyway). Probably not a huge difference with
> how light these wrappers are.
>
> If this is something that becomes widely used, to where we look at pushing
> it into the base library, we'd look at whether we could share any data
> across sketches. But we're far from that point currently. It'd be nice to
> get to the point where we need to consider that.
>
>   jon
>
> On Sun, May 10, 2020, 7:33 PM leerho <le...@gmail.com> wrote:
>
> Michael,  this has been a great interchange and certainly will allow us to
> move forward more quickly.
>
> Thank you for working on this on a Mother's Day Sunday!
>
> I'm sure Alex and Jon may have more questions, when they get a chance to
> look at it starting tomorrow.
>
> Cheers, and be safe and well!
>
> Lee.
>
> On Sun, May 10, 2020 at 6:25 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Re: testing, so far I've just done glorified unit tests for uniform and
> normal distributions of varying sizes.  I plan to do some timing tests vs
> the existing single-sketch Python class to see how it compares for 1, 10,
> and 100 streams.
>
> 1. That makes sense.  One option to allow full Numpy compatibility but
> without requiring a Python user to use Numpy would be to return everything
> as lists, rather than Numpy arrays.  Numpy users could then convert those
> lists into arrays, and non-Numpy users would be unaffected (aside from
> needing the pybind11/numpy.h header).  Alternatively, some flag could be
> set when instantiating the object that would control whether things are
> returned as lists or arrays, though this still requires the numpy.h header
> file.
>
> 2. I didn't change the kll_sketch code, I only defined a new (wrapper)
> class called kll_sketches, which spawns a user-specified number of
> sketches.  Each of those sketches are kll_sketch objects and uses all of
> the existing code for that.  For fast execution in Python, the parallel
> sketches must be spawned in C++, but the existing Python object could only
> spawn a single sketch since it wraps the kll_sketch class.  Perhaps the
> kll_sketches class would be better placed in the python/src/kll_wrapper.cpp
> file?  I suppose you wouldn't need this class if you weren't using Python.
>
> 3. Yes, SerDe is very straight-forward here.  I've marked some stuff as
> todo's, and that is one of them -- the plan is to do like you described and
> call the relevant kll_sketch method on each of the sketches and return that
> to Python in a sensible format.  For deserialization, it would just iterate
> through them and load them into the kll_sketches object.  I don't require
> it for my project, so I didn't bother to wrap that yet -- I'll take a look
> sometime this week after I finish my work for the day, shouldn't take long
> to do.
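>
> Concretely, I expect it to amount to something like the following at the
> Python level, just done inside the wrapper (a sketch; the serialize /
> deserialize names are assumed from the existing single-sketch binding):
>
> import numpy as np
> from datasketches import kll_floats_sketch
>
> sketches = [kll_floats_sketch(160) for _ in range(3)]
> for s in sketches:
>     for _ in range(100):
>         s.update(float(np.random.randn()))
>
> blobs = [s.serialize() for s in sketches]                     # one bytes object per sketch
> restored = [kll_floats_sketch.deserialize(b) for b in blobs]  # round-trip each one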
>
> 4. That makes sense.  Does using Numpy complicate that at all?  My thought
> is that since under the hood everything is using the existing kll_sketch
> class, it would have full compatibility with the rest of the library (once
> SerDe is added in).
>
> Michael
> ------------------------------
> *From:* leerho <le...@gmail.com>
> *Sent:* Sunday, May 10, 2020 8:42 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Thanks for the link to your code.  My colleagues, Jon and Alex, will take
> a closer look this next week.  They wrote this code so they are much closer
> to it than I.
>
> What you have done so far makes sense for you as you want to get this
> working in the NumPy environment as quickly as possible.  As soon as we
> start thinking about incorporating this into our library other concerns
> become important.
>
> 1. Adding API calls is the recommended way to add functionality (like
> NumPy) to a library.  We cannot change API calls in a way that is only
> useful with NumPy, because it would seriously impact other users of the
> library that don't need NumPy.  If both sets of calls cannot simultaneously
> exist in the same sketch API, then we need to consider other alternatives.
>
> 2.  Based on our previous discussions, I didn't envision that you would
> have to change the kll_sketch code itself other than perhaps a "wrapper"
> class that enables vectorized input to a vector of sketches and a
> vectorized get result that creates a vector result from a vector of
> sketches.  This would isolate the changes you need for NumPy from the
> sketch itself.  This is also much easier to support, maintain and debug.
>
> 3. If you don't change the internals of the sketch then SerDe becomes
> pretty straightforward. I don't know if you need a single serialization
> that represents a full vector of sketches,  but if you do, then I would
> just iterate over the individual serdes and figure out how to package it.
> I really don't think you want to have to rewrite this low-level stuff.
>
> 4. Binary compatibility is critically important for us and I think will be
> important for you as well.  There are two dimensions of binary
> compatibility: history and language.  This means that a kll sketch
> serialized from Java can be successfully read by C++ and vice versa.
> Similarly, a kll sketch serialized today will be able to be read many years
> from now.     Another aspect of this would mean being able to collect, say,
> 100 sketches that were not created using the NumPy version, and being able
> to put them together in a NumPy vector, and vice versa.
>
> I hope all of this make sense to you.
>
> Cheers,
>
> Lee.
>
>
>
> On Sun, May 10, 2020 at 4:21 PM leerho <le...@gmail.com> wrote:
>
> Michael,
> This is great!  What testing have you been able to do so far?
>
>
> On Sun, May 10, 2020 at 3:31 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Lee,
>
> Thanks for all of that information, it's quite helpful to get a better
> understanding of things.
>
> I've put the code on Github if you'd like to take a look:
> https://github.com/mdhimes/incubator-datasketches-cpp
>
> Changes are
> - new class in kll/include/kll_sketch.hpp, w/ associated constructor in
> kll/include/kll_sketch_impl.hpp.  This class spawns a specified number of
> sketches.
> - new Python interface functions in python/src/kll_wrapper.cpp
>
> The only new dependency introduced is the pybind11/numpy.h header file.
> The new Numpy-compatible Python classes retain identical functionality to
> the existing classes (with minor changes to method names, e.g.,
> get_min_value --> get_min_values), except that I have not yet implemented
> merging or (de)serialization.  These would be straight-forward to
> implement, if needed.
>
> Re: characterization tests, I'll take a look at those tests you linked to
> and see about running them, time and compute resources permitting.
>
> Michael
> ------------------------------
> *From:* leerho <le...@gmail.com>
> *Sent:* Sunday, May 10, 2020 5:32 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Michael,
>
> Is there a place on GitHub somewhere where I could look at your code so
> far?  The reason I ask, is before you do a PR, we would like to determine
> where a contribution such as this should be placed.
>
> Our library is split up among different repositories, determined by
> language and dependencies.  This keeps the user downloads smaller and more
> focused.   We have two library repos for the core sketch algorithms, one
> for Java and one for C++/Python, where the dependencies are very lean,
> which simplifies integration into other systems.  We have separate repos
> for adaptors, which depend on one of the core repos. On the Java side, we
> have separate repos for adaptors for Apache Hive and Apache Pig, as the
> dependencies for each of these are quite large.  For C++, we have a
> dedicated repo for the adaptors for PostgreSQL.
>
> Some of our adaptors are hosted with the target system.  For example, our
> Druid adaptors were contributed directly into Apache Druid.
>
> I assume your code has dependencies on Python, NumPy and DataSketches-cpp.
> It is not clear to me at the moment whether we should create a separate
> repo for this or have a separate group of directories in our cpp repo.
>
> ****
> We have a separate repo for our characterization code, which is not
> formally "released" as an Apache release.  It exists because we want others
> to be able to reproduce (or challenge) our claims of speed performance or
> accuracy.  It is the one repo where we have all languages and many
> different dependencies.  The coding style is not as rigorous or as well
> documented as our repos that do have formal releases.
>
> Characterization testing is distinctly different from Unit Tests, which
> basically check all the main code paths and make sure that the program
> works as it should.  The key metric is code coverage and Unit Tests should
> be fast as it is run on every check-in of new code.  Characterization is
> also different from Integration Testing, which is testing how well the code
> works when integrated into larger systems.
>
> Characterization tests are unique to our kind of library. Because our
> algorithms are probabilistic in nature, in order to verify accuracy or
> speed performance we need to run many thousands of trials to eliminate
> statistical noise in the results.  And when the data is large, this can
> take a long time.  You can peruse our website for many examples as all the
> plots result from various characterization studies.  What appears on the
> website is but a small fraction of all the testing we have done.
>
> There are no "standard" tests as every sketch is different so we have to
> decide what is important to measure for a particular sketch, but the basic
> groups are *speed* and *accuracy*.
>
> For speed there are many possible measurements, but the basic ones are
> update speed, merge speed, Serialization / Deserialization speed, get
> estimate or get result speeds.
>
> For accuracy we want to validate that the sketch is performing within the
> bounds of the theoretical error distribution.  We want to measure this
> accuracy in the context of a stand-alone, purely streaming sketch and also
> in the context of merging many sketches together.
>
> We also try to do these same tests comparing the results against other
> alternatives users might have.  We have performed these same
> characterizations on other publicly available sketches as well as against
> traditional, brute-force approaches to solving the same problem.
>
> For the solution you have developed, we would depend on you to decide what
> properties would be most important to characterize for users of this
> solution.  It should be very similar to what you would write in a paper
> describing this solution;  you want to convince the reader that this is
> very useful and why.
>
> Since the first sketch you have leveraged is the KLL quantiles sketch, I
> would think you would want some characterizations similar to what we did
> for our studies
> <https://datasketches.apache.org/docs/Quantiles/KLLSketch.html>
> comparing our older quantiles sketch and the KLL sketch.
>
> ****
> For the Java characterization tests, we have "standardized" on having
> small configuration files which define the key parameters of the test.
> These are simple text files
> <https://github.com/apache/incubator-datasketches-characterization/tree/master/src/main/resources>
> of key-value pairs.  We don't have any centralized definition of these
> pairs, just that they are human readable and intelligible.  They are
> different for each type of sketch.
>
> For the C++ tests, we don't have a collection of config files yet (this is
> one of our TODOs), but the same kind of parameters are set in the code
> itself.
>
> We will likely want to set up a separate directory for your
> characterization tests.
>
> I hope you find this helpful.
>
> Cheers,
>
> Lee.
>
> On Sun, May 10, 2020 at 10:05 AM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> The code is in a good state now.  It can take individual values, lists, or
> Numpy arrays as input, and it returns back Numpy arrays.  There are some
> additional features, like being able to specify which sketches the user
> wants to, e.g., get quantiles for.
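>
> For example, the per-sketch selection looks like this (argument names as in
> my branch, so treat the details as illustrative):
>
> import numpy as np
> from datasketches import kll_sketches
>
> kll = kll_sketches(160, 5)
> for _ in range(1000):
>     kll.update(np.random.randn(5))
>
> q_all  = kll.get_quantiles([0.25, 0.5, 0.75])      # quantiles from every sketch
> q_some = kll.get_quantiles([0.5],                  # only sketches 0 and 2
>                            isk=np.array([0, 2], dtype=np.int32))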
>
> But, I have only done minor testing with uniform and normal
> distributions.  I'd like to put it through more extensive testing (and some
> documentation) before releasing it, and it sounds like your
> characterization tests are the way to go -- it's not science if it's not
> reproducible!  Is there a standard set of tests for this purpose?  If not,
> are there standard tests that have been used for the existing codebase?
>
> Michael
> ------------------------------
> *From:* leerho <le...@gmail.com>
> *Sent:* Saturday, May 9, 2020 7:21 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> This is great.  The first step is to get your project working!  Once you
> think you are ready, it would be really useful if you could do some
> characterization testing in the NumPy environment. Characterization tests
> are what we run to fully understand how a sketch performs over a range of
> parameters and using thousands to millions of trials.  You can see some of
> the accuracy and speed performance plots of various sketches on our
> website.  Sometimes these can take hours to run.  We typically use
> synthetic data to drive our characterization tests to make them
> reproducible.
>
> Real data can also be used and one comparison test I would recommend is
> comparing how long it takes to get approximate results using sketches
> versus how long it would take to get exact results using brute force
> methods.  The bigger the data set is the better :)
>
> We don't have much experience with NumPy so this will be a new environment
> for us.  But before you get too deep into this please get us involved.  We
> have been characterizing these streaming algorithms for a number of years,
> and would like to help you.
>
> Cheers,
>
> Lee.
>
> On Sat, May 9, 2020 at 2:18 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> I'm not quite sure what being a committer entails, but yeah I'm happy to
> contribute.  I can't commit a lot of time to working on it, but with how
> things went for KLL I don't think it will take a lot of time for the other
> sketches if they are formatted in a similar manner.  Getting this library
> integrated into numpy/scipy would be awesome, I'm sure I could get some
> others in my field to begin using it.
>
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Saturday, May 9, 2020 5:06 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>; dev@datasketches.apache.org
> <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> This is just awesome!   Would you be interested in becoming a committer on
> our project?  It is not automatic, but we could work with you to bring you
> up to speed on the other sketches in the library.  If you could help us
> integrate DataSketches into NumPy and possibly SciPy (not sure if this is
> necessary) it would be a very significant contribution and we would
> definitely want you to be part of our community!
>
> Thanks,
>
> Lee.
>
> On Sat, May 9, 2020 at 1:41 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Lee,
>
> Thanks for the notice, I went ahead and subscribed to the list.
>
> As for Jon's email, this is actually what I have currently implemented!
> Once I finish ironing out a couple improvements, I'm going to move some
> code around to follow the existing coding style, put it on Github, and
> submit a pull request.
>
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Saturday, May 9, 2020 4:22 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>
> *Subject:* Fwd: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Hi Michael,
> I don't think you saw this email as I doubt you are subscribed to our
> dev@datasketches.apache.org email list.
>
> We would like to have you as part of our larger community, as others might
> also have suggestions on how to move your project forward.
> You can subscribe by sending an empty email to
> dev-subscribe@datasketches.apache.org.
>
> Lee.
>
> ---------- Forwarded message ---------
> From: *Jon Malkin* <jo...@gmail.com>
> Date: Thu, May 7, 2020 at 4:11 PM
> Subject: Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
> To: <de...@datasketches.apache.org>
> Cc: Lee Rhodes <lr...@verizonmedia.com>, Edo Liberty <
> edo.liberty@gmail.com>, edo@edoliberty.com <ed...@edoliberty.com>
>
>
> We're using pybind11 to get a C++ interface with python (vs raw C). The
> wrappers themselves are quite thin, but they do have examples of calling
> functions defined in the wrapper as opposed to only the sketch object.
>
> I believe the easiest way to do this will be to define a pretty simple C++
> object and create a pybind wrapper for it.  That object would contain a
> std::vector<kll_sketch>.  Then you'd define an update method for your
> custom object that iterates through a numpy array and calls update() on the
> appropriate sketch. You'd also want to define something similar for
> get_quantile() or whatever other methods you need that iterates through
> that vector of sketches and returns the result in a numpy array.
>
> That's a pretty lightweight object. And then you'd use a similar thin
> pybind wrapper around it to make it play nicely with python. Since our C++
> library is just templates, you'd end up with a free-standing library, with
> no requirement that the base datasketches library be involved.
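>
> From the python side, using such an object would look roughly like this
> (everything here is a placeholder -- no such class exists yet, this is just
> to show the shape of the API):
>
> import numpy as np
> from kll_vector_wrapper import vector_of_kll_sketches   # hypothetical module/class
>
> vk = vector_of_kll_sketches(k=160, d=100)   # 100 independent kll sketches
> vk.update(np.random.randn(100))             # one value per sketch, loop runs in C++
> medians = vk.get_quantiles([0.5])           # results come back as a numpy array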
>
>   jon
>
> On Thu, May 7, 2020 at 1:08 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> I would be happy to share whatever I come up with (if anything).  The lack
> of a Numpy/Scipy implementation is what led me to the DataSketches library,
> it would be very useful to myself and others if it were a part of
> Numpy/Scipy.
>
> For what it's worth, passing in a Numpy array and manipulating it from the
> C++ side is quite easy.  On the other hand, figuring out how to spawn m
> sketches and pass the values along to that looks like it'll be more
> challenging; there is a lot of code here and it'll take some time for me to
> familiarize myself with it.
>
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Thursday, May 7, 2020 12:00 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>
> *Cc:* Edo Liberty <ed...@gmail.com>; dev@datasketches.apache.org <
> dev@datasketches.apache.org>; edo@edoliberty.com <ed...@edoliberty.com>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> If you do figure out how to do this, it would be great if you could share
> it with us.  We would like to extend it to other sketches and submit it as
> an added functionality to NumPy.  I have been looking at the NumPy and
> SciPy libraries and have not found anything close to what we have.
>
> Lee.
>
>
> On Thu, May 7, 2020 at 7:08 AM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Lee, Jon,
>
> Thanks for the information.  I tried to vectorize things this morning and
> ran into that exact problem -- since the offsets can differ, it leads to
> slices of different lengths, which wouldn't be possible to store as a
> single Numpy array.
>
> Lee, your understanding of my problem is spot on.  n vectors of size m,
> where all m elements of each vector are a float (no NaNs or missing
> values).  I am interested in quantiles at rank r for each of the m
> streams.  Only 1 sketch will operate simultaneously, saving/loading the
> sketch is not required (though it would be a nice feature), and sketches
> would not need to be merged (no serialization/deserialization).
>
> Not surprisingly, it looks like your original suggestion of handling this
> on the C++ side is the way to go.  Once I have time to dive into the code,
> my plan is to write something that implements what you described in the
> earlier email.
>
> Thanks,
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Wednesday, May 6, 2020 10:43 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>
> *Cc:* dev@datasketches.apache.org <de...@datasketches.apache.org>; Edo
> Liberty <ed...@gmail.com>; edo@edoliberty.com <ed...@edoliberty.com>
>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Michael,
>
> One of my colleagues, Jon Malkin, pointed out that the vector-KLL will not
> work for another reason and that is for each dimension, choosing whether to
> delete the odd or even values in the compactor must be random and
> independent of the other dimensions.  Otherwise you might get unwanted
> correlation effects between the dimensions.
>
> This is another argument that you should have independent compactors for
> each dimension.  So you might as well stick with individual sketches for
> each dimension.
>
> Lee.
>
> On Wed, May 6, 2020 at 4:39 PM Lee Rhodes <lr...@verizonmedia.com>
> wrote:
>
> Michael,
>
> Allow me to back up for a moment to make sure I understand your problem.
>
> You have a large number of large vectors of the form *V_n = {x_i}:*  *n*
> vectors of size *m, *where *x* is a *number* and *x_i* is the *i*th
> element, or equivalently, the *i*th dimension.
>
> Assumptions:
>
>    - All vectors, *V*, are of the same size *m.*
>    - All elements, *x_i*, are valid numbers of the same type. No missing
>    values, and if you are using *floats*, this means no *NaN*s.
>
> In aggregate, the *n* vectors represent *m* *independent* distributions
> of values.
>
> Your task is to be able to obtain *m* quantiles at rank *r* in a single
> query.
>
> ****
> To do this, using your idea, would require vectorization of the entire
> sketch and not just the compactors.  The inputs are vectors, the result of
> operations such as getQuantile(r), getQuantileUpperBound(r),
> getQuantileLowerBound(r), are also vectors.
>
> This sketch will be a large data structure, which leads to more questions
> ...
>
>    - Do you anticipate having many of these vectorized sketches operating
>    simultaneously?
>    - Is there any requirement to store and later retrieve this sketch?
>    - Or, the nearly equivalent question: Do you require merging of these
>    sketches (across clusters, for example)?  Which also means serialization
>    and deserialization.
>
> I am concerned that this vector-quantiles sketch would be limited in the
> sense that it may not be as widely applicable as it could be.
>
> Our experience with real data is that it is ugly with missing values, NaN,
> nulls, etc.  Which means we would not be able to vectorize the compactor.
> Each dimension *i* would need a separate independent compactor because
> the compaction times will vary depending on missing values or NaNs in the
> data.
>
> Spacewise, I don't think having separate independent sketches for each
> dimension would be much smaller than vectorizing the entire sketch, because
> the internals of the existing sketch are already quite space efficient
> leveraging compact arrays, etc.
>
> As a first step I would favor figuring out how to access the NumPy data
> structure on the C++ side, having individual sketches for each
> dimension, and doing the iterations updating the sketches in C++.   It also
> has the advantage of leveraging code that exists and it would automatically
> be able to leverage any improvements to the sketch code over time.  In
> addition, it could be a prototype of how to integrate other sketches into
> the NumPy ecosystem.
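>
> In Python terms, that first step is essentially the following, with the
> inner loop being the part we would push down into C++ (a sketch using only
> the existing single-sketch binding):
>
> import numpy as np
> from datasketches import kll_floats_sketch
>
> d, k = 100, 160
> sketches = [kll_floats_sketch(k) for _ in range(d)]   # one sketch per dimension
>
> for _ in range(int(1e4)):
>     v = np.random.randn(d)                 # one incoming vector
>     for j, s in enumerate(sketches):       # this per-element loop belongs in C++
>         s.update(float(v[j]))
>
> medians = np.array([s.get_quantile(0.5) for s in sketches])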
>
> A fully vectorized sketch would be a separate implementation and would not
> be able to take advantage of these points.
>
> Lee.
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> On Wed, May 6, 2020 at 2:47 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Lee,
>
> I don't think there is a problem with the DataSketches library, just that
> it doesn't support what I am trying to do -- looking in the documentation,
> it only supports streams of ints or floats, and those situations work fine
> for me.  Here's what I did:
> - began with the KLL test .py file:
> https://github.com/apache/incubator-datasketches-cpp/blob/master/python/tests/kll_test.py
> - replaced line 30 with kll.update(np.ones(10) * randn())  to have a Numpy
> array of 10 identical values.
> - ran the code
>
> This leads to the following error, as expected:
> TypeError: update(): incompatible function arguments. The following
> argument types are supported:
>     1. (self: datasketches.kll_floats_sketch, item: float) -> None
>
> Invoked with: <datasketches.kll_floats_sketch object at 0x7f1e128989d0>,
> array([-1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424,
>        -1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424])
>
> It's not coded to support Numpy arrays, so it complains.  What I
> would ideally like to have happen in this scenario is it would treat each
> element in the array as a separate stream.  Then, later when getting a
> given quantile, it would give 10 values, one for each stream.  I don't see
> an easy approach to implementing this on the Python side besides a very
> slow iterative approach, and admittedly my C++ is quite rusty so I haven't
> looked into the codebase to see how I might modify things there to support
> this functionality.
>
> Re: the streaming-quantiles code being easily modified, I believe the only
> necessary changes would be changing the Compactor class to be a subclass of
> numpy.ndarray, rather than list, and implementing the list-specific
> methods that are used, like .append().  Then, it isn't
> necessary to loop over the streams since we can make use of Numpy's
> broadcasting, which will handle the looping in its C++ code, as you
> mentioned.  I'll work on this and see if it really is as straight-forward
> as it seems.
>
> If you have any advice on how to use DataSketches for my problem, I'm
> certainly open to that.
>
> Thanks,
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Wednesday, May 6, 2020 4:37 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>; dev@datasketches.apache.org
> <de...@datasketches.apache.org>
> *Cc:* Edo Liberty <ed...@gmail.com>; edo@edoliberty.com <
> edo@edoliberty.com>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Michael,
>
> Thank you for considering the DataSketches library.   I am adding this
> thread to our dev@datasketches.apache.org so that our whole team can
> contribute to finding a solution for you.
>
> WRT the error you experienced, please help us help you by sharing with us
> what the exact error was.
>
> We are about to release a major upgrade to the DataSketches C++/Python
> product in the next few weeks.  We have fixed a number of stability issues
> and bugs, which may solve the problem.  Nonetheless, we want to work with
> you to get your problem solved.
>
> Updating 1e5 sketches in a system is not a problem in Java or C++.   We
> have real-time systems today that generate and process over 1e9 sketches
> every day.  Unfortunately our experience tells us that looping in Python
> code will be 10 to 100 times slower than Java or C++.  This is because the
> code would have to switch from Python to C++ for every vector element.
>
> By comparison, the streaming-quantiles code could be easily modified to
> use Numpy arrays and operate on vectors.
>
>
> I would like to understand more about what you have in mind that would be
> "easily modified".
>
> NumPy achieves its speed performance by doing all of the matrix operations
> in pre-compiled C++ code.  To achieve best performance, we would want to
> read and loop through the NumPy data structure on the C++ side leveraging
> the C++ DataSketches library directly.  I am not sure what would be
> involved to actually accomplish that.
>
> But first we need to get your Python + NumPy code working correctly with
> our library so we can find out what its actual performance is.
>
> Cheers,
>
> Lee.
>
>
>
>
>
> On Wed, May 6, 2020 at 12:10 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Edo, Lee,
>
> Thanks for the prompt response.  I looked at the datasketches library, and
> while it seems to have a lot more features, it looks like it'll be a lot
> more difficult to get it to work for my desired use case.
>
> My problem is that I need quantiles for each element of a vector (length
> on the order of 1e4 -- 1e5), for some finite stream of vectors (on the
> order of 1e6 -- 1e8).  I tried using datasketches's KLL with Numpy arrays,
> but it throws an error, so it doesn't seem like datasketches handles this
> situation currently.
>
> To use datasketches, I think I would need to instantiate 1 object per
> vector element, and I suspect this will slow things down considerably due
> to iterating over the objects when each vector is processed.  By
> comparison, the streaming-quantiles code could be easily modified to use
> Numpy arrays and operate on vectors.  I ran a few unit tests on both codes
> and found equivalent behavior, as expected.
>
> Do you have any recommendation(s) for this situation?  Are there known
> limitations of the streaming-quantiles code that would cause issues for my
> use case?  Are the other methods offered in datasketches 'better' than the
> KLL implemented in streaming-quantiles?  I'm quite out of my area of
> expertise, so I appreciate any advice you can offer, and I will of course
> acknowledge it in the publication.
>
> Best,
> Michael
>
> ------------------------------
> *From:* Edo Liberty <ed...@gmail.com>
> *Sent:* Tuesday, May 5, 2020 8:09 PM
> *To:* Lee Rhodes <lr...@verizonmedia.com>; Michael Himes <
> mhimes@knights.ucf.edu>
> *Cc:* edo@edoliberty.com <ed...@edoliberty.com>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> +Lee
>
> Hi Michael, Thanks for reaching out.
> While you can certainly do that, I recommend using the Python-bound
> datasketches library. It will be more robust, faster, and bug-free than my
> code :)
>
> On Tue, May 5, 2020 at 14:11 Michael Himes <mh...@knights.ucf.edu> wrote:
>
> Hi Edo,
>
> I'm currently working on a Python package for machine-learning-accelerated
> exoplanet modeling.  It is free and open source (see here if you're curious
> https://github.com/exosports/HOMER),
> and it's meant purely for reproducible academic research.
>
> I'm adding some new features to the software, and one of them requires
> computing quantiles for a data set that cannot fit into memory.  After
> searching around for different methods to do this, your KLL method seemed
> to be a good option in terms of speed and space requirements.
>
> Rather than reinvent the wheel and code my own implementation of the
> method from scratch, I was wondering if you'd be willing to allow me to use
> your code?  I don't see a license, so I wanted to make sure you're okay
> with this.  I could implement it as a submodule within my repo, or I could
> only include the kll.py file and add some additional comments pointing to
> your repo and such, whichever you prefer.
>
> Best,
> Michael
>
> --
> From my cell phone.
>
>

Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Posted by Michael Himes <mh...@knights.ucf.edu>.
That's a great motto to code by!

I adapted the existing unit tests for kll_sketch to work for the new kll_sketches class, and everything seems to be working as intended.  Some things are not implemented -- merging, and the normalized_rank_error method (note that there is the get_normalized_rank_error static method) -- and are therefore not tested.  Once they are implemented into the class, then those tests can be added.

I've submitted a pull request, let me know if there are any other tests you'd like before it's considered tested & working.

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>
Sent: Tuesday, May 26, 2020 1:53 AM
To: dev@datasketches.apache.org <de...@datasketches.apache.org>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

I think it now works for quantiles, rank, pmf, and cdf.

This exercise is a good example of why my colleague operates by the motto that if it isn't tested, it's broken. In very much related news, we need unit tests for this thing, in either C++ or python (probably the latter unless we move it into the core C++ part of the repo).

  jon

On Mon, May 25, 2020 at 2:06 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Ah gosh, that was silly on my part.

So, I ran the previous code without that silly mistake, then called kll.get_quantiles(0.5) and it threw this error:

TypeError: get_quantiles(): incompatible function arguments. The following argument types are supported:
    1. (self: datasketches.kll_floatarray_sketches, fractions: List[float], isk: numpy.ndarray[int32] = -1) -> array

Invoked with: <datasketches.kll_floatarray_sketches object at 0x7f610ce7de30>, 0.5

I also tried kll.get_quantiles([0.5]) and the Numpy array equivalent, and it throws this error:

ValueError: array has incorrect number of dimensions: 0; expected 1

This error happens even when I do kll.get_quantiles([0.5, 0.7]) or the Numpy array equivalent, even though it has 1 dimension, not 0.

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Monday, May 25, 2020 4:53 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

That's the range() command complaining -- 1e6 is a float, but range wants an int. It worked if I instead changed the line to
for i in range(int(1e6)):

  jon

On Mon, May 25, 2020 at 1:36 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Jon,

Just got around to testing it out.  Maybe I am doing something wrong here, but I can't get the code to work correctly.  Here's the code:

import numpy as np
from datasketches import kll_floatarray_sketches
k = 160
d = 3
kll = kll_floatarray_sketches(k, d)
for i in range(1e6):
  kll.update(np.random.randn(d))

And here's the error:

TypeError: 'float' object cannot be interpreted as an integer

Seems like the inputs have changed, but the inputs in the code look pretty similar.  Can you point out what I'm doing wrong here?

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Friday, May 22, 2020 6:21 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Hi Michael,

My default is to treat an input vector x as a column vector -- the generic quadratic form x^T A x assumes that, for instance. But might be an engineering thing. Following your approach for now and eventually we can debate whether to transpose the matrix if one dimension matches the number of sketches in the object but not the expected one.

Anyway, I looked more at the docs and see them using unchecked references (after doing a bounds check) so I switched to that, and then I added in a check for c-style vs fortran-style indexing so that I believe it'll have the inner loop over the native dimension. In theory it'll walk linearly through the matrix. That or I got it exactly backwards and am thrashing some cache level, one of the two :D

If you have some time, please check out the branch and play with it for a bit to ensure it's still behaving as you expect. Then we can figure out some relevant unit tests,

  jon



On Fri, May 22, 2020 at 7:06 AM Michael Himes <mh...@knights.ucf.edu>> wrote:
Jon,

Those changes sound great, as long as the data is being accessed correctly. The pybind docs warn about accessing data through the array_t object since it's not guaranteed to be contiguous in memory.  Typically, they demonstrate accessing it through the buffer, which I followed.  But if this is an unnecessary step, then great.

As for the 2D case, here is my line of thinking.  For 1D, we have a single row with d values.  So for 2D, we'd have n rows with d values, (n x d).  I believe that is how I coded it, but it's possible I flipped the dimensions.

Michael

________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Thursday, May 21, 2020 7:17 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Michael,

I've restructured the object to be an actual C++ object with proper methods. And then I've gotten rid of all the casts to buffer in favor of just using the py::array_t<> that's passed in. That removes casting everything to double, and allows for range checks. Now an attempt to access sketch 7 in a 5-d array doesn't just segfault :)

Looking at pybind docs a bit more, it seems there are no hard guarantees on data layout in memory with numpy arrays -- if you transpose one, walking through with a pointer will return items in the wrong order. So update() ends up using items.at() instead (more on that in a moment). The whole thing is probably also copying values around more than necessary. Anyway, we can look at ways to optimize such things eventually, but for now I'm working on ensuring correctness and at least somewhat graceful failure.

Anyway, item input order. If we have 1-d input, we implicitly assume we want d updates, one for each dimension in the object. It seems like the default for numpy is row-major order, which makes sense given C beneath the hood. But for inputting n points at a time, do you expect the matrix to be (d x n) or (n x d)?

  jon

On Tue, May 19, 2020, 5:20 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Re: the template type A, I set that for the Python array data type.  A Python float is 64 bits, so that is a C++ double.  I thought it was necessary to set the py::array_t data type since I think it's a template, but I could be mistaken.

Michael

________________________________
From: leerho <le...@gmail.com>>
Sent: Tuesday, May 19, 2020 7:46 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Excellent work!

On Tue, May 19, 2020 at 4:04 PM Jon Malkin <jo...@gmail.com>> wrote:
I also used k=160, so in this case we matched nicely. And the bunches of 2^5 or 2^7 you were testing is exactly what I meant when referring to batched inputs. So that's good news.

I'll take a more careful look through the code -- there was something with update using arrays of templated type A which was always cast to double, for instance. But this is certainly promising.

  jon

On Tue, May 19, 2020 at 3:32 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Great tests (especially with the ordering), Jon!

I did some scaling tests for dimensionality (1, 10, and 100), and this is where I think the Numpy version shows its benefits.  I performed a test similar to your setup:
- each sketch has k = 160 (unsure what you used for this value, if it matters)
- 2^25 draws from a normalized Gaussian distribution (numpy.random.normal)
- get_quantiles(0.5)

d=1    -- 84 s (this is the 123 s case you ran)
d=10   -- 88 s
d=100  --  294 s
d=1000 -- 2298 s (did this one for fun, but there is a lot of variability in runtime)

Note that I did not use a single-value method, just the Numpy version.  Also, I checked the compute cost of the Python loop, and it's about 1 second, so most of that ~80 seconds is the communication between Python and C++.  The scaling relation looks to be better than linear, but there needs to be a few more tests here to really determine that.

But, as Lee pointed out, there is non-negligible overhead from crossing the bridge between Python and C++.  It's small, but when doing it 2^25 times it adds up.  The Numpy implementation allows you to cross that bridge much less often, albeit at the cost of some extra time programming that part.  If I set up a queue that holds 2^5 values and then updates it, it's quite a bit better.  Here are the results for the same dimensions as before:

d=1   -- 8 s
d=10  -- 31 s
d=100 -- 257 s

So, even with a small queue of 32 values, we see that a single sketch using kll_sketches is faster than a kll_sketch by a factor of 2-3.  And with the batch set to 2^7 values (this is how I use it in my project):
d=1   -- 4.2 s
d=10  -- 27 s
d=100 -- 251 s

The speed gain doesn't seem to scale with dimensionality, but I think that has more to do with the compute overhead of generating the data since Numpy tends to be faster when working in 1D vs multiple dimensions.  But we can see that it's possible to get runtimes much closer to C++ runtimes than would be expected.
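
For reference, the queueing I describe is roughly the following (just a sketch against the kll_floatarray_sketches interface from this thread, assuming the (n x d) batched update discussed above; it's not the exact code from my project):

import numpy as np
from datasketches import kll_floatarray_sketches

k, d, batch = 160, 100, 2**7
kll = kll_floatarray_sketches(k, d)
queue = np.empty((batch, d))
for i in range(2**25):
    queue[i % batch] = np.random.normal(size=d)
    if (i + 1) % batch == 0:
        kll.update(queue)   # one Python-to-C++ crossing per 2^7 rows instead of per row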

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Tuesday, May 19, 2020 4:58 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Well, one thought was maybe we could always use the vectorized kll in python and make it (relatively) easy to have it work with only 1 dimension. It looks like there's still a non-trivial performance hit from that. But wow.. I realized I could try something simple like reversing the declaration order of single-update vs vector-update in the wrapper class. And that dropped it to 35s!

With that, it may be worth exploring a unified wrapper that handles single items or vectors.

  jon

On Tue, May 19, 2020 at 1:52 PM leerho <le...@gmail.com>> wrote:
We had a similar issue in Java trying to use JNI to access C code.  Every transition across the "boundary" between Java and C took from 10 to 100 microseconds.  This made the JNI option pretty useless from our standpoint.

I don't know python that well, but I could well imagine that there may be a similar issue here in moving data between Python and C++.

That being said, compared to brute-force computation of these types of queries in Python vs using even these (what we consider slow performing) sketches in Python still may be a huge win.

Lee.



On Tue, May 19, 2020 at 1:28 PM Jon Malkin <jo...@gmail.com>> wrote:
I tried comparing the performance of the existing floats sketch vs the new thing with a single dimension. And then I made a second update method that handles a single item rather than creating an array of length 1 each time. Otherwise, the scripts were as identical as possible. I fed in 2^25 gaussian-distributed values and queried for the median to force some computation on the sketch. I think get_quantile(0.5) vs get_quantiles(0.5)[0][0] was the only difference.

Existing kll_floats_sketch: 31s
kll_floatarray_sketches: 123s
with single-item update: 80s

Same test in c++: 1.7s  (I can get it to 1.4s but that's using a worse RNG so this seemed more fair)

I didn't try anything with trying to batch updates, even though in theory the new object can support that. This was more a test to see the performance impact of using it for all kll sketches.

At some level, if you're already ok taking the speed hit for python vs C++ then maybe it doesn't matter. But >2x still seems significant to me.

  jon

On Thu, May 14, 2020 at 6:54 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Great, I'll be submitting the pull request shortly.  The codebase I'm working with doesn't have any of the changes made in the past week or so, hopefully that isn't too much of a hassle to merge.

As an aside, my employer encourages us to contribute code to libraries like this, so I'm happy to work on additional features for the Python interface as needed.

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Thursday, May 14, 2020 6:56 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

We've been polishing things up for a release, so that was one of several things that we fixed over the last several days. Thank you for finding it!

Anyway, if you're generally happy with the state of things (and are allowed to under any employment terms), I'd encourage you to create a pull request to merge your changes into the main repo. It doesn't need to be perfect as we can always make changes as part of the PR review or post-merge.

Thanks,
  jon


On Mon, May 11, 2020 at 2:25 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Thanks for taking a look, Jon.

I pushed an update that addresses 2 & 4.

#3 is actually something I had a question about. I've tested passing numpy.nan into the update function, and it doesn't appear to break anything (min, max, etc all still work correctly).  However, the reported number of items per sketch counts the nan entries.  Is this the expected behavior, or should the get_n() method return a number that does not count the nans it has seen?  I expected the latter, so I'm worried that numpy's nan is being treated differently.

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Monday, May 11, 2020 4:32 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

I didn't look in super close detail, but the code overall looks pretty good. Comments are below.

Note that not all of these necessarily need changes or replies. I'm just trying to document things we'll want to think about for keeping the library general-purpose (and we can always make changes after merging, of course).

1. I worry the name kll_sketches is confusingly similar to kll_sketch. Maybe vector_kll_sketches? But if there's a way to extend KLL in the future to operate on an entire vector at a time (vs treating each dimension independently) that'd become confusing. I think an inherently vectorized version would be a very different beast, but I always worry I'm not being imaginative enough. If merging into the Apache codebase, I'd probably wait to see what the file looks like with the renaming before a final decision on moving to its own file.

2. What happens if the input to update() has >2 dimensions? If that'd be invalid, we should explicitly check and complain. If it'll Do The Right Thing by operating on the first 2 dimensions (meaning correct indices) that's fine, but otherwise should probably complain.

3. Can this handle sparse input vectors? Not sure how important that is in general, even if your project doesn't require it. kll_sketch will ignore NaNs, so those appearing would mean the number of items per sketch can already differ.

4. I'd probably eat the very slightly increased space and go with 32 bits for the number of dimensions (aka number of sketches). If trying to look at a distribution of values for some machine learning application, it'd be easy to overflow 65k dimensions for some tasks.

5. I imagine you've realized that it's easiest to do unit tests from python in this case. That's another advantage of having this live in the wrapper.

6. Finally, that assert issue is already obsolete :). Asserts were converted to if/throw exceptions late last week. It'll be flagged as a conflict in merging, so no worries for now.

Looking good at this point. And as I said, not all of these need changes or comments from you.

  jon

On Mon, May 11, 2020 at 7:09 AM Michael Himes <mh...@knights.ucf.edu>> wrote:
Understood, I went ahead and moved the new class to the kll_wrapper.cpp file -- I'll leave it to you to decide if it's better as its own file.

Also, while gcc 7.4.0 compiles the code without issue, using gcc 7.5.0 throws errors regarding the assert calls in kll_sketch_impl.hpp.  I added an include of assert.h there and then it compiled without issue.  It's possible that other compilers will also complain about that, so maybe this is a good update to the main branch.

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Sunday, May 10, 2020 10:47 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

My only comment without having looked at actual code is that the new class would be more appropriate in the python wrapper. Maybe even drop it in as its own file, as that would decrease recompile time a bit when debugging (that's pybind's suggestion, anyway). Probably not a huge difference with how light these wrappers are.

If this is something that becomes widely used, to where we look at pushing it into the base library, we'd look at whether we could share any data across sketches. But we're far from that point currently. It'd be nice to eventually get to the point of needing to consider that.

  jon

On Sun, May 10, 2020, 7:33 PM leerho <le...@gmail.com>> wrote:
Michael,  this has been a great interchange and certainly will allow us to move forward more quickly.

Thank you for working on this on a Mother's Day Sunday!

I'm sure Alex and Jon may have more questions, when they get a chance to look at it starting tomorrow.

Cheers, and be safe and well!

Lee.

On Sun, May 10, 2020 at 6:25 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Re: testing, so far I've just done glorified unit tests for uniform and normal distributions of varying sizes.  I plan to do some timing tests vs the existing single-sketch Python class to see how it compares for 1, 10, and 100 streams.

1. That makes sense.  One option to allow full Numpy compatibility but without requiring a Python user to use Numpy would be to return everything as lists, rather than Numpy arrays.  Numpy users could then convert those lists into arrays, and non-Numpy users would be unaffected (aside from needing the pybind11/numpy.h header).  Alternatively, some flag could be set when instantiating the object that would control whether things are returned as lists or arrays, though this still requires the numpy.h header file.

2. I didn't change the kll_sketch code; I only defined a new (wrapper) class called kll_sketches, which spawns a user-specified number of sketches.  Each of those sketches is a kll_sketch object and uses all of the existing code for that.  For fast execution in Python, the parallel sketches must be spawned in C++, but the existing Python object could only spawn a single sketch since it wraps the kll_sketch class.  Perhaps the kll_sketches class would be better placed in the python/src/kll_wrapper.cpp file?  I suppose you wouldn't need this class if you weren't using Python.

3. Yes, SerDe is very straight-forward here.  I've marked some stuff as todo's, and that is one of them -- the plan is to do like you described and call the relevant kll_sketch method on each of the sketches and return that to Python in a sensible format.  For deserialization, it would just iterate through them and load them into the kll_sketches object.  I don't require it for my project, so I didn't bother to wrap that yet -- I'll take a look sometime this week after I finish my work for the day, shouldn't take long to do.

4. That makes sense.  Does using Numpy complicate that at all?  My thought is that since under the hood everything is using the existing kll_sketch class, it would have full compatibility with the rest of the library (once SerDe is added in).

Michael
________________________________
From: leerho <le...@gmail.com>>
Sent: Sunday, May 10, 2020 8:42 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Thanks for the link to your code.  My colleagues, Jon and Alex, will take a closer look this next week.  They wrote this code so they are much closer to it than I.

What you have done so far makes sense for you as you want to get this working in the NumPy environment as quickly as possible.  As soon as we start thinking about incorporating this into our library other concerns become important.

1. Adding API calls is the recommended way to add functionality (like NumPy) to a library.  We cannot change API calls in a way that is only useful with NumPy, because it would seriously impact other users of the library that don't need NumPy.  If both sets of calls cannot simultaneously exist in the same sketch API, then we need to consider other alternatives.

2.  Based on our previous discussions, I didn't envision that you would have to change the kll_sketch code itself other than perhaps a "wrapper" class that enables vectorized input to a vector of sketches and a vectorized get result that creates a vector result from a vector of sketches.  This would isolate the changes you need for NumPy from the sketch itself.  This is also much easier to support, maintain and debug.

3. If you don't change the internals of the sketch then SerDe becomes pretty straightforward. I don't know if you need a single serialization that represents a full vector of sketches,  but if you do, then I would just iterate over the individual serdes and figure out how to package it.  I really don't think you want to have to rewrite this low-level stuff.

4. Binary compatibility is critically important for us and I think will be important for you as well.  There are two dimensions of binary compatibility: history and language.  This means that a kll sketch serialized from Java can be successfully read by C++ and vice versa.  Similarly, a kll sketch serialized today will be able to be read many years from now.  Another aspect of this would mean being able to collect, say, 100 sketches that were not created using the NumPy version, and being able to put them together in a NumPy vector; and vice versa.

I hope all of this makes sense to you.

Cheers,

Lee.



On Sun, May 10, 2020 at 4:21 PM leerho <le...@gmail.com>> wrote:
Michael,
This is great!  What testing have you been able to do so far?


On Sun, May 10, 2020 at 3:31 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Lee,

Thanks for all of that information, it's quite helpful to get a better understanding of things.

I've put the code on Github if you'd like to take a look: https://github.com/mdhimes/incubator-datasketches-cpp

Changes are
- new class in kll/include/kll_sketch.hpp, w/ associated constructor in kll/include/kll_sketch_impl.hpp.  This class spawns a specified number of sketches.
- new Python interface functions in python/src/kll_wrapper.cpp

The only new dependency introduced is the pybind11/numpy.h header file.  The new Numpy-compatible Python classes retain identical functionality to the existing classes (with minor changes to method names, e.g., get_min_value --> get_min_values), except that I have not yet implemented merging or (de)serialization.  These would be straight-forward to implement, if needed.

Re: characterization tests, I'll take a look at those tests you linked to and see about running them, time and compute resources permitting.

Michael
________________________________
From: leerho <le...@gmail.com>>
Sent: Sunday, May 10, 2020 5:32 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Michael,

Is there a place on GitHub somewhere where I could look at your code so far?  The reason I ask, is before you do a PR, we would like to determine where a contribution such as this should be placed.

Our library is split up among different repositories, determined by language and dependencies.  This keeps the user downloads smaller and more focused.   We have two library repos for the core sketch algorithms, one for Java and one for C++/Python, where the dependencies are very lean, which simplifies integration into other systems.  We have separate repos for adaptors, which depend on one of the core repos. On the Java side, we have separate repos for adaptors for Apache Hive and Apache Pig, as the dependencies for each of these are quite large.  For C++, we have a dedicated repo for the adaptors for PostgreSQL.

Some of our adaptors are hosted with the target system.  For example, our Druid adaptors were contributed directly into Apache Druid.

I assume your code has dependencies on Python, NumPy and DataSketches-cpp. It is not clear to me at the moment whether we should create a separate repo for this or have a separate group of directories in our cpp repo.

****
We have a separate repo for our characterization code, which is not formally "released" as an Apache release.  It exists because we want others to be able to reproduce (or challenge) our claims of speed performance or accuracy.  It is the one repo where we have all languages and many different dependencies.  The coding style is not as rigorous or as well documented as our repos that do have formal releases.

Characterization testing is distinctly different from Unit Tests, which basically checks all the main code paths and makes sure that the program works as it should.  The key metric is code coverage and Unit Tests should be fast as it is run on every check-in of new code.  Characterization is also different from Integration Testing, which is testing how well the code works when integrated into larger systems.

Characterization tests are unique to our kind of library. Because our algorithms are probabilistic in nature, in order to verify accuracy or speed performance we need to run many thousands of trials to eliminate statistical noise in the results.  And when the data is large, this can take a long time.  You can peruse our website for many examples as all the plots result from various characterization studies.  What appears on the website is but a small fraction of all the testing we have done.

There are no "standard" tests as every sketch is different so we have to decide what is important to measure for a particular sketch, but the basic groups are speed and accuracy.

For speed there are many possible measurements, but the basic ones are update speed, merge speed, Serialization / Deserialization speed, get estimate or get result speeds.

For accuracy we want to validate that the sketch is performing within the bounds of the theoretical error distribution.  We want to measure this accuracy in the context of a stand-alone, purely streaming sketch and also in the context of merging many sketches together.

We also try to do these same tests comparing the results against other alternatives users might have.  We have performed these same characterizations on other publically available sketches as well as against traditional, brute-force approaches to solving the same problem.

For the solution you have developed, we would depend on you to decide what properties would be most important to characterize for users of this solution.  It should be very similar to what you would write in a paper describing this solution;  you want to convince the reader that this is very useful and why.

Since the first sketch you have leveraged is the KLL quantiles sketch, I would think you would want some characterizations similar to what we did for our studies (https://datasketches.apache.org/docs/Quantiles/KLLSketch.html) comparing our older quantiles sketch and the KLL sketch.

****
For the Java characterization tests, we have "standardized" on having small configuration files which define the key parameters of the test.  These are simple text files (https://github.com/apache/incubator-datasketches-characterization/tree/master/src/main/resources) of key-value pairs.  We don't have any centralized definition of these pairs, just that they are human readable and intelligible.  They are different for each type of sketch.

For the C++ tests, we don't have a collection of config files yet (this is one of our TODOs), but the same kind of parameters are set in the code itself.

We will likely want to set up a separate directory for your characterization tests.

I hope you find this helpful.

Cheers,

Lee.

On Sun, May 10, 2020 at 10:05 AM Michael Himes <mh...@knights.ucf.edu>> wrote:
The code is in a good state now.  It can take individual values, lists, or Numpy arrays as input, and it returns back Numpy arrays.  There are some additional features, like being able to specify which sketches the user wants to, e.g., get quantiles for.

But, I have only done minor testing with uniform and normal distributions.  I'd like to put it through more extensive testing (and some documentation) before releasing it, and it sounds like your characterization tests are the way to go -- it's not science if it's not reproducible!  Is there a standard set of tests for this purpose?  If not, are there standard tests that have been used for the existing codebase?

Michael
________________________________
From: leerho <le...@gmail.com>>
Sent: Saturday, May 9, 2020 7:21 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

This is great.  The first step is to get your project working!  Once you think you are ready, it would be really useful if you could do some characterization testing in the NumPy environment. Characterization tests are what we run to fully understand how a sketch performs over a range of parameters and using thousands to millions of trials.  You can see some of the accuracy and speed performance plots of various sketches on our website.  Sometimes these can take hours to run.  We typically use synthetic data to drive our characterization tests to make them reproducible.

Real data can also be used and one comparison test I would recommend is comparing how long it takes to get approximate results using sketches versus how long it would take to get exact results using brute force methods.  The bigger the data set is the better :)
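
As a rough illustration of the kind of comparison I mean (just a sketch using the existing kll_floats_sketch Python binding; the per-item Python loop here still pays the Python-to-C++ crossing cost mentioned elsewhere in this thread, and the real win comes when the data cannot be held or sorted in memory):

import time
import numpy as np
from datasketches import kll_floats_sketch

data = np.random.randn(2**22)

t0 = time.time()
exact = np.quantile(data, 0.99)          # brute force: requires holding and sorting all the data
t1 = time.time()

sk = kll_floats_sketch(200)
for v in data:                           # in a real characterization this loop would live in C++
    sk.update(float(v))
approx = sk.get_quantile(0.99)
t2 = time.time()

print('exact : %.4f  (%.2f s)' % (exact, t1 - t0))
print('sketch: %.4f  (%.2f s)' % (approx, t2 - t1))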

We don't have much experience with NumPy so this will be a new environment for us.  But before you get too deep into this please get us involved.  We have been characterizing these streaming algorithms for a number of years, and would like to help you.

Cheers,

Lee.

On Sat, May 9, 2020 at 2:18 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
I'm not quite sure what being a committer entails, but yeah I'm happy to contribute.  I can't commit a lot of time to working on it, but with how things went for KLL I don't think it will take a lot of time for the other sketches if they are formatted in a similar manner.  Getting this library integrated into numpy/scipy would be awesome, I'm sure I could get some others in my field to begin using it.

Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Saturday, May 9, 2020 5:06 PM
To: Michael Himes <mh...@knights.ucf.edu>>; dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

This is just awesome!   Would you be interested in becoming a committer on our project?  It is not automatic, but we could work with you to bring you up to speed on the other sketches in the library.  If you could help us integrate DataSketches into NumPy and possibly SciPy (not sure if this is necessary) it would be a very significant contribution and we would definitely want you to be part of our community!

Thanks,

Lee.

On Sat, May 9, 2020 at 1:41 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Lee,

Thanks for the notice, I went ahead and subscribed to the list.

As for Jon's email, this is actually what I have currently implemented!  Once I finish ironing out a couple improvements, I'm going to move some code around to follow the existing coding style, put it on Github, and submit a pull request.

Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Saturday, May 9, 2020 4:22 PM
To: Michael Himes <mh...@knights.ucf.edu>>
Subject: Fwd: Permission to use KLL streaming-quantiles code in free open-source academic software

Hi Michael,
I don't think you saw this email as I doubt you are subscribed to our dev@datasketches.apache.org<ma...@datasketches.apache.org> email list.

We would like to have you as part of our larger community, as others might also have suggestions on how to move your project forward.
You can subscribe by sending an empty email to dev-subscribe@datasketches.apache.org<ma...@datasketches.apache.org>.

Lee.

---------- Forwarded message ---------
From: Jon Malkin <jo...@gmail.com>>
Date: Thu, May 7, 2020 at 4:11 PM
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software
To: <de...@datasketches.apache.org>>
Cc: Lee Rhodes <lr...@verizonmedia.com>>, Edo Liberty <ed...@gmail.com>>, edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>


We're using pybind11 to get a C++ interface with python (vs raw C). The wrappers themselves are quite thin, but they do have examples of calling functions defined in the wrapper as opposed to only the sketch object.

I believe the easiest way to do this will be to define a pretty simple C++ object and create a pybind wrapper for it.  That object would contain a std::vector<kll_sketch>.  Then you'd define an update method for your custom object that iterates through a numpy array and calls update() on the appropriate sketch. You'd also want to define something similar for get_quantile() or whatever other methods you need that iterates through that vector of sketches and returns the result in a numpy array.

That's a pretty lightweight object. And then you'd use a similar thin pybind wrapper around it to make it play nicely with python. Since our C++ library is just templates, you'd end up with a free-standing library, with no requirement that the base datasketches library be involved.
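
To make that concrete, here's a pure-python sketch of the same structure using the existing kll_floats_sketch binding (slow, since it still crosses the python/C++ boundary per element, but the C++ object would mirror this layout; the class name here is just illustrative):

import numpy as np
from datasketches import kll_floats_sketch

class vector_of_kll_sketches:
    def __init__(self, k, d):
        # one independent kll_sketch per dimension
        self.sketches = [kll_floats_sketch(k) for _ in range(d)]

    def update(self, values):
        # values: length-d array, one entry per dimension
        for sk, v in zip(self.sketches, np.asarray(values, dtype=float)):
            sk.update(float(v))

    def get_quantiles(self, fraction):
        # one quantile estimate per dimension
        return np.array([sk.get_quantile(fraction) for sk in self.sketches])

# usage: one update per input vector, one value per dimension on query
vkll = vector_of_kll_sketches(160, 10)
vkll.update(np.random.randn(10))
print(vkll.get_quantiles(0.5))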

  jon

On Thu, May 7, 2020 at 1:08 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
I would be happy to share whatever I come up with (if anything).  The lack of a Numpy/Scipy implementation is what led me to the DataSketches library, it would be very useful to myself and others if it were a part of Numpy/Scipy.

For what it's worth, passing in a Numpy array and manipulating it from the C++ side is quite easy.  On the other hand, figuring out how to spawn m sketches and pass the values along to that looks like it'll be more challenging, there is a lot of code here and it'll take some time for me to familiarize myself with it.

Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Thursday, May 7, 2020 12:00 PM
To: Michael Himes <mh...@knights.ucf.edu>>
Cc: Edo Liberty <ed...@gmail.com>>; dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>; edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

If you do figure out how to do this, it would be great if you could share it with us.  We would like to extend it to other sketches and submit it as added functionality to NumPy.  I have been looking at the NumPy and SciPy libraries and have not found anything close to what we have.

Lee.


On Thu, May 7, 2020 at 7:08 AM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Lee, Jon,

Thanks for the information.  I tried to vectorize things this morning and ran into that exact problem -- since the offsets can differ, it leads to slices of different lengths, which wouldn't be possible to store as a single Numpy array.

Lee, your understanding of my problem is spot on.  n vectors of size m, where all m elements of each vector are a float (no NaNs or missing values).  I am interested in quantiles at rank r for each of the m streams.  Only 1 sketch will operate simultaneously, saving/loading the sketch is not required (though it would be a nice feature), and sketches would not need to be merged (no serialization/deserialization).

Not surprisingly, it looks like your original suggestion of handling this on the C++ side is the way to go.  Once I have time to dive into the code, my plan is to write something that implements what you described in the earlier email.

Thanks,
Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Wednesday, May 6, 2020 10:43 PM
To: Michael Himes <mh...@knights.ucf.edu>>
Cc: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>; Edo Liberty <ed...@gmail.com>>; edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>

Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Michael,

One of my colleagues, Jon Malkin, pointed out that the vector-KLL will not work for another reason and that is for each dimension, choosing whether to delete the odd or even values in the compactor must be random and independent of the other dimensions.  Otherwise you might get unwanted correlation effects between the dimensions.

This is another argument that you should have independent compactors for each dimension.  So you might as well stick with individual sketches for each dimension.

Lee.

On Wed, May 6, 2020 at 4:39 PM Lee Rhodes <lr...@verizonmedia.com>> wrote:
Michael,

Allow me to back up for a moment to make sure I understand your problem.

You have a large number of large vectors of the form V_n = {x_i}:  n vectors of size m, where x is a number and x_i is the ith element, or equivalently, the ith dimension.

Assumptions:

  *   All vectors, V, are of the same size m.
  *   All elements, x_i, are valid numbers of the same type. No missing values, and if you are using floats, this means no NaNs.

In aggregate, the n vectors represent m independent distributions of values.

Your task is to be able to obtain m quantiles at rank r in a single query.

****
To do this, using your idea, would require vectorization of the entire sketch and not just the compactors.  The inputs are vectors, the result of operations such as getQuantile(r), getQuantileUpperBound(r), getQuantileLowerBound(r), are also vectors.

This sketch will be a large data structure, which leads to more questions ...

  *   Do you anticipate having many of these vectorized sketches operating simultaneously?
  *   Is there any requirement to store and later retrieve this sketch?
  *   Or, the nearly equivalent question: Do you require merging of these sketches (across clusters, for example)?  Which also means serialization and deserialization.

I am concerned that this vector-quantiles sketch would be limited in the sense that it may not be as widely applicable as it could be.

Our experience with real data is that it is ugly with missing values, NaN, nulls, etc.  Which means we would not be able to vectorize the compactor.  Each dimension i would need a separate independent compactor because the compaction times will vary depending on missing values or NaNs in the data.

Spacewise, I don't think having separate independent sketches for each dimension would be much smaller than vectorizing the entire sketch, because the internals of the existing sketch are already quite space efficient leveraging compact arrays, etc.

As a first step I would favor figuring out how to access the NumPy data structure on the C++ side, having individual sketches for each dimension, and doing the iterations updating the sketches in C++.   It also has the advantage of leveraging code that exists and it would automatically be able to leverage any improvements to the sketch code over time.  In addition, it could be a prototype of how to integrate other sketches into the NumPy ecosystem.

A fully vectorized sketch would be a separate implementation and would not be able to take advantage of these points.

Lee.

On Wed, May 6, 2020 at 2:47 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Lee,

I don't think there is a problem with the DataSketches library, just that it doesn't support what I am trying to do -- looking in the documentation, it only supports streams of ints or floats, and those situations work fine for me.  Here's what I did:
- began with the KLL test .py file: https://github.com/apache/incubator-datasketches-cpp/blob/master/python/tests/kll_test.py
- replaced line 30 with kll.update(np.ones(10) * randn())  to have a Numpy array of 10 identical values.
- ran the code

This leads to the following error, as expected:
TypeError: update(): incompatible function arguments. The following argument types are supported:
    1. (self: datasketches.kll_floats_sketch, item: float) -> None

Invoked with: <datasketches.kll_floats_sketch object at 0x7f1e128989d0>, array([-1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424,
       -1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424])

It's not coded to support Numpy arrays, therefore it complains.  What I would ideally like to have happen in this scenario is it would treat each element in the array as a separate stream.  Then, later when getting a given quantile, it would give 10 values, one for each stream.  I don't see an easy approach to implementing this on the Python side besides a very slow iterative approach, and admittedly my C++ is quite rusty so I haven't looked into the codebase to see how I might modify things there to support this functionality.

Re: the streaming-quantiles code being easily modified, I believe the only necessary changes would be changing the Compactor class to be a subclass of numpy.ndarray, rather than list, and implementing methods for the list-specific methods that are used, like .append().  Then, it isn't necessary to loop over the streams since we can make use of Numpy's broadcasting, which will handle the looping in its C++ code, as you mentioned.  I'll work on this and see if it really is as straight-forward as it seems.

If you have any advice on how to use DataSketches for my problem, I'm certainly open to that.

Thanks,
Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Wednesday, May 6, 2020 4:37 PM
To: Michael Himes <mh...@knights.ucf.edu>>; dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Cc: Edo Liberty <ed...@gmail.com>>; edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Michael,

Thank you for considering the DataSketches library.   I am adding this thread to our dev@datasketches.apache.org<ma...@datasketches.apache.org> so that our whole team can contribute to finding a solution for you.

WRT the error you experienced, please help us help you by sharing with us what the exact error was.

We are about to release a major upgrade to the DataSketches C++/Python product in the next few weeks.  We have fixed a number of stability issues and bugs, which may solve the problem.  Nonetheless, we want to work with you to get your problem solved.

Updating 1e5 sketches in a system is not a problem in Java or C++.   We have real-time systems today that generate and process over 1e9 sketches every day.  Unfortunately our experience tells us that looping in Python code will be 10 to 100 times slower than Java or C++.  This is because the code would have to switch from Python to C++ for every vector element.

By comparison, the streaming-quantiles code could be easily modified to use Numpy arrays and operate on vectors.

I would like to understand more about what you have in mind that would be "easily modified".

NumPy achieves its speed performance by doing all of the matrix operations in pre-compiled C++ code.  To achieve best performance, we would want to read and loop through the NumPy data structure on the C++ side leveraging the C++ DataSketches library directly.  I am not sure what would be involved to actually accomplish that.

But first we need to get your Python + NumPy code working correctly with our library so we can find out what its actual performance is.

Cheers,

Lee.





On Wed, May 6, 2020 at 12:10 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Edo, Lee,

Thanks for the prompt response.  I looked at the datasketches library, and while it seems to have a lot more features, it looks like it'll be a lot more difficult to get it to work for my desired use case.

My problem is that I need quantiles for each element of a vector (length on the order of 1e4 -- 1e5), for some finite stream of vectors (on the order of 1e6 -- 1e8).  I tried using datasketches's KLL with Numpy arrays, but it throws an error, so it doesn't seem like datasketches handles this situation currently.

To use datasketches, I think I would need to instantiate 1 object per vector element, and I suspect this will slow things down considerably due to iterating over the objects when each vector is processed.  By comparison, the streaming-quantiles code could be easily modified to use Numpy arrays and operate on vectors.  I ran a few unit tests on both codes and found equivalent behavior, as expected.

Do you have any recommendation(s) for this situation?  Are there known limitations of the streaming-quantiles code that would cause issues for my use case?  Are the other methods offered in datasketches 'better' than the KLL implemented in streaming-quantiles?  I'm quite out of my area of expertise, so I appreciate any advice you can offer, and I will of course acknowledge it in the publication.

Best,
Michael

________________________________
From: Edo Liberty <ed...@gmail.com>>
Sent: Tuesday, May 5, 2020 8:09 PM
To: Lee Rhodes <lr...@verizonmedia.com>>; Michael Himes <mh...@knights.ucf.edu>>
Cc: edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

+Lee

Hi Michael, Thanks for reaching out.
While you can certainly do that, I recommend using the python-Binded datasketches library. It will be more robust, faster, and bug free than my code :)

On Tue, May 5, 2020 at 14:11 Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Edo,

I'm currently working on a Python package for machine-learning-accelerated exoplanet modeling.  It is free and open source (see here if you're curious: https://github.com/exosports/HOMER), and it's meant purely for reproducible academic research.

I'm adding some new features to the software, and one of them requires computing quantiles for a data set that cannot fit into memory.  After searching around for different methods to do this, your KLL method seemed to be a good option in terms of speed and space requirements.

Rather than reinvent the wheel and code my own implementation of the method from scratch, I was wondering if you'd be willing to allow me to use your code?  I don't see a license, so I wanted to make sure you're okay with this.  I could implement it as a submodule within my repo, or I could only include the kll.py file and add some additional comments pointing to your repo and such, whichever you prefer.

Best,
Michael
--
From my cell phone.

Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Posted by Jon Malkin <jo...@gmail.com>.
I think it now works for quantiles, rank, pmf, and cdf.

This exercise is a good example of why my colleague operates by the motto
that if it isn't tested, it's broken. In very much related news, we need
unit tests for this thing, in either C++ or python (probably the latter
unless we move it into the core C++ part of the repo).
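
As a starting point, I'm picturing python-side tests along these lines (just a sketch; the class and method names follow this thread and may shift as the branch evolves):

import unittest
import numpy as np
from datasketches import kll_floatarray_sketches

class VectorKllTest(unittest.TestCase):
    def test_uniform_median(self):
        d = 4
        kll = kll_floatarray_sketches(200, d)
        for row in np.random.rand(2**16, d):
            kll.update(row)
        medians = np.ravel(kll.get_quantiles([0.5]))
        self.assertEqual(medians.size, d)
        for m in medians:
            self.assertAlmostEqual(m, 0.5, delta=0.05)

if __name__ == '__main__':
    unittest.main()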

  jon

On Mon, May 25, 2020 at 2:06 PM Michael Himes <mh...@knights.ucf.edu>
wrote:

> Ah gosh, that was silly on my part.
>
> So, I ran the previous code without that silly mistake, then called
> kll.get_quantiles(0.5) and it threw this error:
>
> TypeError: get_quantiles(): incompatible function arguments. The following
> argument types are supported:
>     1. (self: datasketches.kll_floatarray_sketches, fractions:
> List[float], isk: numpy.ndarray[int32] = -1) -> array
>
> Invoked with: <datasketches.kll_floatarray_sketches object at
> 0x7f610ce7de30>, 0.5
>
> I also tried kll.get_quantiles([0.5]) and the Numpy array equivalent, and
> it throws this error:
>
> ValueError: array has incorrect number of dimensions: 0; expected 1
>
> This error happens even when I do kll.get_quantiles([0.5, 0.7]) or the
> Numpy array equivalent, even though it has 1 dimension, not 0.
>
> Michael
> ------------------------------
> *From:* Jon Malkin <jo...@gmail.com>
> *Sent:* Monday, May 25, 2020 4:53 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> That's the range() command complaining -- 1e6 is a float, but range wants
> an int. It worked if I instead changed the line to
> for i in range(int(1e6)):
>
>   jon
>
> On Mon, May 25, 2020 at 1:36 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Jon,
>
> Just got around to testing it out.  Maybe I am doing something wrong here,
> but I can't get the code to work correctly.  Here's the code:
>
> import numpy as np
> from datasketches import kll_floatarray_sketches
> k = 160
> d = 3
> kll = kll_floatarray_sketches(k, d)
> for i in range(1e6):
>   kll.update(np.random.randn(d))
>
> And here's the error:
>
> TypeError: 'float' object cannot be interpreted as an integer
>
> Seems like the inputs have changed, but the inputs in the code look pretty
> similar.  Can you point out what I'm doing wrong here?
>
> Michael
> ------------------------------
> *From:* Jon Malkin <jo...@gmail.com>
> *Sent:* Friday, May 22, 2020 6:21 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Hi Michael,
>
> My default is to treat an input vector x being treated as a column vector
> -- the generic quadratic form x't A x assumes that, for instance. But might
> be an engineering thing. Following your approach for now and eventually we
> can debate whether to transpose the matrix if one dimension matches the
> number of sketches in the object but not the expected one.
>
> Anyway, I looked more at the docs and see them using unchecked references
> (after doing a bounds check) so I switched to that, and then I added in a
> check for c-style vs fortran-style indexing so that I believe it'll have
> the inner loop over the native dimension. In theory it'll walk linearly
> through the matrix. That or I got it exactly backwards and am thrashing
> some cache level, one of the two :D
>
> If you have some time, please check out the branch and play with it for a
> bit to ensure it's still behaving as you expect. Then we can figure out
> some relevant unit tests,
>
>   jon
>
>
>
> On Fri, May 22, 2020 at 7:06 AM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Jon,
>
> Those changes sound great, as long as the data is being accessed
> correctly. The pybind docs warn about accessing data through the array_t
> object since it's not guaranteed to be contiguous in memory.  Typically,
> they demonstrate accessing it through the buffer, which I followed.  But if
> this is an unnecessary step, then great.
>
> As for the 2D case, here is my line of thinking.  For 1D, we have a single
> row with d values.  So for 2D, we'd have n rows with d values, (n x d).  I
> believe that is how I coded it, but it's possible I flipped the dimensions.
>
> Michael
>
> ------------------------------
> *From:* Jon Malkin <jo...@gmail.com>
> *Sent:* Thursday, May 21, 2020 7:17 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Michael,
>
> I've restructured the object to be an actual C++ object with proper
> methods. And then I've gotten rid of all the casts to buffer in favor of
> just using the py::array_t<> that's passed in. That removes casting
> everything to double, and allows for range checks. Now an attempt to access
> sketch 7 in a 5-d array doesn't just segfault :)
>
> Looking at pybind docs a bit more, it seems there are no hard guarantees
> on data layout in memory with numpy arrays -- if you transpose one, walking
> through with a pointer will return items In the wrong order. So update()
> ends up using items.at() instead (more on that in a moment). The whole thing is probably also
> copying values around more than necessary. Anyway, we can look at ways to
> optimize such things eventually, but for now I'm working on ensuring
> correctness and at least somewhat graceful failure.
>
> Anyway, item input order. If we have 1-d input, we implicitly assume we
> want d updates, one for each dimension in the object. It seems like the
> default for numpy is row-major order, which makes sense given C beneath the
> hood. But for inputting n points at a time, do you expect the matrix to be
> (d x n) or (n x d)?
>
>   jon
>
> On Tue, May 19, 2020, 5:20 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Re: the template type A, I set that for the Python array data type.  A
> Python float is 64 bits, so that is a C++ double.  I thought it was
> necessary to set the py::array_t data type since I think it's a template,
> but I could be mistaken.
>
> Michael
>
> ------------------------------
> *From:* leerho <le...@gmail.com>
> *Sent:* Tuesday, May 19, 2020 7:46 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Excellent work!
>
> On Tue, May 19, 2020 at 4:04 PM Jon Malkin <jo...@gmail.com> wrote:
>
> I also used k=160, so in this case we matched nicely. And the bunches of
> 2^5 or 2^7 you were testing is exactly what I meant when referring to
> batched inputs. So that's good news.
>
> I'll take a more careful look through the code -- there was something with
> update using arrays of templated type A which was always cast to double,
> for instance. But this is certainly promising.
>
>   jon
>
> On Tue, May 19, 2020 at 3:32 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Great tests (especially with the ordering), Jon!
>
> I did some scaling tests for dimensionality (1, 10, and 100), and this is
> where I think the Numpy version shows its benefits.  I performed a test
> similar to your setup:
> - each sketch has k = 160 (unsure what you used for this value, if it
> matters)
> - 2^25 draws from a normalized Gaussian distribution (numpy.random.normal)
> - get_quantiles(0.5)
>
> d=1    -- 84 s (this is the 123 s case you ran)
> d=10   -- 88 s
> d=100  --  294 s
> d=1000 -- 2298 s (did this one for fun, but there is a lot of variability
> in runtime)
>
> Note that I did not use a single-value method, just the Numpy version.
> Also, I checked the compute cost of the Python loop, and it's about 1
> second, so most of that ~80 seconds is the communication between Python and
> C++.  The scaling relation looks to be better than linear, but there needs
> to be a few more tests here to really determine that.
>
> But, as Lee pointed out, there is non-negligible overhead from crossing
> the bridge between Python and C++.  It's small, but when doing it 2^25
> times it adds up.  The Numpy implementation allows you to cross that bridge
> much less often, albeit at the cost of some extra time programming that
> part.  If I set up a queue that holds 2^5 values and then updates it, it's
> quite a bit better.  Here are the results for the same dimensions as before:
>
> d=1   -- 8 s
> d=10  -- 31 s
> d=100 -- 257 s
>
> So, even with a small queue of 32 values, we see that a single sketch
> using kll_sketches is faster than a kll_sketch by a factor of 2-3.  And
> with the batch set to 2^7 values (this is how I use it in my project):
> d=1   -- 4.2 s
> d=10  -- 27 s
> d=100 -- 251 s
>
> The speed gain doesn't seem to scale with dimensionality, but I think that
> has more to do with the compute overhead of generating the data since Numpy
> tends to be faster when working in 1D vs multiple dimensions.  But we can
> see that it's possible to get runtimes much closer to C++ runtimes than
> would be expected.
>
> Michael
> ------------------------------
> *From:* Jon Malkin <jo...@gmail.com>
> *Sent:* Tuesday, May 19, 2020 4:58 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Well, one thought was maybe we could always use the vectorized kll in
> python and make it (relatively) easy to have it work with only 1 dimension.
> It looks like there's still a non-trivial performance hit from that. But
> wow.. I realized I could try something simple like reversing the
> declaration order of single-update vs vector-update in the wrapper class.
> And that dropped it to 35s!
>
> With that, it may be worth exploring a unified wrapper that handles single
> items or vectors.
>
>   jon
>
> On Tue, May 19, 2020 at 1:52 PM leerho <le...@gmail.com> wrote:
>
> We had a similar issue in Java trying to use JNI to access C code.  Every
> transition across the "boundary" between Java and C took from 10 to 100
> microseconds.  This made the JNI option pretty useless from our
> standpoint.
>
> I don't know python that well, but I could well imagine that there may be
> a similar issue here in moving data between Python and C++.
>
> That being said, compared to brute-force computation of these types of
> queries in Python vs using even these (what we consider slow performing)
> sketches in Python still may be a huge win.
>
> Lee.
>
>
>
> On Tue, May 19, 2020 at 1:28 PM Jon Malkin <jo...@gmail.com> wrote:
>
> I tried comparing the performance of the existing floats sketch vs the new
> thing with a single dimension. And then I made a second update method that
> handles a single item rather than creating an array of length 1 each time.
> Otherwise, the scripts were as identical as possible. I fed in 2^25
> gaussian-distributed values and queried for the mean to force some
> computation on the sketch. I think get_quantile(0.5) vs
> get_quantiles(0.5)[0][0] was the only difference,
>
> Existing kll_floats_sketch: 31s
> kll_floatarray_sketches: 123s
> with single-item update: 80s
>
> Same test in c++: 1.7s  (I can get it to 1.4s but that's using a worse RNG
> so this seemed more fair)
>
> I didn't try anything with trying to batch updates, even though in theory
> the new object can support that. This was more a test to see the
> performance impact of using it for all kll sketches.
>
> At some level, if you're already ok taking the speed hit for python vs C++
> then maybe it doesn't matter. But >2x still seems significant to me.
>
>   jon
>
> On Thu, May 14, 2020 at 6:54 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Great, I'll be submitting the pull request shortly.  The codebase I'm
> working with doesn't have any of the changes made in the past week or so,
> hopefully that isn't too much of a hassle to merge.
>
> As an aside, my employer encourages us to contribute code to libraries
> like this, so I'm happy to work on additional features for the Python
> interface as needed.
>
> Michael
> ------------------------------
> *From:* Jon Malkin <jo...@gmail.com>
> *Sent:* Thursday, May 14, 2020 6:56 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> We've been polishing things up for a release, so that was one of several
> things that we fixed over the last several days. Thank you for finding it!
>
> Anyway, if you're generally happy with the state of things (and are
> allowed to under any employment terms), I'd encourage you to create a pull
> request to merge your changes into the main repo. It doesn't need to be
> perfect as we can always make changes as part of the PR review or
> post-merge.
>
> Thanks,
>   jon
>
>
> On Mon, May 11, 2020 at 2:25 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Thanks for taking a look, Jon.
>
> I pushed an update that addresses 2 & 4.
>
> #3 is actually something I had a question about. I've tested passing
> numpy.nan into the update function, and it doesn't appear to break anything
> (min, max, etc all still work correctly).  However, the reported number of
> items per sketch counts the nan entries.  Is this the expected behavior, or
> should the get_n() method return a number that does not count the nans it
> has seen?  I expected the latter, so I'm worried that numpy's nan is being
> treated differently.
>
> Michael
> ------------------------------
> *From:* Jon Malkin <jo...@gmail.com>
> *Sent:* Monday, May 11, 2020 4:32 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> I didn't look in super close detail, but the code overall looks pretty
> good. Comments are below.
>
> Note that not all of these necessarily need changes or replies. I'm just
> trying to document things we'll want to think about for keeping the library
> general-purpose (and we can always make changes after merging, of course).
>
> 1. I worry the name kll_sketches is confusingly similar to kll_sketch.
> Maybe vector_kll_sketches? But if there's a way to extend KLL in the future
> to operate on an entire vector at a time (vs treating each dimension
> independently) that'd become confusing. I think an inherently vectorized
> version would be a very different beast, but I always worry I'm not being
> imaginative enough. If merging into the Apache codebase, I'd probably wait
> to see what the file looks like with the renaming before a final decision
> on moving to its own file.
>
> 2. What happens if the input to update() has >2 dimensions? If that'd be
> invalid, we should explicitly check and complain. If it'll Do The Right
> Thing by operating on the first 2 dimensions (meaning correct indices)
> that's fine, but otherwise should probably complain.
>
> 3. Can this handle sparse input vectors? Not sure how important that is in
> general, even if your project doesn't require it. kll_sketch will ignore
> NaNs, so those appearing would mean the number of items per sketch can
> already differ.
>
> 4. I'd probably eat the very slightly increased space and go with 32 bits
> for the number of dimensions (aka number of sketches). If trying to look at
> a distribution of values for some machine learning application, it'd be
> easy to overflow 65k dimensions for some tasks.
>
> 5. I imagine you've realized that it's easiest to do unit tests from
> python in this case. That's another advantage of having this live in the
> wrapper.
>
> 6. Finally, that assert issue is already obsolete :). Asserts were
> converted to if/throw exceptions late last week. It'll be flagged as a
> conflict in merging, so no worries for now.
>
> Looking good at this point. And as I said, not all of these need changes
> or comments from you.
>
>   jon
>
> On Mon, May 11, 2020 at 7:09 AM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Understood, I went ahead and moved the new class to the kll_wrapper.cpp
> file -- I'll leave it to you to decide if it's better as its own file.
>
> Also, while gcc 7.4.0 compiles the code without issue, using gcc 7.5.0
> throws errors regarding the assert calls in kll_sketch_impl.hpp.  I added
> an include of assert.h there and then it compiled without issue.  It's
> possible that other compilers will also complain about that, so maybe this
> is a good update to the main branch.
>
> Michael
> ------------------------------
> *From:* Jon Malkin <jo...@gmail.com>
> *Sent:* Sunday, May 10, 2020 10:47 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> My only comment without having looked at actual code is that the new class
> would be more appropriate in the python wrapper. Maybe even drop it in as
> its own file, as that would decrease recompile time a bit when debugging
> (that's pybind's suggestion, anyway). Probably not a huge difference with
> how light these wrappers are.
>
> If this is something that becomes widely used, to where we look at pushing
> it into the base library, we'd look at whether we could share any data
> across sketches. But we're far from that point currently. It'd be a nice
> problem to have to consider that.
>
>   jon
>
> On Sun, May 10, 2020, 7:33 PM leerho <le...@gmail.com> wrote:
>
> Michael,  this has been a great interchange and certainly will allow us to
> move forward more quickly.
>
> Thank you for working on this on a Mother's Day Sunday!
>
> I'm sure Alex and Jon may have more questions, when they get a chance to
> look at it starting tomorrow.
>
> Cheers, and be safe and well!
>
> Lee.
>
> On Sun, May 10, 2020 at 6:25 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Re: testing, so far I've just done glorified unit tests for uniform and
> normal distributions of varying sizes.  I plan to do some timing tests vs
> the existing single-sketch Python class to see how it compares for 1, 10,
> and 100 streams.
>
> 1. That makes sense.  One option to allow full Numpy compatibility but
> without requiring a Python user to use Numpy would be to return everything
> as lists, rather than Numpy arrays.  Numpy users could then convert those
> lists into arrays, and non-Numpy users would be unaffected (aside from
> needing the pybind11/numpy.h header).  Alternatively, some flag could be
> set when instantiating the object that would control whether things are
> returned as lists or arrays, though this still requires the numpy.h header
> file.
>
> 2. I didn't change the kll_sketch code, I only defined a new (wrapper)
> class called kll_sketches, which spawns a user-specified number of
> sketches.  Each of those sketches is a kll_sketch object and uses all of
> the existing code for that.  For fast execution in Python, the parallel
> sketches must be spawned in C++, but the existing Python object could only
> spawn a single sketch since it wraps the kll_sketch class.  Perhaps the
> kll_sketches class would be better placed in the python/src/kll_wrapper.cpp
> file?  I suppose you wouldn't need this class if you weren't using Python.
>
> 3. Yes, SerDe is very straight-forward here.  I've marked some stuff as
> todo's, and that is one of them -- the plan is to do like you described and
> call the relevant kll_sketch method on each of the sketches and return that
> to Python in a sensible format.  For deserialization, it would just iterate
> through them and load them into the kll_sketches object.  I don't require
> it for my project, so I didn't bother to wrap that yet -- I'll take a look
> sometime this week after I finish my work for the day, shouldn't take long
> to do.
>
> 4. That makes sense.  Does using Numpy complicate that at all?  My thought
> is that since under the hood everything is using the existing kll_sketch
> class, it would have full compatibility with the rest of the library (once
> SerDe is added in).
>
> Michael
> ------------------------------
> *From:* leerho <le...@gmail.com>
> *Sent:* Sunday, May 10, 2020 8:42 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Thanks for the link to your code.  My colleagues, Jon and Alex, will take
> a closer look this next week.  They wrote this code so they are much closer
> to it than I.
>
> What you have done so far makes sense for you as you want to get this
> working in the NumPy environment as quickly as possible.  As soon as we
> start thinking about incorporating this into our library other concerns
> become important.
>
> 1. Adding API calls is the recommended way to add functionality (like
> NumPy) to a library.  We cannot change API calls in a way that is only
> useful with NumPy, because it would seriously impact other users of the
> library that don't need NumPy.  If both sets of calls cannot simultaneously
> exist in the same sketch API, then we need to consider other alternatives.
>
> 2.  Based on our previous discussions, I didn't envision that you would
> have to change the kll_sketch code itself other than perhaps a "wrapper"
> class that enables vectorized input to a vector of sketches and a
> vectorized get result that creates a vector result from a vector of
> sketches.  This would isolate the changes you need for NumPy from the
> sketch itself.  This is also much easier to support, maintain and debug.
>
> 3. If you don't change the internals of the sketch then SerDe becomes
> pretty straightforward. I don't know if you need a single serialization
> that represents a full vector of sketches,  but if you do, then I would
> just iterate over the individual serdes and figure out how to package it.
> I really don't think you want to have to rewrite this low-level stuff.
>
> 4. Binary compatibility is critically important for us and I think will be
> important for you as well.  There are two dimensions of binary
> compatibility: history and language.  This means that a kll sketch
> serialized from Java can be successfully read by C++ and vice versa.
> Similarly, a kll sketch serialized today will be able to be read many years
> from now.     Another aspect of this would mean being able to collect, say,
> 100 sketches that were not created using the NumPy version, and being able
> to put them together in a NumPy vector; and vice versa.
>
> I hope all of this makes sense to you.
>
> Cheers,
>
> Lee.
>
>
>
> On Sun, May 10, 2020 at 4:21 PM leerho <le...@gmail.com> wrote:
>
> Michael,
> This is great!  What testing have you been able to do so far?
>
>
> On Sun, May 10, 2020 at 3:31 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Lee,
>
> Thanks for all of that information, it's quite helpful to get a better
> understanding of things.
>
> I've put the code on Github if you'd like to take a look:
> https://github.com/mdhimes/incubator-datasketches-cpp
>
> Changes are
> - new class in kll/include/kll_sketch.hpp, w/ associated constructor in
> kll/include/kll_sketch_impl.hpp.  This class spawns a specified number of
> sketches.
> - new Python interface functions in python/src/kll_wrapper.cpp
>
> The only new dependency introduced is the pybind11/numpy.h header file.
> The new Numpy-compatible Python classes retain identical functionality to
> the existing classes (with minor changes to method names, e.g.,
> get_min_value --> get_min_values), except that I have not yet implemented
> merging or (de)serialization.  These would be straight-forward to
> implement, if needed.
>
> Re: characterization tests, I'll take a look at those tests you linked to
> and see about running them, time and compute resources permitting.
>
> Michael
> ------------------------------
> *From:* leerho <le...@gmail.com>
> *Sent:* Sunday, May 10, 2020 5:32 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Michael,
>
> Is there a place on GitHub somewhere where I could look at your code so
> far?  The reason I ask, is before you do a PR, we would like to determine
> where a contribution such as this should be placed.
>
> Our library is split up among different repositories, determined by
> language and dependencies.  This keeps the user downloads smaller and more
> focused.   We have two library repos for the core sketch algorithms, one
> for Java and one for C++/Python, where the dependencies are very lean,
> which simplifies integration into other systems.  We have separate repos
> for adaptors, which depend on one of the core repos. On the Java side, we
> have separate repos for adaptors for Apache Hive and Apache Pig, as the
> dependencies for each of these are quite large.  For C++, we have a
> dedicated repo for the adaptors for PostgreSQL.
>
> Some of our adaptors are hosted with the target system.  For example, our
> Druid adaptors were contributed directly into Apache Druid.
>
> I assume your code has dependencies on Python, NumPy and DataSketches-cpp.
> It is not clear to me at the moment whether we should create a separate
> repo for this or have a separate group of directories in our cpp repo.
>
> ****
> We have a separate repo for our characterization code, which is not
> formally "released" as an Apache release.  It exists because we want others
> to be able to reproduce (or challenge) our claims of speed performance or
> accuracy.  It is the one repo where we have all languages and many
> different dependencies.  The coding style is not as rigorous or as well
> documented as our repos that do have formal releases.
>
> Characterization testing is distinctly different from Unit Tests, which
> basically checks all the main code paths and makes sure that the program
> works as it should.  The key metric is code coverage and Unit Tests should
> be fast as it is run on every check-in of new code.  Characterization is
> also different from Integration Testing, which is testing how well the code
> works when integrated into larger systems.
>
> Characterization tests are unique to our kind of library. Because our
> algorithms are probabilistic in nature, in order to verify accuracy or
> speed performance we need to run many thousands of trials to eliminate
> statistical noise in the results.  And when the data is large, this can
> take a long time.  You can peruse our website for many examples as all the
> plots result from various characterization studies.  What appears on the
> website is but a small fraction of all the testing we have done.
>
> There are no "standard" tests as every sketch is different so we have to
> decide what is important to measure for a particular sketch, but the basic
> groups are *speed* and *accuracy*.
>
> For speed there are many possible measurements, but the basic ones are
> update speed, merge speed, Serialization / Deserialization speed, get
> estimate or get result speeds.
>
> For accuracy we want to validate that the sketch is performing within the
> bounds of the theoretical error distribution.  We want to measure this
> accuracy in the context of a stand-alone, purely streaming sketch and also
> in the context of merging many sketches together.
>
> We also try to do these same tests comparing the results against other
> alternatives users might have.  We have performed these same
> characterizations on other publically available sketches as well as against
> traditional, brute-force approaches to solving the same problem.
>
> For the solution you have developed, we would depend on you to decide what
> properties would be most important to characterize for users of this
> solution.  It should be very similar to what you would write in a paper
> describing this solution;  you want to convince the reader that this is
> very useful and why.
>
> Since the first sketch you have leveraged is the KLL quantiles sketch, I
> would think you would want some characterizations similar to what we did
> for our studies
> <https://datasketches.apache.org/docs/Quantiles/KLLSketch.html>
> comparing our older quantiles sketch and the KLL sketch.
>
> ****
> For the Java characterization tests, we have "standardized" on having
> small configuration files which define the key parameters of the test.
> These are simple text files
> <https://github.com/apache/incubator-datasketches-characterization/tree/master/src/main/resources>
> of key-value pairs.  We don't have any centralized definition of these
> pairs, just that they are human readable and intelligible.  They are
> different for each type of sketch.
>
> For the C++ tests, we don't have a collection of config files yet (this is
> one of our TODOs), but the same kind of parameters are set in the code
> itself.
>
> We will likely want to set up a separate directory for your
> characterization tests.
>
> I hope you find this helpful.
>
> Cheers,
>
> Lee.
>
> On Sun, May 10, 2020 at 10:05 AM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> The code is in a good state now.  It can take individual values, lists, or
> Numpy arrays as input, and it returns back Numpy arrays.  There are some
> additional features, like being able to specify which sketches the user
> wants to, e.g., get quantiles for.
>
> But, I have only done minor testing with uniform and normal
> distributions.  I'd like to put it through more extensive testing (and some
> documentation) before releasing it, and it sounds like your
> characterization tests are the way to go -- it's not science if it's not
> reproducible!  Is there a standard set of tests for this purpose?  If not,
> are there standard tests that have been used for the existing codebase?
>
> Michael
> ------------------------------
> *From:* leerho <le...@gmail.com>
> *Sent:* Saturday, May 9, 2020 7:21 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> This is great.  The first step is to get your project working!  Once you
> think you are ready, it would be really useful if you could do some
> characterization testing in the NumPy environment. Characterization tests
> are what we run to fully understand how a sketch performs over a range of
> parameters and using thousands to millions of trials.  You can see some of
> the accuracy and speed performance plots of various sketches on our
> website.  Sometimes these can take hours to run.  We typically use
> synthetic data to drive our characterization tests to make them
> reproducible.
>
> Real data can also be used and one comparison test I would recommend is
> comparing how long it takes to get approximate results using sketches
> versus how long it would take to get exact results using brute force
> methods.  The bigger the data set is the better :)
>
> We don't have much experience with NumPy so this will be a new environment
> for us.  But before you get too deep into this please get us involved.  We
> have been characterizing these streaming algorithms for a number of years,
> and would like to help you.
>
> Cheers,
>
> Lee.
>
> On Sat, May 9, 2020 at 2:18 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> I'm not quite sure what being a committer entails, but yeah I'm happy to
> contribute.  I can't commit a lot of time to working on it, but with how
> things went for KLL I don't think it will take a lot of time for the other
> sketches if they are formatted in a similar manner.  Getting this library
> integrated into numpy/scipy would be awesome, I'm sure I could get some
> others in my field to begin using it.
>
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Saturday, May 9, 2020 5:06 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>; dev@datasketches.apache.org
> <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> This is just awesome!   Would you be interested in becoming a committer on
> our project?  It is not automatic, but we could work with you to bring you
> up to speed on the other sketches in the library.  If you could help us
> integrate DataSketches into NumPy and possibly SciPy (not sure if this is
> necessary) it would be a very significant contribution and we would
> definitely want you to be part of our community!
>
> Thanks,
>
> Lee.
>
> On Sat, May 9, 2020 at 1:41 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Lee,
>
> Thanks for the notice, I went ahead and subscribed to the list.
>
> As for Jon's email, this is actually what I have currently implemented!
> Once I finish ironing out a couple improvements, I'm going to move some
> code around to follow the existing coding style, put it on Github, and
> submit a pull request.
>
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Saturday, May 9, 2020 4:22 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>
> *Subject:* Fwd: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Hi Michael,
> I don't think you saw this email as I doubt you are subscribed to our
> dev@datasketches.apache.org email list.
>
> We would like to have you as part of our larger community, as others might
> also have suggestions on how to move your project forward.
> You can subscribe by sending an empty email to
> dev-subscribe@datasketches.apache.org.
>
> Lee.
>
> ---------- Forwarded message ---------
> From: *Jon Malkin* <jo...@gmail.com>
> Date: Thu, May 7, 2020 at 4:11 PM
> Subject: Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
> To: <de...@datasketches.apache.org>
> Cc: Lee Rhodes <lr...@verizonmedia.com>, Edo Liberty <
> edo.liberty@gmail.com>, edo@edoliberty.com <ed...@edoliberty.com>
>
>
> We're using pybind11 to get a C++ interface with python (vs raw C). The
> wrappers themselves are quite thin, but they do have examples of calling
> functions defined in the wrapper as opposed to only the sketch object.
>
> I believe the easiest way to do this will be to define a pretty simple C++
> object and create a pybind wrapper for it.  That object would contain a
> std::vector<kll_sketch>.  Then you'd define an update method for your
> custom object that iterates through a numpy array and calls update() on the
> appropriate sketch. You'd also want to define something similar for
> get_quantile() or whatever other methods you need that iterates through
> that vector of sketches and returns the result in a numpy array.
>
> That's a pretty lightweight object. And then you'd use a similar thin
> pybind wrapper around it to make it play nicely with python. Since our C++
> library is just templates, you'd end up with a free-standing library, with
> no requirement that the base datasketches library be involved.
>
>   jon
>
> On Thu, May 7, 2020 at 1:08 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> I would be happy to share whatever I come up with (if anything).  The lack
> of a Numpy/Scipy implementation is what led me to the DataSketches library,
> it would be very useful to myself and others if it were a part of
> Numpy/Scipy.
>
> For what it's worth, passing in a Numpy array and manipulating it from the
> C++ side is quite easy.  On the other hand, figuring out how to spawn m
> sketches and pass the values along to that looks like it'll be more
> challenging, there is a lot of code here and it'll take some time for me to
> familiarize myself with it.
>
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Thursday, May 7, 2020 12:00 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>
> *Cc:* Edo Liberty <ed...@gmail.com>; dev@datasketches.apache.org <
> dev@datasketches.apache.org>; edo@edoliberty.com <ed...@edoliberty.com>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> If you do figure out how to do this, it would be great if you could share
> it with us.  We would like to extend  it to other sketches and submit it as
> an added functionality to NumPy.  I have been looking at the NumPy and
> SciPy libraries and have not found anything close to what we have.
>
> Lee.
>
>
> On Thu, May 7, 2020 at 7:08 AM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Lee, Jon,
>
> Thanks for the information.  I tried to vectorize things this morning and
> ran into that exact problem -- since the offsets can differ, it leads to
> slices of different lengths, which wouldn't be possible to store as a
> single Numpy array.
>
> Lee, your understanding of my problem is spot on.  n vectors of size m,
> where all m elements of each vector are a float (no NaNs or missing
> values).  I am interested in quantiles at rank r for each of the m
> streams.  Only 1 sketch will operate simultaneously, saving/loading the
> sketch is not required (though it would be a nice feature), and sketches
> would not need to be merged (no serialization/deserialization).
>
> Not surprisingly, it looks like your original suggestion of handling this
> on the C++ side is the way to go.  Once I have time to dive into the code,
> my plan is to write something that implements what you described in the
> earlier email.
>
> Thanks,
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Wednesday, May 6, 2020 10:43 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>
> *Cc:* dev@datasketches.apache.org <de...@datasketches.apache.org>; Edo
> Liberty <ed...@gmail.com>; edo@edoliberty.com <ed...@edoliberty.com>
>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Michael,
>
> One of my colleagues, Jon Malkin, pointed out that the vector-KLL will not
> work for another reason and that is for each dimension, choosing whether to
> delete the odd or even values in the compactor must be random and
> independent of the other dimensions.  Otherwise you might get unwanted
> correlation effects between the dimensions.
>
> This is another argument that you should have independent compactors for
> each dimension.  So you might as well stick with individual sketches for
> each dimension.
>
> Lee.
>
> On Wed, May 6, 2020 at 4:39 PM Lee Rhodes <lr...@verizonmedia.com>
> wrote:
>
> Michael,
>
> Allow me to back up for a moment to make sure I understand your problem.
>
> You have a large number of large vectors of the form *V_n = {x_i}:*  *n*
> vectors of size *m*, where *x* is a *number* and *x_i* is the *i*th
> element, or equivalently, the *i*th dimension.
>
> Assumptions:
>
>    - All vectors, *V*, are of the same size *m.*
>    - All elements, *x_i*, are valid numbers of the same type. No missing
>    values, and if you are using *floats*, this means no *NaN*s.
>
> In aggregate, the *n* vectors represent *m* *independent* distributions
> of values.
>
> Your task is to be able to obtain *m* quantiles at rank *r* in a single
> query.
>
> ****
> To do this, using your idea, would require vectorization of the entire
> sketch and not just the compactors.  The inputs are vectors, the result of
> operations such as getQuantile(r), getQuantileUpperBound(r),
> getQuantileLowerBound(r), are also vectors.
>
> This sketch will be a large data structure, which leads to more questions
> ...
>
>    - Do you anticipate having many of these vectorized sketches operating
>    simultaneously?
>    - Is there any requirement to store and later retrieve this sketch?
>    - Or, the nearly equivalent question: Do you require merging of these
>    sketches (across clusters, for example)?  Which also means serialization
>    and deserialization.
>
> I am concerned that this vector-quantiles sketch would be limited in the
> sense that it may not be as widely applicable as it could be.
>
> Our experience with real data is that it is ugly with missing values, NaN,
> nulls, etc.  Which means we would not be able to vectorize the compactor.
> Each dimension *i* would need a separate independent compactor because
> the compaction times will vary depending on missing values or NaNs in the
> data.
>
> Spacewise, I don't think having separate independent sketches for each
> dimension would be much smaller than vectorizing the entire sketch, because
> the internals of the existing sketch are already quite space efficient
> leveraging compact arrays, etc.
>
> As a first step I would favor figuring out how to access the NumPy data
> structure on the C++ side, having individual sketches for each
> dimension, and doing the iterations updating the sketches in C++.   It also
> has the advantage of leveraging code that exists and it would automatically
> be able to leverage any improvements to the sketch code over time.  In
> addition, it could be a prototype of how to integrate other sketches into
> the NumPy ecosystem.
>
> A fully vectorized sketch would be a separate implementation and would not
> be able to take advantage of these points.
>
> Lee.
>
>
> On Wed, May 6, 2020 at 2:47 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Lee,
>
> I don't think there is a problem with the DataSketches library, just that
> it doesn't support what I am trying to do -- looking in the documentation,
> it only supports streams of ints or floats, and those situations work fine
> for me.  Here's what I did:
> - began with the KLL test .py file:
> https://github.com/apache/incubator-datasketches-cpp/blob/master/python/tests/kll_test.py
> - replaced line 30 with kll.update(np.ones(10) * randn())  to have a Numpy
> array of 10 identical values.
> - ran the code
>
> This leads to the following error, as expected:
> TypeError: update(): incompatible function arguments. The following
> argument types are supported:
>     1. (self: datasketches.kll_floats_sketch, item: float) -> None
>
> Invoked with: <datasketches.kll_floats_sketch object at 0x7f1e128989d0>,
> array([-1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424,
>        -1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424])
>
> It's not coded to support Numpy arrays, therefore it complains.  What I
> would ideally like to have happen in this scenario is it would treat each
> element in the array as a separate stream.  Then, later when getting a
> given quantile, it would give 10 values, one for each stream.  I don't see
> an easy approach to implementing this on the Python side besides a very
> slow iterative approach, and admittedly my C++ is quite rusty so I haven't
> looked into the codebase to see how I might modify things there to support
> this functionality.
>
> Re: the streaming-quantiles code being easily modified, I believe the only
> necessary changes would be changing the Compactor class to be a subclass of
> numpy.ndarray, rather than list, and implementing the list-specific
> methods that are used, like .append().  Then, it isn't
> necessary to loop over the streams since we can make use of Numpy's
> broadcasting, which will handle the looping in its C++ code, as you
> mentioned.  I'll work on this and see if it really is as straight-forward
> as it seems.
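A toy sketch of that vectorized-compactor idea (an illustration under assumptions, not the actual streaming-quantiles Compactor): keep a (capacity x d) buffer, sort each column, and keep the odd- or even-ranked half independently per column, which is the independence requirement Lee raised earlier in the thread.

    import numpy as np

    class ToyVectorCompactor:
        """Toy d-column compactor buffer; not the streaming-quantiles Compactor itself."""
        def __init__(self, d, capacity=128, rng=None):
            self.d = d
            self.capacity = capacity
            self.rng = rng if rng is not None else np.random.default_rng()
            self.buf = np.empty((0, d))

        def append(self, rows):
            # rows has shape (d,) or (n, d); all d streams advance together
            self.buf = np.vstack([self.buf, np.atleast_2d(rows)])

        def is_full(self):
            return self.buf.shape[0] >= self.capacity

        def compact(self):
            # Sort each column independently, then keep the odd- or even-ranked half
            # of each column, chosen independently per column so the streams stay
            # uncorrelated.
            n = (self.buf.shape[0] // 2) * 2            # use an even number of rows
            srt = np.sort(self.buf[:n], axis=0)
            keep_odd = self.rng.integers(0, 2, size=self.d).astype(bool)
            promoted = np.where(keep_odd, srt[1::2, :], srt[0::2, :])
            self.buf = self.buf[n:]
            return promoted                             # caller promotes these with doubled weight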
>
> If you have any advice on how to use DataSketches for my problem, I'm
> certainly open to that.
>
> Thanks,
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Wednesday, May 6, 2020 4:37 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>; dev@datasketches.apache.org
> <de...@datasketches.apache.org>
> *Cc:* Edo Liberty <ed...@gmail.com>; edo@edoliberty.com <
> edo@edoliberty.com>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Michael,
>
> Thank you for considering the DataSketches library.   I am adding this
> thread to our dev@datasketches.apache.org so that our whole team can
> contribute to finding a solution for you.
>
> WRT the error you experienced, please help us help you by sharing with us
> what the exact error was.
>
> We are about to release a major upgrade to the DataSketches C++/Python
> product in the next few weeks.  We have fixed a number of stability issues
> and bugs, which may solve the problem.  Nonetheless, we want to work with
> you to get your problem solved.
>
> Updating 1e5 sketches in a system is not a problem in Java or C++.   We
> have real-time systems today that generate and process over 1e9 sketches
> every day.  Unfortunately our experience tells us that looping in Python
> code will be 10 to 100 times slower than Java or C++.  This is because the
> code would have to switch from Python to C++ for every vector element.
>
> By comparison, the streaming-quantiles code could be easily modified to
> use Numpy arrays and operate on vectors.
>
>
> I would like to understand more about what you have in mind that would be
> "easily modified".
>
> NumPy achieves its speed performance by doing all of the matrix operations
> in pre-compiled C++ code.  To achieve best performance, we would want to
> read and loop through the NumPy data structure on the C++ side leveraging
> the C++ DataSketches library directly.  I am not sure what would be
> involved to actually accomplish that.
>
> But first we need to get your Python + NumPy code working correctly with
> our library so we can find out what its actual performance is.
>
> Cheers,
>
> Lee.
>
>
>
>
>
> On Wed, May 6, 2020 at 12:10 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Edo, Lee,
>
> Thanks for the prompt response.  I looked at the datasketches library, and
> while it seems to have a lot more features, it looks like it'll be a lot
> more difficult to get it to work for my desired use case.
>
> My problem is that I need quantiles for each element of a vector (length
> on the order of 1e4 -- 1e5), for some finite stream of vectors (on the
> order of 1e6 -- 1e8).  I tried using datasketches's KLL with Numpy arrays,
> but it throws an error, so it doesn't seem like datasketches handles this
> situation currently.
>
> To use datasketches, I think I would need to instantiate 1 object per
> vector element, and I suspect this will slow things down considerably due
> to iterating over the objects when each vector is processed.  By
> comparison, the streaming-quantiles code could be easily modified to use
> Numpy arrays and operate on vectors.  I ran a few unit tests on both codes
> and found equivalent behavior, as expected.
>
> Do you have any recommendation(s) for this situation?  Are there known
> limitations of the streaming-quantiles code that would cause issues for my
> use case?  Are the other methods offered in datasketches 'better' than the
> KLL implemented in streaming-quantiles?  I'm quite out of my area of
> expertise, so I appreciate any advice you can offer, and I will of course
> acknowledge it in the publication.
>
> Best,
> Michael
>
> ------------------------------
> *From:* Edo Liberty <ed...@gmail.com>
> *Sent:* Tuesday, May 5, 2020 8:09 PM
> *To:* Lee Rhodes <lr...@verizonmedia.com>; Michael Himes <
> mhimes@knights.ucf.edu>
> *Cc:* edo@edoliberty.com <ed...@edoliberty.com>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> +Lee
>
> Hi Michael, Thanks for reaching out.
> While you can certainly do that, I recommend using the python-Binded
> datasketches library. It will be more robust, faster, and bug free than my
> code :)
>
> On Tue, May 5, 2020 at 14:11 Michael Himes <mh...@knights.ucf.edu> wrote:
>
> Hi Edo,
>
> I'm currently working on a Python package for machine-learning-accelerated
> exoplanet modeling.  It is free and open source (see here if you're curious
> https://github.com/exosports/HOMER
> ),
> and it's meant purely for reproducible academic research.
>
> I'm adding some new features to the software, and one of them requires
> computing quantiles for a data set that cannot fit into memory.  After
> searching around for different methods to do this, your KLL method seemed
> to be a good option in terms of speed and space requirements.
>
> Rather than reinvent the wheel and code my own implementation of the
> method from scratch, I was wondering if you'd be willing to allow me to use
> your code?  I don't see a license, so I wanted to make sure you're okay
> with this.  I could implement it as a submodule within my repo, or I could
> only include the kll.py file and add some additional comments pointing to
> your repo and such, whichever you prefer.
>
> Best,
> Michael
>
> --
> From my cell phone.
>
>

Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Posted by Michael Himes <mh...@knights.ucf.edu>.
Ah gosh, that was silly on my part.

So, I ran the previous code without that silly mistake, then called kll.get_quantiles(0.5) and it threw this error:

TypeError: get_quantiles(): incompatible function arguments. The following argument types are supported:
    1. (self: datasketches.kll_floatarray_sketches, fractions: List[float], isk: numpy.ndarray[int32] = -1) -> array

Invoked with: <datasketches.kll_floatarray_sketches object at 0x7f610ce7de30>, 0.5

I also tried kll.get_quantiles([0.5]) and the Numpy array equivalent, and it throws this error:

ValueError: array has incorrect number of dimensions: 0; expected 1

This error happens even when I do kll.get_quantiles([0.5, 0.7]) or the Numpy array equivalent, even though it has 1 dimension, not 0.

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>
Sent: Monday, May 25, 2020 4:53 PM
To: dev@datasketches.apache.org <de...@datasketches.apache.org>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

That's the range() command complaining -- 1e6 is a float, but range wants an int. It worked if I instead changed the line to
for i in range(int(1e6)):

  jon

On Mon, May 25, 2020 at 1:36 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Jon,

Just got around to testing it out.  Maybe I am doing something wrong here, but I can't get the code to work correctly.  Here's the code:

import numpy as np
from datasketches import kll_floatarray_sketches
k = 160
d = 3
kll = kll_floatarray_sketches(k, d)
for i in range(1e6):
  kll.update(np.random.randn(d))

And here's the error:

TypeError: 'float' object cannot be interpreted as an integer

Seems like the inputs have changed, but the inputs in the code look pretty similar.  Can you point out what I'm doing wrong here?

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Friday, May 22, 2020 6:21 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Hi Michael,

My default is to treat an input vector x as a column vector -- the generic quadratic form x^T A x assumes that, for instance. But that might be an engineering thing. Following your approach for now, and eventually we can debate whether to transpose the matrix if one dimension matches the number of sketches in the object but not the expected one.

Anyway, I looked more at the docs and see them using unchecked references (after doing a bounds check) so I switched to that, and then I added in a check for c-style vs fortran-style indexing so that I believe it'll have the inner loop over the native dimension. In theory it'll walk linearly through the matrix. That or I got it exactly backwards and am thrashing some cache level, one of the two :D

If you have some time, please check out the branch and play with it for a bit to ensure it's still behaving as you expect. Then we can figure out some relevant unit tests.

  jon
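For reference, the C-order vs Fortran-order concern above can be seen directly from NumPy; this small illustration is not part of the wrapper code, just a demonstration of the layout issue from the Python side.

    import numpy as np

    a = np.random.randn(1000, 3)     # C-contiguous: each row is adjacent in memory
    b = a.T                          # transposed view: Fortran-contiguous, same data
    print(a.flags['C_CONTIGUOUS'], b.flags['C_CONTIGUOUS'])   # True False

    # If the wrapper walks memory linearly, a strided view can be handed over as a
    # contiguous copy first:
    c = np.ascontiguousarray(b)
    print(c.flags['C_CONTIGUOUS'])   # True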



On Fri, May 22, 2020 at 7:06 AM Michael Himes <mh...@knights.ucf.edu>> wrote:
Jon,

Those changes sound great, as long as the data is being accessed correctly. The pybind docs warn about accessing data through the array_t object since it's not guaranteed to be contiguous in memory.  Typically, they demonstrate accessing it through the buffer, which I followed.  But if this is an unnecessary step, then great.

As for the 2D case, here is my line of thinking.  For 1D, we have a single row with d values.  So for 2D, we'd have n rows with d values, (n x d).  I believe that is how I coded it, but it's possible I flipped the dimensions.

Michael

________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Thursday, May 21, 2020 7:17 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Michael,

I've restructured the object to be an actual C++ object with proper methods. And then I've gotten rid of all the casts to buffer in favor of just using the py::array_t<> that's passed in. That removes casting everything to double, and allows for range checks. Now an attempt to access sketch 7 in a 5-d array doesn't just segfault :)

Looking at pybind docs a bit more, it seems there are no hard guarantees on data layout in memory with numpy arrays -- if you transpose one, walking through with a pointer will return items in the wrong order. So update() ends up using items.at() instead (more on that in a moment). The whole thing is probably also copying values around more than necessary. Anyway, we can look at ways to optimize such things eventually, but for now I'm working on ensuring correctness and at least somewhat graceful failure.

Anyway, item input order. If we have 1-d input, we implicitly assume we want d updates, one for each dimension in the object. It seems like the default for numpy is row-major order, which makes sense given C beneath the hood. But for inputting n points at a time, do you expect the matrix to be (d x n) or (n x d)?

  jon

On Tue, May 19, 2020, 5:20 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Re: the template type A, I set that for the Python array data type.  A Python float is 64 bits, so that is a C++ double.  I thought it was necessary to set the py::array_t data type since I think it's a template, but I could be mistaken.

Michael

________________________________
From: leerho <le...@gmail.com>>
Sent: Tuesday, May 19, 2020 7:46 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Excellent work!

On Tue, May 19, 2020 at 4:04 PM Jon Malkin <jo...@gmail.com>> wrote:
I also used k=160, so in this case we matched nicely. And the bunches of 2^5 or 2^7 you were testing are exactly what I meant when referring to batched inputs. So that's good news.

I'll take a more careful look through the code -- there was something with update using arrays of templated type A which was always cast to double, for instance. But this is certainly promising.

  jon

On Tue, May 19, 2020 at 3:32 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Great tests (especially with the ordering), Jon!

I did some scaling tests for dimensionality (1, 10, and 100), and this is where I think the Numpy version shows its benefits.  I performed a test similar to your setup:
- each sketch has k = 160 (unsure what you used for this value, if it matters)
- 2^25 draws from a normalized Gaussian distribution (numpy.random.normal)
- get_quantiles(0.5)

d=1    -- 84 s (this is the 123 s case you ran)
d=10   -- 88 s
d=100  --  294 s
d=1000 -- 2298 s (did this one for fun, but there is a lot of variability in runtime)

Note that I did not use a single-value method, just the Numpy version.  Also, I checked the compute cost of the Python loop, and it's about 1 second, so most of that ~80 seconds is the communication between Python and C++.  The scaling relation looks to be better than linear, but there needs to be a few more tests here to really determine that.
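The scaling test above was roughly of this form; this is a sketch under assumptions -- kll_floatarray_sketches is the in-progress wrapper from the branch, and the quantile query is omitted here since its exact signature was still being settled at this point in the thread.

    import time
    import numpy as np
    from datasketches import kll_floatarray_sketches   # in-progress wrapper from the branch

    k = 160
    for d in (1, 10, 100):
        kll = kll_floatarray_sketches(k, d)
        start = time.time()
        for _ in range(2**25):
            kll.update(np.random.randn(d))   # one row of d values per call
        print(d, time.time() - start)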

But, as Lee pointed out, there is non-negligible overhead from crossing the bridge between Python and C++.  It's small, but when doing it 2^25 times it adds up.  The Numpy implementation allows you to cross that bridge much less often, albeit at the cost of some extra time programming that part.  If I set up a queue that holds 2^5 values and then updates the sketch with the whole batch, it's quite a bit better.  Here are the results for the same dimensions as before:

d=1   -- 8 s
d=10  -- 31 s
d=100 -- 257 s

So, even with a small queue of 32 values, we see that a single sketch using kll_sketches is faster than a kll_sketch by a factor of 2-3.  And with the batch set to 2^7 values (this is how I use it in my project):
d=1   -- 4.2 s
d=10  -- 27 s
d=100 -- 251 s

The speed gain doesn't seem to scale with dimensionality, but I think that has more to do with the compute overhead of generating the data since Numpy tends to be faster when working in 1D vs multiple dimensions.  But we can see that it's possible to get runtimes much closer to C++ runtimes than would be expected.
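A minimal sketch of that batching idea, assuming the in-progress wrapper's update() accepts an (n x d) array as discussed above; the exact names follow the branch as it appears in this thread.

    import numpy as np
    from datasketches import kll_floatarray_sketches   # in-progress wrapper from the branch

    k, d, batch = 160, 100, 2**7
    kll = kll_floatarray_sketches(k, d)

    queue = []
    for _ in range(2**25):
        queue.append(np.random.randn(d))
        if len(queue) == batch:
            kll.update(np.vstack(queue))     # one Python->C++ crossing per 128 rows
            queue = []
    if queue:                                 # flush any remainder
        kll.update(np.vstack(queue))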

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Tuesday, May 19, 2020 4:58 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Well, one thought was maybe we could always use the vectorized kll in python and make it (relatively) easy to have it work with only 1 dimension. It looks like there's still a non-trivial performance hit from that. But wow.. I realized I could try something simple like reversing the declaration order of single-update vs vector-update in the wrapper class. And that dropped it to 35s!

With that, it may be worth exploring a unified wrapper that handles single items or vectors.

  jon

On Tue, May 19, 2020 at 1:52 PM leerho <le...@gmail.com>> wrote:
We had a similar issue in Java trying to use JNI to access C code.  Every transition across the "boundary" between Java and C took from 10 to 100 microseconds.  This made the JNI option pretty useless from our standpoint.

I don't know python that well, but I could well imagine that there may be a similar issue here in moving data between Python and C++.

That being said, compared to brute-force computation of these types of queries in Python vs using even these (what we consider slow performing) sketches in Python still may be a huge win.

Lee.



On Tue, May 19, 2020 at 1:28 PM Jon Malkin <jo...@gmail.com>> wrote:
I tried comparing the performance of the existing floats sketch vs the new thing with a single dimension. And then I made a second update method that handles a single item rather than creating an array of length 1 each time. Otherwise, the scripts were as identical as possible. I fed in 2^25 gaussian-distributed values and queried for the mean to force some computation on the sketch. I think get_quantile(0.5) vs get_quantiles(0.5)[0][0] was the only difference.

Existing kll_floats_sketch: 31s
kll_floatarray_sketches: 123s
with single-item update: 80s

Same test in c++: 1.7s  (I can get it to 1.4s but that's using a worse RNG so this seemed more fair)

I didn't try batching updates, even though in theory the new object can support that. This was more a test to see the performance impact of using it for all kll sketches.

At some level, if you're already ok taking the speed hit for python vs C++ then maybe it doesn't matter. But >2x still seems significant to me.
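The single-sketch timing described above can be approximated with something like the following; this is an illustrative script, not the exact one used in the test, and it assumes the released kll_floats_sketch Python binding.

    import time
    import numpy as np
    from datasketches import kll_floats_sketch

    n = 2**25
    values = np.random.randn(n).astype(np.float32)

    sk = kll_floats_sketch(160)
    start = time.time()
    for v in values:
        sk.update(v)                          # one Python->C++ crossing per item
    elapsed = time.time() - start
    print(elapsed, sk.get_quantile(0.5))      # force some computation on the sketch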

  jon

On Thu, May 14, 2020 at 6:54 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Great, I'll be submitting the pull request shortly.  The codebase I'm working with doesn't have any of the changes made in the past week or so, hopefully that isn't too much of a hassle to merge.

As an aside, my employer encourages us to contribute code to libraries like this, so I'm happy to work on additional features for the Python interface as needed.

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Thursday, May 14, 2020 6:56 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

We've been polishing things up for a release, so that was one of several things that we fixed over the last several days. Thank you for finding it!

Anyway, if you're generally happy with the state of things (and are allowed to under any employment terms), I'd encourage you to create a pull request to merge your changes into the main repo. It doesn't need to be perfect as we can always make changes as part of the PR review or post-merge.

Thanks,
  jon


On Mon, May 11, 2020 at 2:25 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Thanks for taking a look, Jon.

I pushed an update that address 2 & 4.

#3 is actually something I had a question about. I've tested passing numpy.nan into the update function, and it doesn't appear to break anything (min, max, etc all still work correctly).  However, the reported number of items per sketch counts the nan entries.  Is this the expected behavior, or should the get_n() method return a number that does not count the nans it has seen?  I expected the latter, so I'm worried that numpy's nan is being treated differently.
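A quick way to probe this -- illustrative only, using the method names mentioned above -- is to feed a few NaNs into a single kll_floats_sketch and compare get_n() against the count of non-NaN updates:

    import numpy as np
    from datasketches import kll_floats_sketch

    sk = kll_floats_sketch(160)
    for v in [1.0, 2.0, np.nan, 3.0, np.nan]:
        sk.update(v)

    # If NaNs are ignored, get_n() should report 3; if they are counted, 5.
    print(sk.get_n(), sk.get_min_value(), sk.get_max_value())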

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Monday, May 11, 2020 4:32 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

I didn't look in super close detail, but the code overall looks pretty good. Comments are below.

Note that not all of these necessarily need changes or replies. I'm just trying to document things we'll want to think about for keeping the library general-purpose (and we can always make changes after merging, of course).

1. I worry the name kll_sketches is confusingly similar to kll_sketch. Maybe vector_kll_sketches? But if there's a way to extend KLL in the future to operate on an entire vector at a time (vs treating each dimension independently) that'd become confusing. I think an inherently vectorized version would be a very different beast, but I always worry I'm not being imaginative enough. If merging into the Apache codebase, I'd probably wait to see what the file looks like with the renaming before a final decision on moving to its own file.

2. What happens if the input to update() has >2 dimensions? If that'd be invalid, we should explicitly check and complain. If it'll Do The Right Thing by operating on the first 2 dimensions (meaning correct indices) that's fine, but otherwise should probably complain.
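One way to express that check, shown here from the Python side purely for illustration (the real validation would live in the C++ wrapper; the helper below is hypothetical):

    import numpy as np

    def coerce_update_input(data, num_sketches):
        """Illustrative shape check: accept 1-D (one point) or 2-D (n points) input."""
        arr = np.asarray(data, dtype=np.float64)
        if arr.ndim == 1:
            arr = arr.reshape(1, -1)              # a single point across all dimensions
        if arr.ndim != 2:
            raise ValueError("expected 1 or 2 dimensions, got %d" % arr.ndim)
        if arr.shape[1] != num_sketches:
            raise ValueError("expected %d columns, got %d" % (num_sketches, arr.shape[1]))
        return arr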

3. Can this handle sparse input vectors? Not sure how important that is in general, even if your project doesn't require it. kll_sketch will ignore NaNs, so those appearing would mean the number of items per sketch can already differ.

4. I'd probably eat the very slightly increased space and go with 32 bits for the number of dimensions (aka number of sketches). If trying to look at a distribution of values for some machine learning application, it'd be easy to overflow 65k dimensions for some tasks.

5. I imagine you've realized that it's easiest to do unit tests from python in this case. That's another advantage of having this live in the wrapper.

6. Finally, that assert issue is already obsolete :). Asserts were converted to if/throw exceptions late last week. It'll be flagged as a conflict in merging, so no worries for now.

Looking good at this point. And as I said, not all of these need changes or comments from you.

  jon

On Mon, May 11, 2020 at 7:09 AM Michael Himes <mh...@knights.ucf.edu>> wrote:
Understood, I went ahead and moved the new class to the kll_wrapper.cpp file -- I'll leave it to you to decide if it's better as its own file.

Also, while gcc 7.4.0 compiles the code without issue, using gcc 7.5.0 throws errors regarding the assert calls in kll_sketch_impl.hpp.  I added an include of assert.h there and then it compiled without issue.  It's possible that other compilers will also complain about that, so maybe this is a good update to the main branch.

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Sunday, May 10, 2020 10:47 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

My only comment without having looked at actual code is that the new class would be more appropriate in the python wrapper. Maybe even drop it in as its own file, as that would decrease recompile time a bit when debugging (that's pybind's suggestion, anyway). Probably not a huge difference with how light these wrappers are.

If this is something that becomes widely used, to where we look at pushing it into the base library, we'd look at whether we could share any data across sketches. But we're far from that point currently. It'd be a nice problem to have to consider that.

  jon

On Sun, May 10, 2020, 7:33 PM leerho <le...@gmail.com>> wrote:
Michael,  this has been a great interchange and certainly will allow us to move forward more quickly.

Thank you for working on this on a Mother's Day Sunday!

I'm sure Alex and Jon may have more questions, when they get a chance to look at it starting tomorrow.

Cheers, and be safe and well!

Lee.

On Sun, May 10, 2020 at 6:25 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Re: testing, so far I've just done glorified unit tests for uniform and normal distributions of varying sizes.  I plan to do some timing tests vs the existing single-sketch Python class to see how it compares for 1, 10, and 100 streams.

1. That makes sense.  One option to allow full Numpy compatibility but without requiring a Python user to use Numpy would be to return everything as lists, rather than Numpy arrays.  Numpy users could then convert those lists into arrays, and non-Numpy users would be unaffected (aside from needing the pybind11/numpy.h header).  Alternatively, some flag could be set when instantiating the object that would control whether things are returned as lists or arrays, though this still requires the numpy.h header file.

2. I didn't change the kll_sketch code, I only defined a new (wrapper) class called kll_sketches, which spawns a user-specified number of sketches.  Each of those sketches is a kll_sketch object and uses all of the existing code for that.  For fast execution in Python, the parallel sketches must be spawned in C++, but the existing Python object could only spawn a single sketch since it wraps the kll_sketch class.  Perhaps the kll_sketches class would be better placed in the python/src/kll_wrapper.cpp file?  I suppose you wouldn't need this class if you weren't using Python.
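For reference, the same behavior can be mocked in pure Python with a list of kll_floats_sketch objects -- useful only as a correctness reference, since the per-element Python loop is exactly the overhead the C++ class avoids. The names below are illustrative, not the branch's API.

    import numpy as np
    from datasketches import kll_floats_sketch

    class PyKllSketches:
        """Pure-Python reference: d independent kll_floats_sketch objects."""
        def __init__(self, k, d):
            self.sketches = [kll_floats_sketch(k) for _ in range(d)]

        def update(self, row):
            # row is a length-d array; this Python loop is the cost the C++ class avoids
            for sk, value in zip(self.sketches, np.asarray(row, dtype=float)):
                sk.update(float(value))

        def get_quantiles(self, fraction):
            return np.array([sk.get_quantile(fraction) for sk in self.sketches])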

3. Yes, SerDe is very straight-forward here.  I've marked some stuff as todo's, and that is one of them -- the plan is to do like you described and call the relevant kll_sketch method on each of the sketches and return that to Python in a sensible format.  For deserialization, it would just iterate through them and load them into the kll_sketches object.  I don't require it for my project, so I didn't bother to wrap that yet -- I'll take a look sometime this week after I finish my work for the day, shouldn't take long to do.

4. That makes sense.  Does using Numpy complicate that at all?  My thought is that since under the hood everything is using the existing kll_sketch class, it would have full compatibility with the rest of the library (once SerDe is added in).

Michael
________________________________
From: leerho <le...@gmail.com>>
Sent: Sunday, May 10, 2020 8:42 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Thanks for the link to your code.  My colleagues, Jon and Alex, will take a closer look this next week.  They wrote this code so they are much closer to it than I.

What you have done so far makes sense for you as you want to get this working in the NumPy environment as quickly as possible.  As soon as we start thinking about incorporating this into our library other concerns become important.

1. Adding API calls is the recommended way to add functionality (like NumPy) to a library.  We cannot change API calls in a way that is only useful with NumPy, because it would seriously impact other users of the library that don't need NumPy.  If both sets of calls cannot simultaneously exist in the same sketch API, then we need to consider other alternatives.

2.  Based on our previous discussions, I didn't envision that you would have to change the kll_sketch code itself other than perhaps a "wrapper" class that enables vectorized input to a vector of sketches and a vectorized get result that creates a vector result from a vector of sketches.  This would isolate the changes you need for NumPy from the sketch itself.  This is also much easier to support, maintain and debug.

3. If you don't change the internals of the sketch then SerDe becomes pretty straightforward. I don't know if you need a single serialization that represents a full vector of sketches,  but if you do, then I would just iterate over the individual serdes and figure out how to package it.  I really don't think you want to have to rewrite this low-level stuff.

4. Binary compatibility is critically important for us and I think will be important for you as well.  There are two dimensions of binary compatibility: history and language.  This means that a kll sketch serialized from Java, can be successfully read by C++ and vice versa.  Similarly, a kll sketch serialized today will be able to be read many years from now.  Another aspect of this would mean being able to collect, say, 100 sketches that were not created using the NumPy version, and being able to put them together in a NumPy vector; and vice versa.

I hope all of this makes sense to you.

Cheers,

Lee.



On Sun, May 10, 2020 at 4:21 PM leerho <le...@gmail.com>> wrote:
Michael,
This is great!  What testing have you been able to do so far?


On Sun, May 10, 2020 at 3:31 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Lee,

Thanks for all of that information, it's quite helpful to get a better understanding of things.

I've put the code on Github if you'd like to take a look: https://github.com/mdhimes/incubator-datasketches-cpp

Changes are
- new class in kll/include/kll_sketch.hpp, w/ associated constructor in kll/include/kll_sketch_impl.hpp.  This class spawns a specified number of sketches.
- new Python interface functions in python/src/kll_wrapper.cpp

The only new dependency introduced is the pybind11/numpy.h header file.  The new Numpy-compatible Python classes retain identical functionality to the existing classes (with minor changes to method names, e.g., get_min_value --> get_min_values), except that I have not yet implemented merging or (de)serialization.  These would be straight-forward to implement, if needed.
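
As a usage illustration only (constructor arguments and method names are taken from this branch and may still change), the new class is meant to be used like this:

import numpy as np
from datasketches import kll_sketches   # the new vectorized class from this branch

k, d = 160, 100
kll = kll_sketches(k, d)                # d independent KLL sketches, one per dimension
for _ in range(1000):
    kll.update(np.random.randn(d))      # one call updates every sketch
medians = kll.get_quantiles(0.5)        # numpy array, one entry per sketch
mins = kll.get_min_values()             # renamed from the single-sketch get_min_value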

Re: characterization tests, I'll take a look at those tests you linked to and see about running them, time and compute resources permitting.

Michael
________________________________
From: leerho <le...@gmail.com>>
Sent: Sunday, May 10, 2020 5:32 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Michael,

Is there a place on GitHub somewhere where I could look at your code so far?  The reason I ask, is before you do a PR, we would like to determine where a contribution such as this should be placed.

Our library is split up among different repositories, determined by language and dependencies.  This keeps the user downloads smaller and more focused.   We have two library repos for the core sketch algorithms, one for Java and one for C++/Python, where the dependencies are very lean, which simplifies integration into other systems.  We have separate repos for adaptors, which depend on one of the core repos. On the Java side, we have separate repos for adaptors for Apache Hive and Apache Pig, as the dependencies for each of these are quite large.  For C++, we have a dedicated repo for the adaptors for PostgreSQL.

Some of our adaptors are hosted with the target system.  For example, our Druid adaptors were contributed directly into Apache Druid.

I assume your code has dependencies on Python, NumPy and DataSketches-cpp. It is not clear to me at the moment whether we should create a separate repo for this or have a separate group of directories in our cpp repo.

****
We have a separate repo for our characterization code, which is not formally "released" as an Apache release.  It exists because we want others to be able to reproduce (or challenge) our claims of speed performance or accuracy.  It is the one repo where we have all languages and many different dependencies.  The coding style is not as rigorous or as well documented as our repos that do have formal releases.

Characterization testing is distinctly different from Unit Tests, which basically checks all the main code paths and makes sure that the program works as it should.  The key metric is code coverage and Unit Tests should be fast as it is run on every check-in of new code.  Characterization is also different from Integration Testing, which is testing how well the code works when integrated into larger systems.

Characterization tests are unique to our kind of library. Because our algorithms are probabilistic in nature, in order to verify accuracy or speed performance we need to run many thousands of trials to eliminate statistical noise in the results.  And when the data is large, this can take a long time.  You can peruse our website for many examples as all the plots result from various characterization studies.  What appears on the website is but a small fraction of all the testing we have done.

There are no "standard" tests as every sketch is different so we have to decide what is important to measure for a particular sketch, but the basic groups are speed and accuracy.

For speed there are many possible measurements, but the basic ones are update speed, merge speed, Serialization / Deserialization speed, get estimate or get result speeds.

For accuracy we want to validate that the sketch is performing within the bounds of the theoretical error distribution.  We want to measure this accuracy in the context of a stand-alone, purely streaming sketch and also in the context of merging many sketches together.

We also try to do these same tests comparing the results against other alternatives users might have.  We have performed these same characterizations on other publicly available sketches as well as against traditional, brute-force approaches to solving the same problem.

For the solution you have developed, we would depend on you to decide what properties would be most important to characterize for users of this solution.  It should be very similar to what you would write in a paper describing this solution;  you want to convince the reader that this is very useful and why.

Since the first sketch you have leveraged is the KLL quantiles sketch, I would think you would want some characterizations similar to what we did for our studies (https://datasketches.apache.org/docs/Quantiles/KLLSketch.html) comparing our older quantiles sketch and the KLL sketch.

****
For the Java characterization tests, we have "standardized" on having small configuration files which define the key parameters of the test.  These are simple text files (https://github.com/apache/incubator-datasketches-characterization/tree/master/src/main/resources) of key-value pairs.  We don't have any centralized definition of these pairs, just that they are human readable and intelligible.  They are different for each type of sketch.

For the C++ tests, we don't have a collection of config files yet (this is one of our TODOs), but the same kind of parameters are set in the code itself.

We will likely want to set up a separate directory for your characterization tests.

I hope you find this helpful.

Cheers,

Lee.

On Sun, May 10, 2020 at 10:05 AM Michael Himes <mh...@knights.ucf.edu>> wrote:
The code is in a good state now.  It can take individual values, lists, or Numpy arrays as input, and it returns back Numpy arrays.  There are some additional features, like being able to specify which sketches the user wants to, e.g., get quantiles for.

But, I have only done minor testing with uniform and normal distributions.  I'd like to put it through more extensive testing (and some documentation) before releasing it, and it sounds like your characterization tests are the way to go -- it's not science if it's not reproducible!  Is there a standard set of tests for this purpose?  If not, are there standard tests that have been used for the existing codebase?

Michael
________________________________
From: leerho <le...@gmail.com>>
Sent: Saturday, May 9, 2020 7:21 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

This is great.  The first step is to get your project working!  Once you think you are ready, it would be really useful if you could do some characterization testing in the NumPy environment. Characterization tests are what we run to fully understand how a sketch performs over a range of parameters and using thousands to millions of trials.  You can see some of the accuracy and speed performance plots of various sketches on our website.  Sometimes these can take hours to run.  We typically use synthetic data to drive our characterization tests to make them reproducible.

Real data can also be used and one comparison test I would recommend is comparing how long it takes to get approximate results using sketches versus how long it would take to get exact results using brute force methods.  The bigger the data set is the better :)
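
A small-scale version of that comparison (accuracy only, assuming the standard kll_floats_sketch Python binding; the real payoff is when the data no longer fits in memory, which a toy example can't show) could look like:

import numpy as np
from datasketches import kll_floats_sketch

data = np.random.randn(1_000_000).astype(np.float32)

sk = kll_floats_sketch(200)
for x in data:                     # streaming: one pass, bounded memory
    sk.update(float(x))

print(np.quantile(data, 0.5))      # brute force: needs the full array in memory
print(sk.get_quantile(0.5))        # approximate, within the sketch's error bounds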

We don't have much experience with NumPy so this will be a new environment for us.  But before you get too deep into this please get us involved.  We have been characterizing these streaming algorithms for a number of years, and would like to help you.

Cheers,

Lee.

On Sat, May 9, 2020 at 2:18 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
I'm not quite sure what being a committer entails, but yeah I'm happy to contribute.  I can't commit a lot of time to working on it, but with how things went for KLL I don't think it will take a lot of time for the other sketches if they are formatted in a similar manner.  Getting this library integrated into numpy/scipy would be awesome, I'm sure I could get some others in my field to begin using it.

Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Saturday, May 9, 2020 5:06 PM
To: Michael Himes <mh...@knights.ucf.edu>>; dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

This is just awesome!   Would you be interested in becoming a committer on our project?  It is not automatic, but we could work with you to bring you up to speed on the other sketches in the library.  If you could help us integrate DataSketches into NumPy and possibly SciPy (not sure if this is necessary) it would be a very significant contribution and we would definitely want you to be part of our community!

Thanks,

Lee.

On Sat, May 9, 2020 at 1:41 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Lee,

Thanks for the notice, I went ahead and subscribed to the list.

As for Jon's email, this is actually what I have currently implemented!  Once I finish ironing out a couple improvements, I'm going to move some code around to follow the existing coding style, put it on Github, and submit a pull request.

Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Saturday, May 9, 2020 4:22 PM
To: Michael Himes <mh...@knights.ucf.edu>>
Subject: Fwd: Permission to use KLL streaming-quantiles code in free open-source academic software

Hi Michael,
I don't think you saw this email as I doubt you are subscribed to our dev@datasketches.apache.org<ma...@datasketches.apache.org> email list.

We would like to have you as part of our larger community, as others might also have suggestions on how to move your project forward.
You can subscribe by sending an empty email to dev-subscribe@datasketches.apache.org<ma...@datasketches.apache.org>.

Lee.

---------- Forwarded message ---------
From: Jon Malkin <jo...@gmail.com>>
Date: Thu, May 7, 2020 at 4:11 PM
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software
To: <de...@datasketches.apache.org>>
Cc: Lee Rhodes <lr...@verizonmedia.com>>, Edo Liberty <ed...@gmail.com>>, edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>


We're using pybind11 to get a C++ interface with python (vs raw C). The wrappers themselves are quite thin, but they do have examples of calling functions defined in the wrapper as opposed to only the sketch object.

I believe the easiest way to do this will be to define a pretty simple C++ object and create a pybind wrapper for it.  That object would contain a std::vector<kll_sketch>.  Then you'd define an update method for your custom object that iterates through a numpy array and calls update() on the appropriate sketch. You'd also want to define something similar for get_quantile() or whatever other methods you need that iterates through that vector of sketches and returns the result in a numpy array.

That's a pretty lightweight object. And then you'd use a similar thin pybind wrapper around it to make it play nicely with python. Since our C++ library is just templates, you'd end up with a free-standing library, with no requirement that the base datasketches library be involved.
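
To pin down the intended behavior, here is a rough pure-Python equivalent (all names illustrative; the real version would be the C++ object described above behind a thin pybind wrapper, which is what makes it fast):

import numpy as np
from datasketches import kll_floats_sketch

class vector_of_kll_sketches:
    def __init__(self, k, d):
        # d independent sketches, one per dimension of the input vectors
        self.sketches = [kll_floats_sketch(k) for _ in range(d)]

    def update(self, values):
        # values: length-d numpy array; each element updates its own sketch
        for sk, v in zip(self.sketches, np.asarray(values, dtype=float)):
            sk.update(float(v))

    def get_quantiles(self, rank):
        # one quantile per sketch, returned as a numpy array
        return np.array([sk.get_quantile(rank) for sk in self.sketches])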

  jon

On Thu, May 7, 2020 at 1:08 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
I would be happy to share whatever I come up with (if anything).  The lack of a Numpy/Scipy implementation is what led me to the DataSketches library; it would be very useful to myself and others if it were a part of Numpy/Scipy.

For what it's worth, passing in a Numpy array and manipulating it from the C++ side is quite easy.  On the other hand, figuring out how to spawn m sketches and pass the values along to them looks like it'll be more challenging; there is a lot of code here and it'll take some time for me to familiarize myself with it.

Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Thursday, May 7, 2020 12:00 PM
To: Michael Himes <mh...@knights.ucf.edu>>
Cc: Edo Liberty <ed...@gmail.com>>; dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>; edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

If you do figure out how to do this, it would be great if you could share it with us.  We would like to extend it to other sketches and submit it as an added functionality to NumPy.  I have been looking at the NumPy and SciPy libraries and have not found anything close to what we have.

Lee.


On Thu, May 7, 2020 at 7:08 AM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Lee, Jon,

Thanks for the information.  I tried to vectorize things this morning and ran into that exact problem -- since the offsets can differ, it leads to slices of different lengths, which wouldn't be possible to store as a single Numpy array.

Lee, your understanding of my problem is spot on.  n vectors of size m, where all m elements of each vector are a float (no NaNs or missing values).  I am interested in quantiles at rank r for each of the m streams.  Only 1 sketch will operate simultaneously, saving/loading the sketch is not required (though it would be a nice feature), and sketches would not need to be merged (no serialization/deserialization).

Not surprisingly, it looks like your original suggestion of handling this on the C++ side is the way to go.  Once I have time to dive into the code, my plan is to write something that implements what you described in the earlier email.

Thanks,
Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Wednesday, May 6, 2020 10:43 PM
To: Michael Himes <mh...@knights.ucf.edu>>
Cc: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>; Edo Liberty <ed...@gmail.com>>; edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>

Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Michael,

One of my colleagues, Jon Malkin, pointed out that the vector-KLL will not work for another reason and that is for each dimension, choosing whether to delete the odd or even values in the compactor must be random and independent of the other dimensions.  Otherwise you might get unwanted correlation effects between the dimensions.

This is another argument that you should have independent compactors for each dimension.  So you might as well stick with individual sketches for each dimension.
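
A toy numpy illustration of that independence argument (this is not the actual kll_sketch compaction, only the per-dimension coin flip):

import numpy as np

rng = np.random.default_rng()
d, c = 4, 8
buf = rng.normal(size=(d, 2 * c))        # pretend compactor buffers, one row per dimension
offsets = rng.integers(0, 2, size=d)     # independent coin flip per dimension
compacted = np.stack([buf[i, offsets[i]::2] for i in range(d)])
# reusing a single flip (e.g. offsets[0]) for every row would tie the rows'
# sampling decisions together -- the unwanted correlation described above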

Lee.

On Wed, May 6, 2020 at 4:39 PM Lee Rhodes <lr...@verizonmedia.com>> wrote:
Michael,

Allow me to back up for a moment to make sure I understand your problem.

You have a large number of large vectors of the form V_n = {x_i}:  n vectors of size m, where x is a number and x_i is the ith element, or equivalently, the ith dimension.

Assumptions:

  *   All vectors, V, are of the same size m.
  *   All elements, x_i, are valid numbers of the same type. No missing values, and if you are using floats, this means no NaNs.

In aggregate, the n vectors represent m independent distributions of values.

Your task is to be able to obtain m quantiles at rank r in a single query.

****
To do this, using your idea, would require vectorization of the entire sketch and not just the compactors.  The inputs are vectors, the result of operations such as getQuantile(r), getQuantileUpperBound(r), getQuantileLowerBound(r), are also vectors.

This sketch will be a large data structure, which leads to more questions ...

  *   Do you anticipate having many of these vectorized sketches operating simultaneously?
  *   Is there any requirement to store and later retrieve this sketch?
  *   Or, the nearly equivalent question: Do you require merging of these sketches (across clusters, for example)?  Which also means serialization and deserialization.

I am concerned that this vector-quantiles sketch would be limited in the sense that it may not be as widely applicable as it could be.

Our experience with real data is that it is ugly with missing values, NaN, nulls, etc.  Which means we would not be able to vectorize the compactor.  Each dimension i would need a separate independent compactor because the compaction times will vary depending on missing values or NaNs in the data.

Spacewise, I don't think having separate independent sketches for each dimension would be much smaller than vectorizing the entire sketch, because the internals of the existing sketch are already quite space efficient leveraging compact arrays, etc.

As a first step I would favor figuring out how to access the NumPy data structure on the C++ side, having individual sketches for each dimension, and doing the iterations updating the sketches in C++.   It also has the advantage of leveraging code that exists and it would automatically be able to leverage any improvements to the sketch code over time.  In addition, it could be a prototype of how to integrate other sketches into the NumPy ecosystem.

A fully vectorized sketch would be a separate implementation and would not be able to take advantage of these points.

Lee.

On Wed, May 6, 2020 at 2:47 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Lee,

I don't think there is a problem with the DataSketches library, just that it doesn't support what I am trying to do -- looking in the documentation, it only supports streams of ints or floats, and those situations work fine for me.  Here's what I did:
- began with the KLL test .py file: https://github.com/apache/incubator-datasketches-cpp/blob/master/python/tests/kll_test.py
- replaced line 30 with kll.update(np.ones(10) * randn())  to have a Numpy array of 10 identical values.
- ran the code

This leads to the following error, as expected:
TypeError: update(): incompatible function arguments. The following argument types are supported:
    1. (self: datasketches.kll_floats_sketch, item: float) -> None

Invoked with: <datasketches.kll_floats_sketch object at 0x7f1e128989d0>, array([-1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424,
       -1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424])

It's not coded to support Numpy arrays, therefore it complains.  What I would ideally like to have happen in this scenario is it would treat each element in the array as a separate stream.  Then, later when getting a given quantile, it would give 10 values, one for each stream.  I don't see an easy approach to implementing this on the Python side besides a very slow iterative approach, and admittedly my C++ is quite rusty so I haven't looked into the codebase to see how I might modify things there to support this functionality.

Re: the streaming-quantiles code being easily modified, I believe the only necessary changes would be changing the Compactor class to be a subclass of numpy.ndarray, rather than list, and implementing the list-specific methods that are used, like .append().  Then, it isn't necessary to loop over the streams since we can make use of Numpy's broadcasting, which will handle the looping in its C++ code, as you mentioned.  I'll work on this and see if it really is as straight-forward as it seems.

If you have any advice on how to use DataSketches for my problem, I'm certainly open to that.

Thanks,
Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Wednesday, May 6, 2020 4:37 PM
To: Michael Himes <mh...@knights.ucf.edu>>; dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Cc: Edo Liberty <ed...@gmail.com>>; edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Michael,

Thank you for considering the DataSketches library.   I am adding this thread to our dev@datasketches.apache.org<ma...@datasketches.apache.org> so that our whole team can contribute to finding a solution for you.

WRT the error you experienced, please help us help you by sharing with us what the exact error was.

We are about to release a major upgrade to the DataSketches C++/Python product in the next few weeks.  We have fixed a number of stability issues and bugs, which may solve the problem.  Nonetheless, we want to work with you to get your problem solved.

Updating 1e5 sketches in a system is not a problem in Java or C++.   We have real-time systems today that generate and process over 1e9 sketches every day.  Unfortunately our experience tells us that looping in Python code will be 10 to 100 times slower than Java or C++.  This is because the code would have to switch from Python to C++ for every vector element.

By comparison, the streaming-quantiles code could be easily modified to use Numpy arrays and operate on vectors.

I would like to understand more about what you have in mind that would be "easily modified".

NumPy achieves its speed performance by doing all of the matrix operations in pre-compiled C++ code.  To achieve best performance, we would want to read and loop through the NumPy data structure on the C++ side leveraging the C++ DataSketches library directly.  I am not sure what would be involved to actually accomplish that.

But first we need to get your Python + NumPy code working correctly with our library so we can find out what its actual performance is.

Cheers,

Lee.





On Wed, May 6, 2020 at 12:10 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Edo, Lee,

Thanks for the prompt response.  I looked at the datasketches library, and while it seems to have a lot more features, it looks like it'll be a lot more difficult to get it to work for my desired use case.

My problem is that I need quantiles for each element of a vector (length on the order of 1e4 -- 1e5), for some finite stream of vectors (on the order of 1e6 -- 1e8).  I tried using datasketches's KLL with Numpy arrays, but it throws an error, so it doesn't seem like datasketches handles this situation currently.

To use datasketches, I think I would need to instantiate 1 object per vector element, and I suspect this will slow things down considerably due to iterating over the objects when each vector is processed.  By comparison, the streaming-quantiles code could be easily modified to use Numpy arrays and operate on vectors.  I ran a few unit tests on both codes and found equivalent behavior, as expected.

Do you have any recommendation(s) for this situation?  Are there known limitations of the streaming-quantiles code that would cause issues for my use case?  Are the other methods offered in datasketches 'better' than the KLL implemented in streaming-quantiles?  I'm quite out of my area of expertise, so I appreciate any advice you can offer, and I will of course acknowledge it in the publication.

Best,
Michael

________________________________
From: Edo Liberty <ed...@gmail.com>>
Sent: Tuesday, May 5, 2020 8:09 PM
To: Lee Rhodes <lr...@verizonmedia.com>>; Michael Himes <mh...@knights.ucf.edu>>
Cc: edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

+Lee

Hi Michael, Thanks for reaching out.
While you can certainly do that, I recommend using the python-Binded datasketches library. It will be more robust, faster, and bug free than my code :)

On Tue, May 5, 2020 at 14:11 Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Edo,

I'm currently working on a Python package for machine-learning-accelerated exoplanet modeling.  It is free and open source (see here if you're curious https://github.com/exosports/HOMER), and it's meant purely for reproducible academic research.

I'm adding some new features to the software, and one of them requires computing quantiles for a data set that cannot fit into memory.  After searching around for different methods to do this, your KLL method seemed to be a good option in terms of speed and space requirements.

Rather than reinvent the wheel and code my own implementation of the method from scratch, I was wondering if you'd be willing to allow me to use your code?  I don't see a license, so I wanted to make sure you're okay with this.  I could implement it as a submodule within my repo, or I could only include the kll.py file and add some additional comments pointing to your repo and such, whichever you prefer.

Best,
Michael
--
From my cell phone.

Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Posted by Jon Malkin <jo...@gmail.com>.
That's the range() command complaining -- 1e6 is a float, but range wants
an int. It worked if I instead changed the line to
for i in range(int(1e6)):

  jon

On Mon, May 25, 2020 at 1:36 PM Michael Himes <mh...@knights.ucf.edu>
wrote:

> Hi Jon,
>
> Just got around to testing it out.  Maybe I am doing something wrong here,
> but I can't get the code to work correctly.  Here's the code:
>
> import numpy as np
> from datasketches import kll_floatarray_sketches
> k = 160
> d = 3
> kll = kll_floatarray_sketches(k, d)
> for i in range(1e6):
>   kll.update(np.random.randn(d))
>
> And here's the error:
>
> TypeError: 'float' object cannot be interpreted as an integer
>
> Seems like the inputs have changed, but the inputs in the code look pretty
> similar.  Can you point out what I'm doing wrong here?
>
> Michael
> ------------------------------
> *From:* Jon Malkin <jo...@gmail.com>
> *Sent:* Friday, May 22, 2020 6:21 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Hi Michael,
>
> My default is to treat an input vector x as a column vector
> -- the generic quadratic form x^T A x assumes that, for instance. But might
> be an engineering thing. Following your approach for now and eventually we
> can debate whether to transpose the matrix if one dimension matches the
> number of sketches in the object but not the expected one.
>
> Anyway, I looked more at the docs and see them using unchecked references
> (after doing a bounds check) so I switched to that, and then I added in a
> check for c-style vs fortran-style indexing so that I believe it'll have
> the inner loop over the native dimension. In theory it'll walk linearly
> through the matrix. That or I got it exactly backwards and am thrashing
> some cache level, one of the two :D
>
> If you have some time, please check out the branch and play with it for a
> bit to ensure it's still behaving as you expect. Then we can figure out
> some relevant unit tests.
>
>   jon
>
>
>
> On Fri, May 22, 2020 at 7:06 AM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Jon,
>
> Those changes sound great, as long as the data is being accessed
> correctly. The pybind docs warn about accessing data through the array_t
> object since it's not guaranteed to be contiguous in memory.  Typically,
> they demonstrate accessing it through the buffer, which I followed.  But if
> this is an unnecessary step, then great.
>
> As for the 2D case, here is my line of thinking.  For 1D, we have a single
> row with d values.  So for 2D, we'd have n rows with d values, (n x d).  I
> believe that is how I coded it, but it's possible I flipped the dimensions.
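>
> In numpy terms (illustrative only), that (n x d) convention looks like:
>
> import numpy as np
> batch = np.random.randn(5, 3)           # n = 5 points, d = 3 dimensions, one row per point
> print(batch.flags['C_CONTIGUOUS'])      # True: numpy's default row-major layout
> print(batch.T.flags['C_CONTIGUOUS'])    # False: a transposed view is Fortran-ordered,
>                                         # hence the c-style vs fortran-style check above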
>
> Michael
>
> ------------------------------
> *From:* Jon Malkin <jo...@gmail.com>
> *Sent:* Thursday, May 21, 2020 7:17 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Michael,
>
> I've restructured the object to be an actual C++ object with proper
> methods. And then I've gotten rid of all the casts to buffer in favor of
> just using the py::array_t<> that's passed in. That removes casting
> everything to double, and allows for range checks. Now an attempt to access
> sketch 7 in a 5-d array doesn't just segfault :)
>
> Looking at pybind docs a bit more, it seems there are no hard guarantees
> on data layout in memory with numpy arrays -- if you transpose one, walking
> through with a pointer will return items in the wrong order. So update()
> ends up using items.at() instead (more on that in a moment). The whole thing is probably also
> copying values around more than necessary. Anyway, we can look at ways to
> optimize such things eventually, but for now I'm working on ensuring
> correctness and at least somewhat graceful failure.
>
> Anyway, item input order. If we have 1-d input, we implicitly assume we
> want d updates, one for each dimension in the object. It seems like the
> default for numpy is row-major order, which makes sense given C beneath the
> hood. But for inputting n points at a time, do you expect the matrix to be
> (d x n) or (n x d)?
>
>   jon
>
> On Tue, May 19, 2020, 5:20 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Re: the template type A, I set that for the Python array data type.  A
> Python float is 64 bits, so that is a C++ double.  I thought it was
> necessary to set the py::array_t data type since I think it's a template,
> but I could be mistaken.
>
> Michael
>
> ------------------------------
> *From:* leerho <le...@gmail.com>
> *Sent:* Tuesday, May 19, 2020 7:46 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Excellent work!
>
> On Tue, May 19, 2020 at 4:04 PM Jon Malkin <jo...@gmail.com> wrote:
>
> I also used k=160, so in this case we matched nicely. And the bunches of
> 2^5 or 2^7 you were testing is exactly what I meant when referring to
> batched inputs. So that's good news.
>
> I'll take a more careful look through the code -- there was something with
> update using arrays of templated type A which was always cast to double,
> for instance. But this is certainly promising.
>
>   jon
>
> On Tue, May 19, 2020 at 3:32 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Great tests (especially with the ordering), Jon!
>
> I did some scaling tests for dimensionality (1, 10, and 100), and this is
> where I think the Numpy version shows its benefits.  I performed a test
> similar to your setup:
> - each sketch has k = 160 (unsure what you used for this value, if it
> matters)
> - 2^25 draws from a normalized Gaussian distribution (numpy.random.normal)
> - get_quantiles(0.5)
>
> d=1    -- 84 s (this is the 123 s case you ran)
> d=10   -- 88 s
> d=100  --  294 s
> d=1000 -- 2298 s (did this one for fun, but there is a lot of variability
> in runtime)
>
> Note that I did not use a single-value method, just the Numpy version.
> Also, I checked the compute cost of the Python loop, and it's about 1
> second, so most of that ~80 seconds is the communication between Python and
> C++.  The scaling relation looks to be better than linear, but there needs
> to be a few more tests here to really determine that.
>
> But, as Lee pointed out, there is non-negligible overhead from crossing
> the bridge between Python and C++.  It's small, but when doing it 2^25
> times it adds up.  The Numpy implementation allows you to cross that bridge
> much less often, albeit at the cost of some extra time programming that
> part.  If I set up a queue that holds 2^5 values and then updates it, it's
> quite a bit better.  Here are the results for the same dimensions as before:
>
> d=1   -- 8 s
> d=10  -- 31 s
> d=100 -- 257 s
>
> So, even with a small queue of 32 values, we see that a single sketch
> using kll_sketches is faster than a kll_sketch by a factor of 2-3.  And
> with the batch set to 2^7 values (this is how I use it in my project):
> d=1   -- 4.2 s
> d=10  -- 27 s
> d=100 -- 251 s
>
> The speed gain doesn't seem to scale with dimensionality, but I think that
> has more to do with the compute overhead of generating the data since Numpy
> tends to be faster when working in 1D vs multiple dimensions.  But we can
> see that it's possible to get runtimes much closer to C++ runtimes than
> would be expected.
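>
> For concreteness, the batching described above looks roughly like this (class
> name and the 2-d update are as in this branch; exact signatures may differ):
>
> import numpy as np
> from datasketches import kll_floatarray_sketches
>
> k, d, batch = 160, 10, 2**7
> kll = kll_floatarray_sketches(k, d)
> queue = np.empty((batch, d))
> filled = 0
> for _ in range(2**15):
>     queue[filled] = np.random.randn(d)
>     filled += 1
>     if filled == batch:
>         kll.update(queue)           # one Python/C++ crossing per 2**7 rows
>         filled = 0
> if filled:
>     kll.update(queue[:filled])      # flush the remainder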
>
> Michael
> ------------------------------
> *From:* Jon Malkin <jo...@gmail.com>
> *Sent:* Tuesday, May 19, 2020 4:58 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Well, one thought was maybe we could always use the vectorized kll in
> python and make it (relatively) easy to have it work with only 1 dimension.
> It looks like there's still a non-trivial performance hit from that. But
> wow.. I realized I could try something simple like reversing the
> declaration order of single-update vs vector-update in the wrapper class.
> And that dropped it to 35s!
>
> With that, it may be worth exploring a unified wrapper that handles single
> items or vectors.
>
>   jon
>
> On Tue, May 19, 2020 at 1:52 PM leerho <le...@gmail.com> wrote:
>
> We had a similar issue in Java trying to use JNI to access C code.  Every
> transition across the "boundary" between Java and C took from 10 to 100
> microseconds.  This made the JNI option pretty useless from our
> standpoint.
>
> I don't know python that well, but I could well imagine that there may be
> a similar issue here in moving data between Python and C++.
>
> That being said, compared to brute-force computation of these types of
> queries in Python vs using even these (what we consider slow performing)
> sketches in Python still may be a huge win.
>
> Lee.
>
>
>
> On Tue, May 19, 2020 at 1:28 PM Jon Malkin <jo...@gmail.com> wrote:
>
> I tried comparing the performance of the existing floats sketch vs the new
> thing with a single dimension. And then I made a second update method that
> handles a single item rather than creating an array of length 1 each time.
> Otherwise, the scripts were as identical as possible. I fed in 2^25
> gaussian-distributed values and queried for the mean to force some
> computation on the sketch. I think get_quantile(0.5) vs
> get_quantiles(0.5)[0][0] was the only difference.
>
> Existing kll_floats_sketch: 31s
> kll_floatarray_sketches: 123s
> with single-item update: 80s
>
> Same test in c++: 1.7s  (I can get it to 1.4s but that's using a worse RNG
> so this seemed more fair)
>
> I didn't try anything with trying to batch updates, even though in theory
> the new object can support that. This was more a test to see the
> performance impact of using it for all kll sketches.
>
> At some level, if you're already ok taking the speed hit for python vs C++
> then maybe it doesn't matter. But >2x still seems significant to me.
>
>   jon
>
> On Thu, May 14, 2020 at 6:54 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Great, I'll be submitting the pull request shortly.  The codebase I'm
> working with doesn't have any of the changes made in the past week or so,
> hopefully that isn't too much of a hassle to merge.
>
> As an aside, my employer encourages us to contribute code to libraries
> like this, so I'm happy to work on additional features for the Python
> interface as needed.
>
> Michael
> ------------------------------
> *From:* Jon Malkin <jo...@gmail.com>
> *Sent:* Thursday, May 14, 2020 6:56 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> We've been polishing things up for a release, so that was one of several
> things that we fixed over the last several days. Thank you for finding it!
>
> Anyway, if you're generally happy with the state of things (and are
> allowed to under any employment terms), I'd encourage you to create pull
> request to merge your changes into the main repo. It doesn't need to be
> perfect as we can always make changes as part of the PR review or
> post-merge.
>
> Thanks,
>   jon
>
>
> On Mon, May 11, 2020 at 2:25 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Thanks for taking a look, Jon.
>
> I pushed an update that address 2 & 4.
>
> #3 is actually something I had a question about. I've tested passing
> numpy.nan into the update function, and it doesn't appear to break anything
> (min, max, etc all still work correctly).  However, the reported number of
> items per sketch counts the nan entries.  Is this the expected behavior, or
> should the get_n() method return a number that does not count the nans it
> has seen?  I expected the latter, so I'm worried that numpy's nan is being
> treated differently.
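>
> For reference, a minimal version of that test looks like this (constructor
> arguments are illustrative; the printed value is exactly the open question,
> so nothing here is asserted):
>
> import numpy as np
> from datasketches import kll_sketches    # wrapper class from this branch
>
> kll = kll_sketches(160, 3)
> kll.update(np.array([1.0, np.nan, 2.0]))
> kll.update(np.array([3.0, 4.0, 5.0]))
> print(kll.get_n())                       # does the middle sketch report 1 item or 2?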
>
> Michael
> ------------------------------
> *From:* Jon Malkin <jo...@gmail.com>
> *Sent:* Monday, May 11, 2020 4:32 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> I didn't look in super close detail, but the code overall looks pretty
> good. Comments are below.
>
> Note that not all of these necessarily need changes or replies. I'm just
> trying to document things we'll want to think about for keeping the library
> general-purpose (and we can always make changes after merging, of course).
>
> 1. I worry the name kll_sketches is confusingly similar to kll_sketch.
> Maybe vector_kll_sketches? But if there's a way to extend KLL in the future
> to operate on an entire vector at a time (vs treating each dimension
> independently) that'd become confusing. I think an inherently vectorized
> version would be a very different beast, but I always worry I'm not being
> imaginative enough. If merging into the Apache codebase, I'd probably wait
> to see what the file looks like with the renaming before a final decision
> on moving to its own file.
>
> 2. What happens if the input to update() has >2 dimensions? If that'd be
> invalid, we should explicitly check and complain. If it'll Do The Right
> Thing by operating on the first 2 dimensions (meaning correct indices)
> that's fine, but otherwise should probably complain.
>
> 3. Can this handle sparse input vectors? Not sure how important that is in
> general, even if your project doesn't require it. kll_sketch will ignore
> NaNs, so those appearing would mean the number of items per sketch can
> already differ.
>
> 4. I'd probably eat the very slightly increased space and go with 32 bits
> for the number of dimensions (aka number of sketches). If trying to look at
> a distribution of values for some machine learning application, it'd be
> easy to overflow 65k dimensions for some tasks.
>
> 5. I imagine you've realized that it's easiest to do unit tests from
> python in this case. That's another advantage of having this live in the
> wrapper.
>
> 6. Finally, that assert issue is already obsolete :). Asserts were
> converted to if/throw exceptions late last week. It'll be flagged as a
> conflict in merging, so no worries for now.
>
> Looking good at this point. And as I said, not all of these need changes
> or comments from you.
>
>   jon
>
> On Mon, May 11, 2020 at 7:09 AM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Understood, I went ahead and moved the new class to the kll_wrapper.cpp
> file -- I'll leave it to you to decide if it's better as its own file.
>
> Also, while gcc 7.4.0 compiles the code without issue, using gcc 7.5.0
> throws errors regarding the assert calls in kll_sketch_impl.hpp.  I added
> an include of assert.h there and then it compiled without issue.  It's
> possible that other compilers will also complain about that, so maybe this
> is a good update to the main branch.
>
> Michael
> ------------------------------
> *From:* Jon Malkin <jo...@gmail.com>
> *Sent:* Sunday, May 10, 2020 10:47 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> My only comment without having looked at actual code is that the new class
> would be more appropriate in the python wrapper. Maybe even drop it in as
> it's own file, as that would decrease recompile time a bit when debugging
> (that's pybind's suggestion, anyway). Probably not a huge difference with
> how light these wrappers are.
>
> If this is something that becomes widely used, to the point where we look at
> pushing it into the base library, we'd look at whether we could share any
> data across sketches. But we're far from that point currently. It'd be a
> nice problem to need to consider that.
>
>   jon
>
> On Sun, May 10, 2020, 7:33 PM leerho <le...@gmail.com> wrote:
>
> Michael,  this has been a great interchange and certainly will allow us to
> move forward more quickly.
>
> Thank you for working on this on a Mother's Day Sunday!
>
> I'm sure Alex and Jon may have more questions, when they get a chance to
> look at it starting tomorrow.
>
> Cheers, and be safe and well!
>
> Lee.
>
> On Sun, May 10, 2020 at 6:25 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Re: testing, so far I've just done glorified unit tests for uniform and
> normal distributions of varying sizes.  I plan to do some timing tests vs
> the existing single-sketch Python class to see how it compares for 1, 10,
> and 100 streams.
>
> 1. That makes sense.  One option to allow full Numpy compatibility but
> without requiring a Python user to use Numpy would be to return everything
> as lists, rather than Numpy arrays.  Numpy users could then convert those
> lists into arrays, and non-Numpy users would be unaffected (aside from
> needing the pybind11/numpy.h header).  Alternatively, some flag could be
> set when instantiating the object that would control whether things are
> returned as lists or arrays, though this still requires the numpy.h header
> file.
>
> 2. I didn't change the kll_sketch code, I only defined a new (wrapper)
> class called kll_sketches, which spawns a user-specified number of
> sketches.  Each of those sketches is a kll_sketch object and uses all of
> the existing code for that.  For fast execution in Python, the parallel
> sketches must be spawned in C++, but the existing Python object could only
> spawn a single sketch since it wraps the kll_sketch class.  Perhaps the
> kll_sketches class would be better placed in the python/src/kll_wrapper.cpp
> file?  I suppose you wouldn't need this class if you weren't using Python.
>
> 3. Yes, SerDe is very straight-forward here.  I've marked some stuff as
> todo's, and that is one of them -- the plan is to do like you described and
> call the relevant kll_sketch method on each of the sketches and return that
> to Python in a sensible format.  For deserialization, it would just iterate
> through them and load them into the kll_sketches object.  I don't require
> it for my project, so I didn't bother to wrap that yet -- I'll take a look
> sometime this week after I finish my work for the day, shouldn't take long
> to do.
>
> 4. That makes sense.  Does using Numpy complicate that at all?  My thought
> is that since under the hood everything is using the existing kll_sketch
> class, it would have full compatibility with the rest of the library (once
> SerDe is added in).
>
> Michael
> ------------------------------
> *From:* leerho <le...@gmail.com>
> *Sent:* Sunday, May 10, 2020 8:42 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Thanks for the link to your code.  My colleagues, Jon and Alex, will take
> a closer look this next week.  They wrote this code so they are much closer
> to it than I.
>
> What you have done so far makes sense for you as you want to get this
> working in the NumPy environment as quickly as possible.  As soon as we
> start thinking about incorporating this into our library other concerns
> become important.
>
> 1. Adding API calls is the recommended way to add functionality (like
> NumPy) to a library.  We cannot change API calls in a way that is only
> useful with NumPy, because it would seriously impact other users of the
> library that don't need NumPy.  If both sets of calls cannot simultaneously
> exist in the same sketch API, then we need to consider other alternatives.
>
> 2.  Based on our previous discussions, I didn't envision that you would
> have to change the kll_sketch code itself other than perhaps a "wrapper"
> class that enables vectorized input to a vector of sketches and a
> vectorized get result that creates a vector result from a vector of
> sketches.  This would isolate the changes you need for NumPy from the
> sketch itself.  This is also much easier to support, maintain and debug.
>
> 3. If you don't change the internals of the sketch then SerDe becomes
> pretty straightforward. I don't know if you need a single serialization
> that represents a full vector of sketches,  but if you do, then I would
> just iterate over the individual serdes and figure out how to package it.
> I really don't think you want to have to rewrite this low-level stuff.
>
> 4. Binary compatibility is critically important for us and I think will be
> important for you as well.  There are two dimensions of binary
> compatibility: history and language.  This means that a kll sketch
> serialized from Java, can be successfully read by C++ and vice versa.
> Similarly, a kll sketch serialized today will be able to be read many years
> from now.     Another aspect of this would mean being able to collect, say,
> 100 sketches that were not created using the NumPy version, and being able
> to put them together in a NumPy vector; and vice versa.
>
> I hope all of this makes sense to you.
>
> Cheers,
>
> Lee.
>
>
>
> On Sun, May 10, 2020 at 4:21 PM leerho <le...@gmail.com> wrote:
>
> Michael,
> This is great!  What testing have you been able to do so far?
>
>
> On Sun, May 10, 2020 at 3:31 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Lee,
>
> Thanks for all of that information, it's quite helpful to get a better
> understanding of things.
>
> I've put the code on Github if you'd like to take a look:
> https://github.com/mdhimes/incubator-datasketches-cpp
>
> Changes are
> - new class in kll/include/kll_sketch.hpp, w/ associated constructor in
> kll/include/kll_sketch_impl.hpp.  This class spawns a specified number of
> sketches.
> - new Python interface functions in python/src/kll_wrapper.cpp
>
> The only new dependency introduced is the pybind11/numpy.h header file.
> The new Numpy-compatible Python classes retain identical functionality to
> the existing classes (with minor changes to method names, e.g.,
> get_min_value --> get_min_values), except that I have not yet implemented
> merging or (de)serialization.  These would be straight-forward to
> implement, if needed.
>
> Re: characterization tests, I'll take a look at those tests you linked to
> and see about running them, time and compute resources permitting.
>
> Michael
> ------------------------------
> *From:* leerho <le...@gmail.com>
> *Sent:* Sunday, May 10, 2020 5:32 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Michael,
>
> Is there a place on GitHub somewhere where I could look at your code so
> far?  The reason I ask, is before you do a PR, we would like to determine
> where a contribution such as this should be placed.
>
> Our library is split up among different repositories, determined by
> language and dependencies.  This keeps the user downloads smaller and more
> focused.   We have two library repos for the core sketch algorithms, one
> for Java and one for C++/Python, where the dependencies are very lean,
> which simplifies integration into other systems.  We have separate repos
> for adaptors, which depend on one of the core repos. On the Java side, we
> have separate repos for adaptors for Apache Hive and Apache Pig, as the
> dependencies for each of these are quite large.  For C++, we have a
> dedicated repo for the adaptors for PostgreSQL.
>
> Some of our adaptors are hosted with the target system.  For example, our
> Druid adaptors were contributed directly into Apache Druid.
>
> I assume your code has dependencies on Python, NumPy and DataSketches-cpp.
> It is not clear to me at the moment whether we should create a separate
> repo for this or have a separate group of directories in our cpp repo.
>
> ****
> We have a separate repo for our characterization code, which is not
> formally "released" as an Apache release.  It exists because we want others
> to be able to reproduce (or challenge) our claims of speed performance or
> accuracy.  It is the one repo where we have all languages and many
> different dependencies.  The coding style is not as rigorous or as well
> documented as our repos that do have formal releases.
>
> Characterization testing is distinctly different from Unit Tests, which
> basically checks all the main code paths and makes sure that the program
> works as it should.  The key metric is code coverage and Unit Tests should
> be fast as it is run on every check-in of new code.  Characterization is
> also different from Integration Testing, which is testing how well the code
> works when integrated into larger systems.
>
> Characterization tests are unique to our kind of library. Because our
> algorithms are probabilistic in nature, in order to verify accuracy or
> speed performance we need to run many thousands of trials to eliminate
> statistical noise in the results.  And when the data is large, this can
> take a long time.  You can peruse our website for many examples as all the
> plots result from various characterization studies.  What appears on the
> website is but a small fraction of all the testing we have done.
>
> There are no "standard" tests as every sketch is different so we have to
> decide what is important to measure for a particular sketch, but the basic
> groups are *speed* and *accuracy*.
>
> For speed there are many possible measurements, but the basic ones are
> update speed, merge speed, Serialization / Deserialization speed, get
> estimate or get result speeds.
>
> For accuracy we want to validate that the sketch is performing within the
> bounds of the theoretical error distribution.  We want to measure this
> accuracy in the context of a stand-alone, purely streaming sketch and also
> in the context of merging many sketches together.
>
> We also try to do these same tests comparing the results against other
> alternatives users might have.  We have performed these same
> characterizations on other publicly available sketches as well as against
> traditional, brute-force approaches to solving the same problem.
>
> For the solution you have developed, we would depend on you to decide what
> properties would be most important to characterize for users of this
> solution.  It should be very similar to what you would write in a paper
> describing this solution;  you want to convince the reader that this is
> very useful and why.
>
> Since the first sketch you have leveraged is the KLL quantiles sketch, I
> would think you would want some characterizations similar to what we did
> for our studies
> <https://datasketches.apache.org/docs/Quantiles/KLLSketch.html>
> comparing our older quantiles sketch and the KLL sketch.
>
> ****
> For the Java characterization tests, we have "standardized" on having
> small configuration files which define the key parameters of the test.
> These are simple text files
> <https://github.com/apache/incubator-datasketches-characterization/tree/master/src/main/resources>
> of key-value pairs.  We don't have any centralized definition of these
> pairs, just that they are human readable and intelligible.  They are
> different for each type of sketch.
>
> For the C++ tests, we don't have a collection of config files yet (this is
> one of our TODOs), but the same kind of parameters are set in the code
> itself.
>
> We will likely want to set up a separate directory for your
> characterization tests.
>
> I hope you find this helpful.
>
> Cheers,
>
> Lee.
>
> On Sun, May 10, 2020 at 10:05 AM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> The code is in a good state now.  It can take individual values, lists, or
> Numpy arrays as input, and it returns back Numpy arrays.  There are some
> additional features, like being able to specify which sketches the user
> wants to, e.g., get quantiles for.
>
> But, I have only done minor testing with uniform and normal
> distributions.  I'd like to put it through more extensive testing (and some
> documentation) before releasing it, and it sounds like your
> characterization tests are the way to go -- it's not science if it's not
> reproducible!  Is there a standard set of tests for this purpose?  If not,
> are there standard tests that have been used for the existing codebase?
>
> Michael
> ------------------------------
> *From:* leerho <le...@gmail.com>
> *Sent:* Saturday, May 9, 2020 7:21 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> This is great.  The first step is to get your project working!  Once you
> think you are ready, it would be really useful if you could do some
> characterization testing in the NumPy environment. Characterization tests
> are what we run to fully understand how a sketch performs over a range of
> parameters and using thousands to millions of trials.  You can see some of
> the accuracy and speed performance plots of various sketches on our
> website.  Sometimes these can take hours to run.  We typically use
> synthetic data to drive our characterization tests to make them
> reproducible.
>
> Real data can also be used and one comparison test I would recommend is
> comparing how long it takes to get approximate results using sketches
> versus how long it would take to get exact results using brute force
> methods.  The bigger the data set is the better :)
>
> We don't have much experience with NumPy so this will be a new environment
> for us.  But before you get too deep into this please get us involved.  We
> have been characterizing these streaming algorithms for a number of years,
> and would like to help you.
>
> Cheers,
>
> Lee.
>
> On Sat, May 9, 2020 at 2:18 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> I'm not quite sure what being a committer entails, but yeah I'm happy to
> contribute.  I can't commit a lot of time to working on it, but with how
> things went for KLL I don't think it will take a lot of time for the other
> sketches if they are formatted in a similar manner.  Getting this library
> integrated into numpy/scipy would be awesome, I'm sure I could get some
> others in my field to begin using it.
>
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Saturday, May 9, 2020 5:06 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>; dev@datasketches.apache.org
> <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> This is just awesome!   Would you be interested in becoming a committer on
> our project?  It is not automatic, but we could work with you to bring you
> up to speed on the other sketches in the library.  If you could help us
> integrate DataSketches into NumPy and possibly SciPy (not sure if this is
> necessary) it would be a very significant contribution and we would
> definitely want you to be part of our community!
>
> Thanks,
>
> Lee.
>
> On Sat, May 9, 2020 at 1:41 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Lee,
>
> Thanks for the notice, I went ahead and subscribed to the list.
>
> As for Jon's email, this is actually what I have currently implemented!
> Once I finish ironing out a couple improvements, I'm going to move some
> code around to follow the existing coding style, put it on Github, and
> submit a pull request.
>
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Saturday, May 9, 2020 4:22 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>
> *Subject:* Fwd: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Hi Michael,
> I don't think you saw this email as I doubt you are subscribed to our
> dev@datasketches.apache.org email list.
>
> We would like to have you as part of our larger community, as others might
> also have suggestions on how to move your project forward.
> You can subscribe by sending an empty email to
> dev-subscribe@datasketches.apache.org.
>
> Lee.
>
> ---------- Forwarded message ---------
> From: *Jon Malkin* <jo...@gmail.com>
> Date: Thu, May 7, 2020 at 4:11 PM
> Subject: Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
> To: <de...@datasketches.apache.org>
> Cc: Lee Rhodes <lr...@verizonmedia.com>, Edo Liberty <
> edo.liberty@gmail.com>, edo@edoliberty.com <ed...@edoliberty.com>
>
>
> We're using pybind11 to get a C++ interface with python (vs raw C). The
> wrappers themselves are quite thin, but they do have examples of calling
> functions defined in the wrapper as opposed to only the sketch object.
>
> I believe the easiest way to do this will be to define a pretty simple C++
> object and create a pybind wrapper for it.  That object would contain a
> std::vector<kll_sketch>.  Then you'd define an update method for your
> custom object that iterates through a numpy array and calls update() on the
> appropriate sketch. You'd also want to define something similar for
> get_quantile() or whatever other methods you need that iterates through
> that vector of sketches and returns the result in a numpy array.
>
> That's a pretty lightweight object. And then you'd use a similar thin
> pybind wrapper around it to make it play nicely with python. Since our C++
> library is just templates, you'd end up with a free-standing library, with
> no requirement that the base datasketches library be involved.
>
>   jon
>
> On Thu, May 7, 2020 at 1:08 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> I would be happy to share whatever I come up with (if anything).  The lack
> of a Numpy/Scipy implementation is what led me to the DataSketches library,
> it would be very useful to myself and others if it were a part of
> Numpy/Scipy.
>
> For what it's worth, passing in a Numpy array and manipulating it from the
> C++ side is quite easy.  On the other hand, figuring out how to spawn m
> sketches and pass the values along to that looks like it'll be more
> challenging, there is a lot of code here and it'll take some time for me to
> familiarize myself with it.
>
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Thursday, May 7, 2020 12:00 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>
> *Cc:* Edo Liberty <ed...@gmail.com>; dev@datasketches.apache.org <
> dev@datasketches.apache.org>; edo@edoliberty.com <ed...@edoliberty.com>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> If you do figure out how to do this, it would be great if you could share
> it with us.  We would like to extend  it to other sketches and submit it as
> an added functionality to NumPy.  I have been looking at the NumPy and
> SciPy libraries and have not found anything close to what we have.
>
> Lee.
>
>
> On Thu, May 7, 2020 at 7:08 AM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Lee, Jon,
>
> Thanks for the information.  I tried to vectorize things this morning and
> ran into that exact problem -- since the offsets can differ, it leads to
> slices of different lengths, which wouldn't be possible to store as a
> single Numpy array.
>
> Lee, your understanding of my problem is spot on.  n vectors of size m,
> where all m elements of each vector are a float (no NaNs or missing
> values).  I am interested in quantiles at rank r for each of the m
> streams.  Only 1 sketch will operate simultaneously, saving/loading the
> sketch is not required (though it would be a nice feature), and sketches
> would not need to be merged (no serialization/deserialization).
>
> Not surprisingly, it looks like your original suggestion of handling this
> on the C++ side is the way to go.  Once I have time to dive into the code,
> my plan is to write something that implements what you described in the
> earlier email.
>
> Thanks,
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Wednesday, May 6, 2020 10:43 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>
> *Cc:* dev@datasketches.apache.org <de...@datasketches.apache.org>; Edo
> Liberty <ed...@gmail.com>; edo@edoliberty.com <ed...@edoliberty.com>
>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Michael,
>
> One of my colleagues, Jon Malkin, pointed out that the vector-KLL will not
> work for another reason and that is for each dimension, choosing whether to
> delete the odd or even values in the compactor must be random and
> independent of the other dimensions.  Otherwise you might get unwanted
> correlation effects between the dimensions.
>
> This is another argument that you should have independent compactors for
> each dimension.  So you might as well stick with individual sketches for
> each dimension.
>
> Lee.
>
> On Wed, May 6, 2020 at 4:39 PM Lee Rhodes <lr...@verizonmedia.com>
> wrote:
>
> Michael,
>
> Allow me to back up for a moment to make sure I understand your problem.
>
> You have a large number of large vectors of the form *V_n = {x_i}:*  *n*
> vectors of size *m*, where *x* is a *number* and *x_i* is the *i*th
> element, or equivalently, the *i*th dimension.
>
> Assumptions:
>
>    - All vectors, *V*, are of the same size *m.*
>    - All elements, *x_i*, are valid numbers of the same type. No missing
>    values, and if you are using *floats*, this means no *NaN*s.
>
> In aggregate, the *n* vectors represent *m* *independent* distributions
> of values.
>
> Your task is to be able to obtain *m* quantiles at rank *r* in a single
> query.
>
> ****
> To do this, using your idea, would require vectorization of the entire
> sketch and not just the compactors.  The inputs are vectors, the result of
> operations such as getQuantile(r), getQuantileUpperBound(r),
> getQuantileLowerBound(r), are also vectors.
>
> This sketch will be a large data structure, which leads to more questions
> ...
>
>    - Do you anticipate having many of these vectorized sketches operating
>    simultaneously?
>    - Is there any requirement to store and later retrieve this sketch?
>    - Or, the nearly equivalent question: Do you require merging of these
>    sketches (across clusters, for example)?  Which also means serialization
>    and deserialization.
>
> I am concerned that this vector-quantiles sketch would be limited in the
> sense that it may not be as widely applicable as it could be.
>
> Our experience with real data is that it is ugly with missing values, NaN,
> nulls, etc.  Which means we would not be able to vectorize the compactor.
> Each dimension *i* would need a separate independent compactor because
> the compaction times will vary depending on missing values or NaNs in the
> data.
>
> Spacewise, I don't think having separate independent sketches for each
> dimension would be much smaller than vectorizing the entire sketch, because
> the internals of the existing sketch are already quite space efficient
> leveraging compact arrays, etc.
>
> As a first step I would favor figuring out how to access the NumPy data
> structure on the C++ side, having individual sketches for each
> dimension, and doing the iterations updating the sketches in C++.   It also
> has the advantage of leveraging code that exists and it would automatically
> be able to leverage any improvements to the sketch code over time.  In
> addition, it could be a prototype of how to integrate other sketches into
> the NumPy ecosystem.
>
> A fully vectorized sketch would be a separate implementation and would not
> be able to take advantage of these points.
>
> Lee.
>
> On Wed, May 6, 2020 at 2:47 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Lee,
>
> I don't think there is a problem with the DataSketches library, just that
> it doesn't support what I am trying to do -- looking in the documentation,
> it only supports streams of ints or floats, and those situations work fine
> for me.  Here's what I did:
> - began with the KLL test .py file:
> https://github.com/apache/incubator-datasketches-cpp/blob/master/python/tests/kll_test.py
> - replaced line 30 with kll.update(np.ones(10) * randn())  to have a Numpy
> array of 10 identical values.
> - ran the code
>
> This leads to the following error, as expected:
> TypeError: update(): incompatible function arguments. The following
> argument types are supported:
>     1. (self: datasketches.kll_floats_sketch, item: float) -> None
>
> Invoked with: <datasketches.kll_floats_sketch object at 0x7f1e128989d0>,
> array([-1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424,
>        -1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424])
>
> It's not coded to support Numpy arrays, therefore it complains.  What I
> would ideally like to have happen in this scenario is it would treat each
> element in the array as a separate stream.  Then, later when getting a
> given quantile, it would give 10 values, one for each stream.  I don't see
> an easy approach to implementing this on the Python side besides a very
> slow iterative approach, and admittedly my C++ is quite rusty so I haven't
> looked into the codebase to see how I might modify things there to support
> this functionality.
>
> Re: the streaming-quantiles code being easily modified, I believe the only
> necessary changes would be changing the Compactor class to be a subclass of
> numpy.ndarray, rather than list, and implementing methods for the
> list-specific methods that are used, like .append().  Then, it isn't
> necessary to loop over the streams since we can make use of Numpy's
> broadcasting, which will handle the looping in its C++ code, as you
> mentioned.  I'll work on this and see if it really is as straight-forward
> as it seems.
>
> If you have any advice on how to use DataSketches for my problem, I'm
> certainly open to that.
>
> Thanks,
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Wednesday, May 6, 2020 4:37 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>; dev@datasketches.apache.org
> <de...@datasketches.apache.org>
> *Cc:* Edo Liberty <ed...@gmail.com>; edo@edoliberty.com <
> edo@edoliberty.com>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Michael,
>
> Thank you for considering the DataSketches library.   I am adding this
> thread to our dev@datasketches.apache.org so that our whole team can
> contribute to finding a solution for you.
>
> WRT the error you experienced, please help us help you by sharing with us
> what the exact error was.
>
> We are about to release a major upgrade to the DataSketches C++/Python
> product in the next few weeks.  We have fixed a number of stability issues
> and bugs, which may solve the problem.  Nonetheless, we want to work with
> you to get your problem solved.
>
> Updating 1e5 sketches in a system is not a problem in Java or C++.   We
> have real-time systems today that generate and process over 1e9 sketches
> every day.  Unfortunately our experience tells us that looping in Python
> code will be 10 to 100 times slower than Java or C++.  This is because the
> code would have to switch from Python to C++ for every vector element.
>
> By comparison, the streaming-quantiles code could be easily modified to
> use Numpy arrays and operate on vectors.
>
>
> I would like to understand more about what you have in mind that would be
> "easily modified".
>
> NumPy achieves its speed performance by doing all of the matrix operations
> in pre-compiled C++ code.  To achieve best performance, we would want to
> read and loop through the NumPy data structure on the C++ side leveraging
> the C++ DataSketches library directly.  I am not sure what would be
> involved to actually accomplish that.
>
> But first we need to get your Python + NumPy code working correctly with
> our library so we can find out what its actual performance is.
>
> Cheers,
>
> Lee.
>
>
>
>
>
> On Wed, May 6, 2020 at 12:10 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Edo, Lee,
>
> Thanks for the prompt response.  I looked at the datasketches library, and
> while it seems to have a lot more features, it looks like it'll be a lot
> more difficult to get it to work for my desired use case.
>
> My problem is that I need quantiles for each element of a vector (length
> on the order of 1e4 -- 1e5), for some finite stream of vectors (on the
> order of 1e6 -- 1e8).  I tried using datasketches's KLL with Numpy arrays,
> but it throws an error, so it doesn't seem like datasketches handles this
> situation currently.
>
> To use datasketches, I think I would need to instantiate 1 object per
> vector element, and I suspect this will slow things down considerably due
> to iterating over the objects when each vector is processed.  By
> comparison, the streaming-quantiles code could be easily modified to use
> Numpy arrays and operate on vectors.  I ran a few unit tests on both codes
> and found equivalent behavior, as expected.
>
> Do you have any recommendation(s) for this situation?  Are there known
> limitations of the streaming-quantiles code that would cause issues for my
> use case?  Are the other methods offered in datasketches 'better' than the
> KLL implemented in streaming-quantiles?  I'm quite out of my area of
> expertise, so I appreciate any advice you can offer, and I will of course
> acknowledge it in the publication.
>
> Best,
> Michael
>
> ------------------------------
> *From:* Edo Liberty <ed...@gmail.com>
> *Sent:* Tuesday, May 5, 2020 8:09 PM
> *To:* Lee Rhodes <lr...@verizonmedia.com>; Michael Himes <
> mhimes@knights.ucf.edu>
> *Cc:* edo@edoliberty.com <ed...@edoliberty.com>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> +Lee
>
> Hi Michael, Thanks for reaching out.
> While you can certainly do that, I recommend using the python-Binded
> datasketches library. It will be more robust, faster, and bug free than my
> code :)
>
> On Tue, May 5, 2020 at 14:11 Michael Himes <mh...@knights.ucf.edu> wrote:
>
> Hi Edo,
>
> I'm currently working on a Python package for machine-learning-accelerated
> exoplanet modeling.  It is free and open source (see here if you're curious
> https://github.com/exosports/HOMER),
> and it's meant purely for reproducible academic research.
>
> I'm adding some new features to the software, and one of them requires
> computing quantiles for a data set that cannot fit into memory.  After
> searching around for different methods to do this, your KLL method seemed
> to be a good option in terms of speed and space requirements.
>
> Rather than reinvent the wheel and code my own implementation of the
> method from scratch, I was wondering if you'd be willing to allow me to use
> your code?  I don't see a license, so I wanted to make sure you're okay
> with this.  I could implement it as a submodule within my repo, or I could
> only include the kll.py file and add some additional comments pointing to
> your repo and such, whichever you prefer.
>
> Best,
> Michael
>
> --
> From my cell phone.
>
>

Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Posted by Michael Himes <mh...@knights.ucf.edu>.
Hi Jon,

Just got around to testing it out.  Maybe I am doing something wrong here, but I can't get the code to work correctly.  Here's the code:

import numpy as np
from datasketches import kll_floatarray_sketches
k = 160
d = 3
kll = kll_floatarray_sketches(k, d)
for i in range(1e6):
  kll.update(np.random.randn(d))
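# note: range() needs an int -- e.g. range(int(1e6)) -- so the float literal 1e6 here is what raises the TypeError below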

And here's the error:

TypeError: 'float' object cannot be interpreted as an integer

Seems like the inputs have changed, but the inputs in the code look pretty similar.  Can you point out what I'm doing wrong here?

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>
Sent: Friday, May 22, 2020 6:21 PM
To: dev@datasketches.apache.org <de...@datasketches.apache.org>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Hi Michael,

My default is to treat an input vector x as a column vector -- the generic quadratic form x' A x assumes that, for instance. But that might be an engineering thing. Following your approach for now, and eventually we can debate whether to transpose the matrix if one dimension matches the number of sketches in the object but not the expected one.

Anyway, I looked more at the docs and saw them using unchecked references (after doing a bounds check), so I switched to that. I also added a check for c-style vs fortran-style indexing so that, I believe, the inner loop runs over the native dimension. In theory it'll walk linearly through the matrix. That, or I got it exactly backwards and am thrashing some cache level -- one of the two :D
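
For illustration, a minimal sketch of that access pattern (hypothetical names, not the actual branch code; it assumes the wrapper holds one kll_sketch<float> per dimension and receives an (n x d) array):

#include <stdexcept>
#include <vector>
#include <pybind11/numpy.h>
#include "kll_sketch.hpp"

namespace py = pybind11;
using datasketches::kll_sketch;

// Feed an (n x d) numpy array into d per-dimension sketches, choosing the loop
// order from the array's memory layout so the inner loop walks contiguous data.
void update_sketches(std::vector<kll_sketch<float>>& sketches,
                     const py::array_t<float>& items) {
  if (items.ndim() != 2 || static_cast<size_t>(items.shape(1)) != sketches.size())
    throw std::invalid_argument("expected an (n x d) array, one column per sketch");
  auto data = items.unchecked<2>();            // bounds checked once, above
  if (items.flags() & py::array::c_style) {    // row-major: inner loop over columns
    for (py::ssize_t i = 0; i < data.shape(0); ++i)
      for (py::ssize_t j = 0; j < data.shape(1); ++j)
        sketches[j].update(data(i, j));
  } else {                                     // fortran-style or strided: inner loop over rows
    for (py::ssize_t j = 0; j < data.shape(1); ++j)
      for (py::ssize_t i = 0; i < data.shape(0); ++i)
        sketches[j].update(data(i, j));
  }
}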

If you have some time, please check out the branch and play with it for a bit to ensure it's still behaving as you expect. Then we can figure out some relevant unit tests.

  jon



On Fri, May 22, 2020 at 7:06 AM Michael Himes <mh...@knights.ucf.edu>> wrote:
Jon,

Those changes sound great, as long as the data is being accessed correctly. The pybind docs warn about accessing data through the array_t object since it's not guaranteed to be contiguous in memory.  Typically, they demonstrate accessing it through the buffer, which I followed.  But if this is an unnecessary step, then great.

As for the 2D case, here is my line of thinking.  For 1D, we have a single row with d values.  So for 2D, we'd have n rows with d values, (n x d).  I believe that is how I coded it, but it's possible I flipped the dimensions.

Michael

________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Thursday, May 21, 2020 7:17 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Michael,

I've restructured the object to be an actual C++ object with proper methods. And then I've gotten rid of all the casts to buffer in favor of just using the py::array_t<> that's passed in. That removes casting everything to double, and allows for range checks. Now an attempt to access sketch 7 in a 5-d array doesn't just segfault :)

Looking at the pybind docs a bit more, it seems there are no hard guarantees on data layout in memory with numpy arrays -- if you transpose one, walking through with a pointer will return items in the wrong order. So update() ends up using items.at() instead (more on that in a moment). The whole thing is probably also copying values around more than necessary. Anyway, we can look at ways to optimize such things eventually, but for now I'm working on ensuring correctness and at least somewhat graceful failure.

Anyway, item input order. If we have 1-d input, we implicitly assume we want d updates, one for each dimension in the object. It seems like the default for numpy is row-major order, which makes sense given C beneath the hood. But for inputting n points at a time, do you expect the matrix to be (d x n) or (n x d)?

  jon

On Tue, May 19, 2020, 5:20 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Re: the template type A, I set that for the Python array data type.  A Python float is 64 bits, so that is a C++ double.  I thought it was necessary to set the py::array_t data type since I think it's a template, but I could be mistaken.
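
For reference, the dtype is indeed a template parameter on py::array_t, and pybind11's default forcecast flag converts an incoming float64 array to the requested element type (at the cost of a copy) before the C++ code sees it.  A tiny illustration, with a made-up function name and a 1-d input assumed:

#include <pybind11/numpy.h>

namespace py = pybind11;

// A numpy float64 array passed from Python arrives here already cast to float,
// because forcecast is the default extra flag on py::array_t.
float first_item(const py::array_t<float>& items) {
  return items.size() > 0 ? items.at(0) : 0.0f;   // at() is bounds-checked
}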

Michael

________________________________
From: leerho <le...@gmail.com>>
Sent: Tuesday, May 19, 2020 7:46 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Excellent work!

On Tue, May 19, 2020 at 4:04 PM Jon Malkin <jo...@gmail.com>> wrote:
I also used k=160, so in this case we matched nicely. And the batches of 2^5 or 2^7 values you were testing are exactly what I meant when referring to batched inputs. So that's good news.

I'll take a more careful look through the code -- there was something with update using arrays of templated type A which was always cast to double, for instance. But this is certainly promising.

  jon

On Tue, May 19, 2020 at 3:32 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Great tests (especially with the ordering), Jon!

I did some scaling tests for dimensionality (1, 10, and 100), and this is where I think the Numpy version shows its benefits.  I performed a test similar to your setup:
- each sketch has k = 160 (unsure what you used for this value, if it matters)
- 2^25 draws from a normalized Gaussian distribution (numpy.random.normal)
- get_quantiles(0.5)

d=1    -- 84 s (this is the 123 s case you ran)
d=10   -- 88 s
d=100  --  294 s
d=1000 -- 2298 s (did this one for fun, but there is a lot of variability in runtime)

Note that I did not use a single-value method, just the Numpy version.  Also, I checked the compute cost of the Python loop, and it's about 1 second, so most of that ~80 seconds is the communication between Python and C++.  The scaling relation looks to be better than linear, but a few more tests are needed here to really determine that.

But, as Lee pointed out, there is non-negligible overhead from crossing the bridge between Python and C++.  It's small, but when doing it 2^25 times it adds up.  The Numpy implementation allows you to cross that bridge much less often, albeit at the cost of some extra time programming that part.  If I set up a queue that holds 2^5 values and then performs a single update with them, it's quite a bit better.  Here are the results for the same dimensions as before:

d=1   -- 8 s
d=10  -- 31 s
d=100 -- 257 s

So, even with a small queue of 32 values, we see that a single sketch using kll_sketches is faster than a kll_sketch by a factor of 2-3.  And with the batch set to 2^7 values (this is how I use it in my project):
d=1   -- 4.2 s
d=10  -- 27 s
d=100 -- 251 s

The speed gain doesn't seem to scale with dimensionality, but I think that has more to do with the compute overhead of generating the data since Numpy tends to be faster when working in 1D vs multiple dimensions.  But we can see that it's possible to get runtimes much closer to C++ runtimes than would be expected.

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Tuesday, May 19, 2020 4:58 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Well, one thought was maybe we could always use the vectorized kll in python and make it (relatively) easy to have it work with only 1 dimension. It looks like there's still a non-trivial performance hit from that. But wow.. I realized I could try something simple like reversing the declaration order of single-update vs vector-update in the wrapper class. And that dropped it to 35s!

With that, it may be worth exploring a unified wrapper that handles single items or vectors.
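
Roughly, the ordering point looks like this (stub class and hypothetical module name, not the actual wrapper): pybind11 attempts overloads in the order they were registered with .def(), so the declaration order decides which update() path is tried first.

#include <cstdint>
#include <pybind11/pybind11.h>
#include <pybind11/numpy.h>

namespace py = pybind11;

// Stub standing in for the wrapper class discussed in this thread.
struct kll_sketches {
  kll_sketches(uint16_t k, uint32_t d) { (void) k; (void) d; }
  void update(float item) { (void) item; }                        // single-value path
  void update(const py::array_t<float>& items) { (void) items; }  // vectorized path
};

PYBIND11_MODULE(vector_kll, m) {   // hypothetical module name
  py::class_<kll_sketches>(m, "kll_sketches")
      .def(py::init<uint16_t, uint32_t>(), py::arg("k"), py::arg("d"))
      // Overloads are tried in registration order: with the scalar overload first,
      // single-value updates never attempt the array conversion.  Swapping these
      // two lines reverses which path is tried first.
      .def("update", py::overload_cast<float>(&kll_sketches::update))
      .def("update", py::overload_cast<const py::array_t<float>&>(&kll_sketches::update));
}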

  jon

On Tue, May 19, 2020 at 1:52 PM leerho <le...@gmail.com>> wrote:
We had a similar issue in Java trying to use JNI to access C code.  Every transition across the "boundary" between Java and C took from 10 to 100 microseconds.  This made the JNI option pretty useless from our standpoint.

I don't know python that well, but I could well imagine that there may be a similar issue here in moving data between Python and C++.

That being said, compared to brute-force computation of these types of queries in Python, using even these (what we consider slow-performing) sketches from Python may still be a huge win.

Lee.



On Tue, May 19, 2020 at 1:28 PM Jon Malkin <jo...@gmail.com>> wrote:
I tried comparing the performance of the existing floats sketch vs the new thing with a single dimension. And then I made a second update method that handles a single item rather than creating an array of length 1 each time. Otherwise, the scripts were as identical as possible. I fed in 2^25 gaussian-distributed values and queried for the median (rank 0.5) to force some computation on the sketch. I think get_quantile(0.5) vs get_quantiles(0.5)[0][0] was the only difference.

Existing kll_floats_sketch: 31s
kll_floatarray_sketches: 123s
with single-item update: 80s

Same test in c++: 1.7s  (I can get it to 1.4s but that's using a worse RNG so this seemed more fair)

I didn't try batching updates, even though in theory the new object can support that. This was more a test to see the performance impact of using it for all kll sketches.

At some level, if you're already ok taking the speed hit for python vs C++ then maybe it doesn't matter. But >2x still seems significant to me.

  jon

On Thu, May 14, 2020 at 6:54 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Great, I'll be submitting the pull request shortly.  The codebase I'm working with doesn't have any of the changes made in the past week or so, hopefully that isn't too much of a hassle to merge.

As an aside, my employer encourages us to contribute code to libraries like this, so I'm happy to work on additional features for the Python interface as needed.

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Thursday, May 14, 2020 6:56 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

We've been polishing things up for a release, so that was one of several things that we fixed over the last several days. Thank you for finding it!

Anyway, if you're generally happy with the state of things (and are allowed to under any employment terms), I'd encourage you to create pull request to merge your changes into the main repo. It doesn't need to be perfect as we can always make changes as part of the PR review or post-merge.

Thanks,
  jon


On Mon, May 11, 2020 at 2:25 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Thanks for taking a look, Jon.

I pushed an update that address 2 & 4.

#3 is actually something I had a question about. I've tested passing numpy.nan into the update function, and it doesn't appear to break anything (min, max, etc all still work correctly).  However, the reported number of items per sketch counts the nan entries.  Is this the expected behavior, or should the get_n() method return a number that does not count the nans it has seen?  I expected the latter, so I'm worried that numpy's nan is being treated differently.

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Monday, May 11, 2020 4:32 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

I didn't look in super close detail, but the code overall looks pretty good. Comments are below.

Note that not all of these necessarily need changes or replies. I'm just trying to document things we'll want to think about for keeping the library general-purpose (and we can always make changes after merging, of course).

1. I worry the name kll_sketches is confusingly similar to kll_sketch. Maybe vector_kll_sketches? But if there's a way to extend KLL in the future to operate on an entire vector at a time (vs treating each dimension independently) that'd become confusing. I think an inherently vectorized version would be a very different beast, but I always worry I'm not being imaginative enough. If merging into the Apache codebase, I'd probably wait to see what the file looks like with the renaming before a final decision on moving to its own file.

2. What happens if the input to update() has >2 dimensions? If that'd be invalid, we should explicitly check and complain. If it'll Do The Right Thing by operating on the first 2 dimensions (meaning correct indices) that's fine, but otherwise should probably complain.

3. Can this handle sparse input vectors? Not sure how important that is in general, even if your project doesn't require it. kll_sketch will ignore NaNs, so those appearing would mean the number of items per sketch can already differ.

4. I'd probably eat the very slightly increased space and go with 32 bits for the number of dimensions (aka number of sketches). If trying to look at a distribution of values for some machine learning application, it'd be easy to overflow 65k dimensions for some tasks.

5. I imagine you've realized that it's easiest to do unit tests from python in this case. That's another advantage of having this live in the wrapper.

6. Finally, that assert issue is already obsolete :). Asserts were converted if/throw exceptions late last week. It'll be flagged as a conflict in merging, so no worries for now.

Looking good at this point. And as I said, not all of these need changes or comments from you.

  jon

On Mon, May 11, 2020 at 7:09 AM Michael Himes <mh...@knights.ucf.edu>> wrote:
Understood, I went ahead and moved the new class to the kll_wrapper.cpp file -- I'll leave it to you to decide if it's better as its own file.

Also, while gcc 7.4.0 compiles the code without issue, using gcc 7.5.0 throws errors regarding the assert calls in kll_sketch_impl.hpp.  I added an include of assert.h there and then it compiled without issue.  It's possible that other compilers will also complain about that, so maybe this is a good update to the main branch.

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Sunday, May 10, 2020 10:47 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

My only comment without having looked at actual code is that the new class would be more appropriate in the python wrapper. Maybe even drop it in as its own file, as that would decrease recompile time a bit when debugging (that's pybind's suggestion, anyway). Probably not a huge difference with how light these wrappers are.

If this is something that becomes widely used, to where we look at pushing it into the base library, we'd look at whether we could share any data across sketches. But we're far from that point currently. It'd be nice to need to consider that.

  jon

On Sun, May 10, 2020, 7:33 PM leerho <le...@gmail.com>> wrote:
Michael,  this has been a great interchange and certainly will allow us to move forward more quickly.

Thank you for working on this on a Mother's Day Sunday!

I'm sure Alex and Jon may have more questions, when they get a chance to look at it starting tomorrow.

Cheers, and be safe and well!

Lee.

On Sun, May 10, 2020 at 6:25 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Re: testing, so far I've just done glorified unit tests for uniform and normal distributions of varying sizes.  I plan to do some timing tests vs the existing single-sketch Python class to see how it compares for 1, 10, and 100 streams.

1. That makes sense.  One option to allow full Numpy compatibility but without requiring a Python user to use Numpy would be to return everything as lists, rather than Numpy arrays.  Numpy users could then convert those lists into arrays, and non-Numpy users would be unaffected (aside from needing the pybind11/numpy.h header).  Alternatively, some flag could be set when instantiating the object that would control whether things are returned as lists or arrays, though this still requires the numpy.h header file.

2. I didn't change the kll_sketch code; I only defined a new (wrapper) class called kll_sketches, which spawns a user-specified number of sketches.  Each of those sketches is a kll_sketch object and uses all of the existing code for that.  For fast execution in Python, the parallel sketches must be spawned in C++, but the existing Python object could only spawn a single sketch since it wraps the kll_sketch class.  Perhaps the kll_sketches class would be better placed in the python/src/kll_wrapper.cpp file?  I suppose you wouldn't need this class if you weren't using Python.

3. Yes, SerDe is very straight-forward here.  I've marked some stuff as todo's, and that is one of them -- the plan is to do like you described and call the relevant kll_sketch method on each of the sketches and return that to Python in a sensible format.  For deserialization, it would just iterate through them and load them into the kll_sketches object.  I don't require it for my project, so I didn't bother to wrap that yet -- I'll take a look sometime this week after I finish my work for the day, shouldn't take long to do.

4. That makes sense.  Does using Numpy complicate that at all?  My thought is that since under the hood everything is using the existing kll_sketch class, it would have full compatibility with the rest of the library (once SerDe is added in).

Michael
________________________________
From: leerho <le...@gmail.com>>
Sent: Sunday, May 10, 2020 8:42 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Thanks for the link to your code.  My colleagues, Jon and Alex, will take a closer look this next week.  They wrote this code so they are much closer to it than I.

What you have done so far makes sense for you as you want to get this working in the NumPy environment as quickly as possible.  As soon as we start thinking about incorporating this into our library other concerns become important.

1. Adding API calls is the recommended way to add functionality (like NumPy) to a library.  We cannot change API calls in a way that is only useful with NumPy, because it would seriously impact other users of the library that don't need NumPy.  If both sets of calls cannot simultaneously exist in the same sketch API, then we need to consider other alternatives.

2.  Based on our previous discussions, I didn't envision that you would have to change the kll_sketch code itself other than perhaps a "wrapper" class that enables vectorized input to a vector of sketches and a vectorized get result that creates a vector result from a vector of sketches.  This would isolate the changes you need for NumPy from the sketch itself.  This is also much easier to support, maintain and debug.

3. If you don't change the internals of the sketch then SerDe becomes pretty straightforward. I don't know if you need a single serialization that represents a full vector of sketches, but if you do, then I would just iterate over the individual serdes and figure out how to package it (a rough sketch of one such packaging follows this list).  I really don't think you want to have to rewrite this low-level stuff.

4. Binary compatibility is critically important for us and I think will be important for you as well.  There are two dimensions of binary compatibility: history and language.  This means that a kll sketch serialized from Java, can be successfully read by C++ and visa versa.  Similarly, a kll sketch serialized today will be able to be read many years from now.     Another aspect of this would mean being able to collect, say, 100 sketches that were not created using the NumPy version, and being able to put them together in a NumPy vector; and visa versa.
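
A rough sketch of the packaging idea in point 3 (hypothetical helper names; it assumes the existing per-sketch serialize() / deserialize(bytes, size) methods and leaves the real format design, including a fixed byte order for the length prefixes, to the implementer):

#include <cstdint>
#include <cstring>
#include <vector>
#include "kll_sketch.hpp"

using datasketches::kll_sketch;

// Serialize every per-dimension sketch and length-prefix each blob.
std::vector<uint8_t> serialize_all(const std::vector<kll_sketch<float>>& sketches) {
  std::vector<uint8_t> out;
  for (const auto& s : sketches) {
    const auto bytes = s.serialize();        // existing per-sketch serialization
    const uint64_t len = bytes.size();       // note: host byte order only, for brevity
    const uint8_t* p = reinterpret_cast<const uint8_t*>(&len);
    out.insert(out.end(), p, p + sizeof(len));
    out.insert(out.end(), bytes.begin(), bytes.end());
  }
  return out;
}

// Walk the length prefixes and rebuild the vector of sketches.
std::vector<kll_sketch<float>> deserialize_all(const uint8_t* data, size_t size) {
  std::vector<kll_sketch<float>> sketches;
  size_t pos = 0;
  while (pos + sizeof(uint64_t) <= size) {
    uint64_t len = 0;
    std::memcpy(&len, data + pos, sizeof(len));
    pos += sizeof(len);
    sketches.push_back(kll_sketch<float>::deserialize(data + pos, static_cast<size_t>(len)));
    pos += static_cast<size_t>(len);
  }
  return sketches;
}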

I hope all of this makes sense to you.

Cheers,

Lee.



On Sun, May 10, 2020 at 4:21 PM leerho <le...@gmail.com>> wrote:
Michael,
This is great!  What testing have you been able to do so far?


On Sun, May 10, 2020 at 3:31 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Lee,

Thanks for all of that information, it's quite helpful to get a better understanding of things.

I've put the code on Github if you'd like to take a look: https://github.com/mdhimes/incubator-datasketches-cpp

Changes are
- new class in kll/include/kll_sketch.hpp, w/ associated constructor in kll/include/kll_sketch_impl.hpp.  This class spawns a specified number of sketches.
- new Python interface functions in python/src/kll_wrapper.cpp

The only new dependency introduced is the pybind11/numpy.h header file.  The new Numpy-compatible Python classes retain identical functionality to the existing classes (with minor changes to method names, e.g., get_min_value --> get_min_values), except that I have not yet implemented merging or (de)serialization.  These would be straight-forward to implement, if needed.

Re: characterization tests, I'll take a look at those tests you linked to and see about running them, time and compute resources permitting.

Michael
________________________________
From: leerho <le...@gmail.com>>
Sent: Sunday, May 10, 2020 5:32 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Michael,

Is there a place on GitHub somewhere where I could look at your code so far?  The reason I ask, is before you do a PR, we would like to determine where a contribution such as this should be placed.

Our library is split up among different repositories, determined by language and dependencies.  This keeps the user downloads smaller and more focused.   We have two library repos for the core sketch algorithms, one for Java and one for C++/Python, where the dependencies are very lean, which simplifies integration into other systems.  We have separate repos for adaptors, which depend on one of the core repos. On the Java side, we have separate repos for adaptors for Apache Hive and Apache Pig, as the dependencies for each of these are quite large.  For C++, we have a dedicated repo for the adaptors for PostgreSQL.

Some of our adaptors are hosted with the target system.  For example, our Druid adaptors were contributed directly into Apache Druid.

I assume your code has dependencies on Python, NumPy and DataSketches-cpp. It is not clear to me at the moment whether we should create a separate repo for this or have a separate group of directories in our cpp repo.

****
We have a separate repo for our characterization code, which is not formally "released" as an Apache release.  It exists because we want others to be able to reproduce (or challenge) our claims of speed performance or accuracy.  It is the one repo where we have all languages and many different dependencies.  The coding style is not as rigorous or as well documented as our repos that do have formal releases.

Characterization testing is distinctly different from Unit Tests, which basically check all the main code paths and make sure that the program works as it should.  The key metric there is code coverage, and Unit Tests should be fast since they are run on every check-in of new code.  Characterization is also different from Integration Testing, which is testing how well the code works when integrated into larger systems.

Characterization tests are unique to our kind of library. Because our algorithms are probabilistic in nature, in order to verify accuracy or speed performance we need to run many thousands of trials to eliminate statistical noise in the results.  And when the data is large, this can take a long time.  You can peruse our website for many examples as all the plots result from various characterization studies.  What appears on the website is but a small fraction of all the testing we have done.

There are no "standard" tests as every sketch is different so we have to decide what is important to measure for a particular sketch, but the basic groups are speed and accuracy.

For speed there are many possible measurements, but the basic ones are update speed, merge speed, Serialization / Deserialization speed, get estimate or get result speeds.

For accuracy we want to validate that the sketch is performing within the bounds of the theoretical error distribution.  We want to measure this accuracy in the context of a stand-alone, purely streaming sketch and also in the context of merging many sketches together.

We also try to do these same tests comparing the results against other alternatives users might have.  We have performed these same characterizations on other publicly available sketches as well as against traditional, brute-force approaches to solving the same problem.

For the solution you have developed, we would depend on you to decide what properties would be most important to characterize for users of this solution.  It should be very similar to what you would write in a paper describing this solution;  you want to convince the reader that this is very useful and why.

Since the first sketch you have leveraged is the KLL quantiles sketch, I would think you would want some characterizations similar to what we did for our studies <https://datasketches.apache.org/docs/Quantiles/KLLSketch.html> comparing our older quantiles sketch and the KLL sketch.

****
For the Java characterization tests, we have "standardized" on having small configuration files which define the key parameters of the test.  These are simple text files <https://github.com/apache/incubator-datasketches-characterization/tree/master/src/main/resources> of key-value pairs.  We don't have any centralized definition of these pairs, just that they are human readable and intelligible.  They are different for each type of sketch.

For the C++ tests, we don't have a collection of config files yet (this is one of our TODOs), but the same kind of parameters are set in the code itself.

We will likely want to set up a separate directory for your characterization tests.

I hope you find this helpful.

Cheers,

Lee.

On Sun, May 10, 2020 at 10:05 AM Michael Himes <mh...@knights.ucf.edu>> wrote:
The code is in a good state now.  It can take individual values, lists, or Numpy arrays as input, and it returns back Numpy arrays.  There are some additional features, like being able to specify which sketches the user wants to, e.g., get quantiles for.

But, I have only done minor testing with uniform and normal distributions.  I'd like to put it through more extensive testing (and some documentation) before releasing it, and it sounds like your characterization tests are the way to go -- it's not science if it's not reproducible!  Is there a standard set of tests for this purpose?  If not, are there standard tests that have been used for the existing codebase?

Michael
________________________________
From: leerho <le...@gmail.com>>
Sent: Saturday, May 9, 2020 7:21 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

This is great.  The first step is to get your project working!  Once you think you are ready, it would be really useful if you could do some characterization testing in the NumPy environment. Characterization tests are what we run to fully understand how a sketch performs over a range of parameters and using thousands to millions of trials.  You can see some of the accuracy and speed performance plots of various sketches on our website.  Sometimes these can take hours to run.  We typically use synthetic data to drive our characterization tests to make them reproducible.

Real data can also be used and one comparison test I would recommend is comparing how long it takes to get approximate results using sketches versus how long it would take to get exact results using brute force methods.  The bigger the data set is the better :)

We don't have much experience with NumPy so this will be a new environment for us.  But before you get too deep into this please get us involved.  We have been characterizing these streaming algorithms for a number of years, and would like to help you.

Cheers,

Lee.

On Sat, May 9, 2020 at 2:18 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
I'm not quite sure what being a committer entails, but yeah I'm happy to contribute.  I can't commit a lot of time to working on it, but with how things went for KLL I don't think it will take a lot of time for the other sketches if they are formatted in a similar manner.  Getting this library integrated into numpy/scipy would be awesome, I'm sure I could get some others in my field to begin using it.

Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Saturday, May 9, 2020 5:06 PM
To: Michael Himes <mh...@knights.ucf.edu>>; dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

This is just awesome!   Would you be interested in becoming a committer on our project?  It is not automatic, but we could work with you to bring you up to speed on the other sketches in the library.  If you could help us integrate DataSketches into NumPy and possibly SciPy (not sure if this is necessary) it would be a very significant contribution and we would definitely want you to be part of our community!

Thanks,

Lee.

On Sat, May 9, 2020 at 1:41 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Lee,

Thanks for the notice, I went ahead and subscribed to the list.

As for Jon's email, this is actually what I have currently implemented!  Once I finish ironing out a couple improvements, I'm going to move some code around to follow the existing coding style, put it on Github, and submit a pull request.

Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Saturday, May 9, 2020 4:22 PM
To: Michael Himes <mh...@knights.ucf.edu>>
Subject: Fwd: Permission to use KLL streaming-quantiles code in free open-source academic software

Hi Michael,
I don't think you saw this email as I doubt you are subscribed to our dev@datasketches.apache.org<ma...@datasketches.apache.org> email list.

We would like to have you as part of our larger community, as others might also have suggestions on how to move your project forward.
You can subscribe by sending an empty email to dev-subscribe@datasketches.apache.org<ma...@datasketches.apache.org>.

Lee.

---------- Forwarded message ---------
From: Jon Malkin <jo...@gmail.com>>
Date: Thu, May 7, 2020 at 4:11 PM
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software
To: <de...@datasketches.apache.org>>
Cc: Lee Rhodes <lr...@verizonmedia.com>>, Edo Liberty <ed...@gmail.com>>, edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>


We're using pybind11 to get a C++ interface with python (vs raw C). The wrappers themselves are quite thin, but they do have examples of calling functions defined in the wrapper as opposed to only the sketch object.

I believe the easiest way to do this will be to define a pretty simple C++ object and create a pybind wrapper for it.  That object would contain a std::vector<kll_sketch>.  Then you'd define an update method for your custom object that iterates through a numpy array and calls update() on the appropriate sketch. You'd also want to define something similar for get_quantile() or whatever other methods you need that iterates through that vector of sketches and returns the result in a numpy array.

That's a pretty lightweight object. And then you'd use a similar thin pybind wrapper around it to make it play nicely with python. Since our C++ library is just templates, you'd end up with a free-standing library, with no requirement that the base datasketches library be involved.
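
A minimal end-to-end sketch of that design (hypothetical module and class names, simplified error handling; the kll_sketch calls assume the C++ API discussed in this thread):

#include <cstdint>
#include <stdexcept>
#include <vector>
#include <pybind11/pybind11.h>
#include <pybind11/numpy.h>
#include "kll_sketch.hpp"

namespace py = pybind11;
using datasketches::kll_sketch;

// One kll_sketch<float> per dimension; update() consumes a length-d numpy array.
class vector_kll_sketches {
 public:
  vector_kll_sketches(uint16_t k, uint32_t d) : sketches_(d, kll_sketch<float>(k)) {}

  void update(const py::array_t<float>& items) {
    if (items.ndim() != 1 || static_cast<size_t>(items.shape(0)) != sketches_.size())
      throw std::invalid_argument("expected a 1-d array with one value per sketch");
    auto data = items.unchecked<1>();
    for (size_t j = 0; j < sketches_.size(); ++j)
      sketches_[j].update(data(static_cast<py::ssize_t>(j)));
  }

  // One quantile per dimension, returned to Python as a numpy array.
  py::array_t<float> get_quantiles(double rank) const {
    py::array_t<float> result(static_cast<py::ssize_t>(sketches_.size()));
    auto out = result.mutable_unchecked<1>();
    for (size_t j = 0; j < sketches_.size(); ++j)
      out(static_cast<py::ssize_t>(j)) = sketches_[j].get_quantile(rank);
    return result;
  }

 private:
  std::vector<kll_sketch<float>> sketches_;
};

PYBIND11_MODULE(vector_kll, m) {   // hypothetical module name
  py::class_<vector_kll_sketches>(m, "vector_kll_sketches")
      .def(py::init<uint16_t, uint32_t>(), py::arg("k"), py::arg("d"))
      .def("update", &vector_kll_sketches::update)
      .def("get_quantiles", &vector_kll_sketches::get_quantiles, py::arg("rank"));
}

From Python, such an object would be constructed with (k, d), updated once per incoming vector, and queried for all per-dimension quantiles in one call, much like the kll_sketches wrapper described elsewhere in this thread.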

  jon

On Thu, May 7, 2020 at 1:08 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
I would be happy to share whatever I come up with (if anything).  The lack of a Numpy/Scipy implementation is what led me to the DataSketches library, it would be very useful to myself and others if it were a part of Numpy/Scipy.

For what it's worth, passing in a Numpy array and manipulating it from the C++ side is quite easy.  On the other hand, figuring out how to spawn m sketches and pass the values along to that looks like it'll be more challenging, there is a lot of code here and it'll take some time for me to familiarize myself with it.

Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Thursday, May 7, 2020 12:00 PM
To: Michael Himes <mh...@knights.ucf.edu>>
Cc: Edo Liberty <ed...@gmail.com>>; dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>; edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

If you do figure out how to do this, it would be great if you could share it with us.  We would like to extend it to other sketches and submit it as an added functionality to NumPy.  I have been looking at the NumPy and SciPy libraries and have not found anything close to what we have.

Lee.


On Thu, May 7, 2020 at 7:08 AM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Lee, Jon,

Thanks for the information.  I tried to vectorize things this morning and ran into that exact problem -- since the offsets can differ, it leads to slices of different lengths, which wouldn't be possible to store as a single Numpy array.

Lee, your understanding of my problem is spot on.  n vectors of size m, where all m elements of each vector are a float (no NaNs or missing values).  I am interested in quantiles at rank r for each of the m streams.  Only 1 sketch will operate simultaneously, saving/loading the sketch is not required (though it would be a nice feature), and sketches would not need to be merged (no serialization/deserialization).

Not surprisingly, it looks like your original suggestion of handling this on the C++ side is the way to go.  Once I have time to dive into the code, my plan is to write something that implements what you described in the earlier email.

Thanks,
Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Wednesday, May 6, 2020 10:43 PM
To: Michael Himes <mh...@knights.ucf.edu>>
Cc: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>; Edo Liberty <ed...@gmail.com>>; edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>

Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Michael,

One of my colleagues, Jon Malkin, pointed out that the vector-KLL will not work for another reason: for each dimension, choosing whether to delete the odd or even values in the compactor must be random and independent of the other dimensions.  Otherwise you might get unwanted correlation effects between the dimensions.

This is another argument that you should have independent compactors for each dimension.  So you might as well stick with individual sketches for each dimension.

Lee.

On Wed, May 6, 2020 at 4:39 PM Lee Rhodes <lr...@verizonmedia.com>> wrote:
Michael,

Allow me to back up for a moment to make sure I understand your problem.

You have a large number of large vectors of the form V_n = {x_i}:  n vectors of size m, where x is a number and x_i is the ith element, or equivalently, the ith dimension.

Assumptions:

  *   All vectors, V, are of the same size m.
  *   All elements, x_i, are valid numbers of the same type. No missing values, and if you are using floats, this means no NaNs.

In aggregate, the n vectors represent m independent distributions of values.

Your task is to be able to obtain m quantiles at rank r in a single query.

****
To do this using your idea would require vectorization of the entire sketch and not just the compactors.  The inputs are vectors, and the results of operations such as getQuantile(r), getQuantileUpperBound(r), and getQuantileLowerBound(r) are also vectors.

This sketch will be a large data structure, which leads to more questions ...

  *   Do you anticipate having many of these vectorized sketches operating simultaneously?
  *   Is there any requirement to store and later retrieve this sketch?
  *   Or, the nearly equivalent question: Do you require merging of these sketches (across clusters, for example)?  Which also means serialization and deserialization.

I am concerned that this vector-quantiles sketch would be limited in the sense that it may not be as widely applicable as it could be.

Our experience with real data is that it is ugly, with missing values, NaNs, nulls, etc., which means we would not be able to vectorize the compactor.  Each dimension i would need a separate independent compactor because the compaction times will vary depending on missing values or NaNs in the data.

Spacewise, I don't think a fully vectorized sketch would be much smaller than having separate independent sketches for each dimension, because the internals of the existing sketch are already quite space efficient, leveraging compact arrays, etc.

As a first step I would favor figuring out how to access the NumPy data structure on the C++ side, having individual sketches for each dimension, and doing the iterations updating the sketches in C++.   It also has the advantage of leveraging code that exists and it would automatically be able to leverage any improvements to the sketch code over time.  In addition, it could be a prototype of how to integrate other sketches into the NumPy ecosystem.

A fully vectorized sketch would be a separate implementation and would not be able to take advantage of these points.

Lee.

On Wed, May 6, 2020 at 2:47 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Lee,

I don't think there is a problem with the DataSketches library, just that it doesn't support what I am trying to do -- looking in the documentation, it only supports streams of ints or floats, and those situations work fine for me.  Here's what I did:
- began with the KLL test .py file: https://github.com/apache/incubator-datasketches-cpp/blob/master/python/tests/kll_test.py
- replaced line 30 with kll.update(np.ones(10) * randn())  to have a Numpy array of 10 identical values.
- ran the code

This leads to the following error, as expected:
TypeError: update(): incompatible function arguments. The following argument types are supported:
    1. (self: datasketches.kll_floats_sketch, item: float) -> None

Invoked with: <datasketches.kll_floats_sketch object at 0x7f1e128989d0>, array([-1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424,
       -1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424])

It's not coded to support Numpy arrays, so it complains.  What I would ideally like to have happen in this scenario is for it to treat each element in the array as a separate stream.  Then, later when getting a given quantile, it would give 10 values, one for each stream.  I don't see an easy approach to implementing this on the Python side besides a very slow iterative approach, and admittedly my C++ is quite rusty, so I haven't looked into the codebase to see how I might modify things there to support this functionality.
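
To be explicit, the slow iterative approach I mean is essentially the following -- just an illustration with k=160, using the existing single-value update() and get_quantile() calls:

    # One kll_floats_sketch per stream, updated element-by-element from Python.
    import numpy as np
    from datasketches import kll_floats_sketch

    m = 10                                          # number of streams (vector length)
    sketches = [kll_floats_sketch(160) for _ in range(m)]

    for _ in range(1000):                           # stand-in for the real stream of vectors
        vector = np.random.normal(size=m)
        for i in range(m):                          # one Python -> C++ call per element
            sketches[i].update(float(vector[i]))

    medians = np.array([s.get_quantile(0.5) for s in sketches])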

Re: the streaming-quantiles code being easily modified, I believe the only necessary changes would be changing the Compactor class to be a subclass of numpy.ndarray, rather than list, and implementing methods for the list-specific methods that are used, like .append().  Then, it isn't necessary to loop over the streams since we can make use of Numpy's broadcasting, which will handle the looping in its C++ code, as you mentioned.  I'll work on this and see if it really is as straight-forward as it seems.

If you have any advice on how to use DataSketches for my problem, I'm certainly open to that.

Thanks,
Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Wednesday, May 6, 2020 4:37 PM
To: Michael Himes <mh...@knights.ucf.edu>>; dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Cc: Edo Liberty <ed...@gmail.com>>; edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Michael,

Thank you for considering the DataSketches library.   I am adding this thread to our dev@datasketches.apache.org<ma...@datasketches.apache.org> so that our whole team can contribute to finding a solution for you.

WRT the error you experienced, please help us help you by sharing with us what the exact error was.

We are about to release a major upgrade to the DataSketches C++/Python product in the next few weeks.  We have fixed a number of stability issues and bugs, which may solve the problem.  Nonetheless, we want to work with you to get your problem solved.

Updating 1e5 sketches in a system is not a problem in Java or C++.   We have real-time systems today that generate and process over 1e9 sketches every day.  Unfortunately our experience tells us that looping in Python code will be 10 to 100 times slower than Java or C++.  This is because the code would have to switch from Python to C++ for every vector element.

By comparison, the streaming-quantiles code could be easily modified to use Numpy arrays and operate on vectors.

I would like to understand more about what you have in mind that would be "easily modified".

NumPy achieves its speed performance by doing all of the matrix operations in pre-compiled C++ code.  To achieve best performance, we would want to read and loop through the NumPy data structure on the C++ side leveraging the C++ DataSketches library directly.  I am not sure what would be involved to actually accomplish that.

But first we need to get your Python + NumPy code working correctly with our library so we can find out what its actual performance is.

Cheers,

Lee.





On Wed, May 6, 2020 at 12:10 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Edo, Lee,

Thanks for the prompt response.  I looked at the datasketches library, and while it seems to have a lot more features, it looks like it'll be a lot more difficult to get it to work for my desired use case.

My problem is that I need quantiles for each element of a vector (length on the order of 1e4 -- 1e5), for some finite stream of vectors (on the order of 1e6 -- 1e8).  I tried using datasketches's KLL with Numpy arrays, but it throws an error, so it doesn't seem like datasketches handles this situation currently.

To use datasketches, I think I would need to instantiate 1 object per vector element, and I suspect this will slow things down considerably due to iterating over the objects when each vector is processed.  By comparison, the streaming-quantiles code could be easily modified to use Numpy arrays and operate on vectors.  I ran a few unit tests on both codes and found equivalent behavior, as expected.

Do you have any recommendation(s) for this situation?  Are there known limitations of the streaming-quantiles code that would cause issues for my use case?  Are the other methods offered in datasketches 'better' than the KLL implemented in streaming-quantiles?  I'm quite out of my area of expertise, so I appreciate any advice you can offer, and I will of course acknowledge it in the publication.

Best,
Michael

________________________________
From: Edo Liberty <ed...@gmail.com>>
Sent: Tuesday, May 5, 2020 8:09 PM
To: Lee Rhodes <lr...@verizonmedia.com>>; Michael Himes <mh...@knights.ucf.edu>>
Cc: edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

+Lee

Hi Michael, Thanks for reaching out.
While you can certainly do that, I recommend using the python-Binded datasketches library. It will be more robust, faster, and bug free than my code :)

On Tue, May 5, 2020 at 14:11 Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Edo,

I'm currently working on a Python package for machine-learning-accelerated exoplanet modeling.  It is free and open source (see here if you're curious https://github.com/exosports/HOMER), and it's meant purely for reproducible academic research.

I'm adding some new features to the software, and one of them requires computing quantiles for a data set that cannot fit into memory.  After searching around for different methods to do this, your KLL method seemed to be a good option in terms of speed and space requirements.

Rather than reinvent the wheel and code my own implementation of the method from scratch, I was wondering if you'd be willing to allow me to use your code?  I don't see a license, so I wanted to make sure you're okay with this.  I could implement it as a submodule within my repo, or I could only include the kll.py file and add some additional comments pointing to your repo and such, whichever you prefer.

Best,
Michael
--
From my cell phone.

Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Posted by Jon Malkin <jo...@gmail.com>.
Hi Michael,

My default is to treat an input vector x as a column vector
-- the generic quadratic form x^T A x assumes that, for instance. But that might
be an engineering thing. I'm following your approach for now, and eventually we
can debate whether to transpose the matrix if one dimension matches the
number of sketches in the object but not the expected one.

Anyway, I looked more at the docs and see them using unchecked references
(after doing a bounds check) so I switched to that, and then I added in a
check for c-style vs fortran-style indexing so that I believe it'll have
the inner loop over the native dimension. In theory it'll walk linearly
through the matrix. That or I got it exactly backwards and am thrashing
some cache level, one of the two :D
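
For the curious, the layout check amounts to something like the following (illustrative only, with placeholder names rather than the actual branch):

    // Pick the loop order so the inner loop walks the dimension that is
    // contiguous in memory for this particular numpy array.
    #include <vector>
    #include <pybind11/numpy.h>
    #include "kll_sketch.hpp"

    namespace py = pybind11;
    using sketch_vector = std::vector<datasketches::kll_sketch<float>>;

    void update_2d(sketch_vector& sketches, const py::array_t<double>& items) {
      auto data = items.unchecked<2>();            // bounds-checked once, raw access after
      if (items.flags() & py::array::c_style) {    // row-major: dimensions contiguous per point
        for (py::ssize_t i = 0; i < data.shape(0); ++i)
          for (py::ssize_t j = 0; j < data.shape(1); ++j)
            sketches[j].update(static_cast<float>(data(i, j)));
      } else {                                     // fortran-style: points contiguous per dimension
        for (py::ssize_t j = 0; j < data.shape(1); ++j)
          for (py::ssize_t i = 0; i < data.shape(0); ++i)
            sketches[j].update(static_cast<float>(data(i, j)));
      }
    }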

If you have some time, please check out the branch and play with it for a
bit to ensure it's still behaving as you expect. Then we can figure out
some relevant unit tests.

  jon



On Fri, May 22, 2020 at 7:06 AM Michael Himes <mh...@knights.ucf.edu>
wrote:

> Jon,
>
> Those changes sound great, as long as the data is being accessed
> correctly. The pybind docs warn about accessing data through the array_t
> object since it's not guaranteed to be contiguous in memory.  Typically,
> they demonstrate accessing it through the buffer, which I followed.  But if
> this is an unnecessary step, then great.
>
> As for the 2D case, here is my line of thinking.  For 1D, we have a single
> row with d values.  So for 2D, we'd have n rows with d values, (n x d).  I
> believe that is how I coded it, but it's possible I flipped the dimensions.
>
> Michael
>
> ------------------------------
> *From:* Jon Malkin <jo...@gmail.com>
> *Sent:* Thursday, May 21, 2020 7:17 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Michael,
>
> I've restructured the object to be an actual C++ object with proper
> methods. And then I've gotten rid of all the casts to buffer in favor of
> just using the py::array_t<> that's passed in. That removes casting
> everything to double, and allows for range checks. Now an attempt to access
> sketch 7 in a 5-d array doesn't just segfault :)
>
> Looking at pybind docs a bit more, it seems there are no hard guarantees
> on data layout in memory with numpy arrays -- if you transpose one, walking
> through with a pointer will return items in the wrong order. So update()
> ends up using items.at() instead (more on that in a moment). The whole thing is probably also
> copying values around more than necessary. Anyway, we can look at ways to
> optimize such things eventually, but for now I'm working on ensuring
> correctness and at least somewhat graceful failure.
>
> Anyway, item input order. If we have 1-d input, we implicitly assume we
> want d updates, one for each dimension in the object. It seems like the
> default for numpy is row-major order, which makes sense given C beneath the
> hood. But for inputting n points at a time, do you expect the matrix to be
> (d x n) or (n x d)?
>
>   jon
>
> On Tue, May 19, 2020, 5:20 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Re: the template type A, I set that for the Python array data type.  A
> Python float is 64 bits, so that is a C++ double.  I thought it was
> necessary to set the py::array_t data type since I think it's a template,
> but I could be mistaken.
>
> Michael
>
> ------------------------------
> *From:* leerho <le...@gmail.com>
> *Sent:* Tuesday, May 19, 2020 7:46 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Excellent work!
>
> On Tue, May 19, 2020 at 4:04 PM Jon Malkin <jo...@gmail.com> wrote:
>
> I also used k=160, so in this case we matched nicely. And the bunches of
> 2^5 or 2^7 you were testing are exactly what I meant when referring to
> batched inputs. So that's good news.
>
> I'll take a more careful look through the code -- there was something with
> update using arrays of templated type A which was always cast to double,
> for instance. But this is certainly promising.
>
>   jon
>
> On Tue, May 19, 2020 at 3:32 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Great tests (especially with the ordering), Jon!
>
> I did some scaling tests for dimensionality (1, 10, and 100), and this is
> where I think the Numpy version shows its benefits.  I performed a test
> similar to your setup:
> - each sketch has k = 160 (unsure what you used for this value, if it
> matters)
> - 2^25 draws from a normalized Gaussian distribution (numpy.random.normal)
> - get_quantiles(0.5)
>
> d=1    -- 84 s (this is the 123 s case you ran)
> d=10   -- 88 s
> d=100  --  294 s
> d=1000 -- 2298 s (did this one for fun, but there is a lot of variability
> in runtime)
>
> Note that I did not use a single-value method, just the Numpy version.
> Also, I checked the compute cost of the Python loop, and it's about 1
> second, so most of that ~80 seconds is the communication between Python and
> C++.  The scaling relation looks to be better than linear, but there need
> to be a few more tests here to really determine that.
>
> But, as Lee pointed out, there is non-negligible overhead from crossing
> the bridge between Python and C++.  It's small, but when doing it 2^25
> times it adds up.  The Numpy implementation allows you to cross that bridge
> much less often, albeit at the cost of some extra time programming that
> part.  If I set up a queue that holds 2^5 values and then updates it, it's
> quite a bit better.  Here are the results for the same dimensions as before:
>
> d=1   -- 8 s
> d=10  -- 31 s
> d=100 -- 257 s
>
> So, even with a small queue of 32 values, we see that a single sketch
> using kll_sketches is faster than a kll_sketch by a factor of 2-3.  And
> with the batch set to 2^7 values (this is how I use it in my project):
> d=1   -- 4.2 s
> d=10  -- 27 s
> d=100 -- 251 s
>
> The speed gain doesn't seem to scale with dimensionality, but I think that
> has more to do with the compute overhead of generating the data since Numpy
> tends to be faster when working in 1D vs multiple dimensions.  But we can
> see that it's possible to get runtimes much closer to C++ runtimes than
> would be expected.
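>
> To spell out the queueing, it is nothing fancier than the following (illustrative only; kll_sketches here is the wrapper class from this thread with a 2-D update(), and the constructor arguments and sizes are just examples, not an existing datasketches API):
>
>     import numpy as np
>
>     d, k, batch = 100, 160, 32
>     sketches = kll_sketches(k, d)          # assumed: d independent sketches, each with parameter k
>
>     queue = np.empty((batch, d))
>     count = 0
>     for _ in range(2**15):                 # stand-in for the real stream of vectors
>         queue[count] = np.random.normal(size=d)
>         count += 1
>         if count == batch:
>             sketches.update(queue)         # one Python -> C++ crossing per 32 vectors
>             count = 0
>     if count:
>         sketches.update(queue[:count])
>
>     medians = sketches.get_quantiles(0.5)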
>
> Michael
> ------------------------------
> *From:* Jon Malkin <jo...@gmail.com>
> *Sent:* Tuesday, May 19, 2020 4:58 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Well, one thought was maybe we could always use the vectorized kll in
> python and make it (relatively) easy to have it work with only 1 dimension.
> It looks like there's still a non-trivial performance hit from that. But
> wow.. I realized I could try something simple like reversing the
> declaration order of single-update vs vector-update in the wrapper class.
> And that dropped it to 35s!
>
> With that, it may be worth exploring a unified wrapper that handles single
> items or vectors.
>
>   jon
>
> On Tue, May 19, 2020 at 1:52 PM leerho <le...@gmail.com> wrote:
>
> We had a similar issue in Java trying to use JNI to access C code.  Every
> transition across the "boundary" between Java and C took from 10 to 100
> microseconds.  This made the JNI option pretty useless from our
> standpoint.
>
> I don't know python that well, but I could well imagine that there may be
> a similar issue here in moving data between Python and C++.
>
> That being said, compared to brute-force computation of these types of
> queries in Python vs using even these (what we consider slow performing)
> sketches in Python still may be a huge win.
>
> Lee.
>
>
>
> On Tue, May 19, 2020 at 1:28 PM Jon Malkin <jo...@gmail.com> wrote:
>
> I tried comparing the performance of the existing floats sketch vs the new
> thing with a single dimension. And then I made a second update method that
> handles a single item rather than creating an array of length 1 each time.
> Otherwise, the scripts were as identical as possible. I fed in 2^25
> gaussian-distributed values and queried for the mean to force some
> computation on the sketch. I think get_quantile(0.5) vs
> get_quantiles(0.5)[0][0] was the only difference,
>
> Existing kll_floats_sketch: 31s
> kll_floatarray_sketches: 123s
> with single-item update: 80s
>
> Same test in c++: 1.7s  (I can get it to 1.4s but that's using a worse RNG
> so this seemed more fair)
>
> I didn't try anything with trying to batch updates, even though in theory
> the new object can support that. This was more a test to see the
> performance impact of using it for all kll sketches.
>
> At some level, if you're already ok taking the speed hit for python vs C++
> then maybe it doesn't matter. But >2x still seems significant to me.
>
>   jon
>
> On Thu, May 14, 2020 at 6:54 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Great, I'll be submitting the pull request shortly.  The codebase I'm
> working with doesn't have any of the changes made in the past week or so;
> hopefully that isn't too much of a hassle to merge.
>
> As an aside, my employer encourages us to contribute code to libraries
> like this, so I'm happy to work on additional features for the Python
> interface as needed.
>
> Michael
> ------------------------------
> *From:* Jon Malkin <jo...@gmail.com>
> *Sent:* Thursday, May 14, 2020 6:56 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> We've been polishing things up for a release, so that was one of several
> things that we fixed over the last several days. Thank you for finding it!
>
> Anyway, if you're generally happy with the state of things (and are
> allowed to under any employment terms), I'd encourage you to create pull
> request to merge your changes into the main repo. It doesn't need to be
> perfect as we can always make changes as part of the PR review or
> post-merge.
>
> Thanks,
>   jon
>
>
> On Mon, May 11, 2020 at 2:25 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Thanks for taking a look, Jon.
>
> I pushed an update that address 2 & 4.
>
> #3 is actually something I had a question about. I've tested passing
> numpy.nan into the update function, and it doesn't appear to break anything
> (min, max, etc all still work correctly).  However, the reported number of
> items per sketch counts the nan entries.  Is this the expected behavior, or
> should the get_n() method return a number that does not count the nans it
> has seen?  I expected the latter, so I'm worried that numpy's nan is being
> treated differently.
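>
> (For reference, the toy check I'm running is essentially this; I'm not asserting what the right answer is, just showing what I'm measuring:)
>
>     import numpy as np
>     from datasketches import kll_floats_sketch
>
>     sk = kll_floats_sketch(160)
>     for x in [1.0, 2.0, np.nan, 3.0]:
>         sk.update(float(x))
>     # question above: should get_n() report 4 (counting the nan) or 3?
>     print(sk.get_n(), sk.get_min_value(), sk.get_max_value())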
>
> Michael
> ------------------------------
> *From:* Jon Malkin <jo...@gmail.com>
> *Sent:* Monday, May 11, 2020 4:32 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> I didn't look in super close detail, but the code overall looks pretty
> good. Comments are below.
>
> Note that not all of these necessarily need changes or replies. I'm just
> trying to document things we'll want to think about for keeping the library
> general-purpose (and we can always make changes after merging, of course).
>
> 1. I worry the name kll_sketches is confusingly similar to kll_sketch.
> Maybe vector_kll_sketches? But if there's a way to extend KLL in the future
> to operate on an entire vector at a time (vs treating each dimension
> independently) that'd become confusing. I think an inherently vectorized
> version would be a very different beast, but I always worry I'm not being
> imaginative enough. If merging into the Apache codebase, I'd probably wait
> to see what the file looks like with the renaming before a final decision
> on moving to its own file.
>
> 2. What happens if the input to update() has >2 dimensions? If that'd be
> invalid, we should explicitly check and complain. If it'll Do The Right
> Thing by operating on the first 2 dimensions (meaning correct indices)
> that's fine, but otherwise it should probably complain.
>
> 3. Can this handle sparse input vectors? Not sure how important that is in
> general, even if your project doesn't require it. kll_sketch will ignore
> NaNs, so those appearing would mean the number of items per sketch can
> already differ.
>
> 4. I'd probably eat the very slightly increased space and go with 32 bits
> for the number of dimensions (aka number of sketches). If trying to look at
> a distribution of values for some machine learning application, it'd be
> easy to overflow 65k dimensions for some tasks.
>
> 5. I imagine you've realized that it's easiest to do unit tests from
> python in this case. That's another advantage of having this live in the
> wrapper.
>
> 6. Finally, that assert issue is already obsolete :). Asserts were
> converted to if/throw exceptions late last week. It'll be flagged as a
> conflict in merging, so no worries for now.
>
> Looking good at this point. And as I said, not all of these need changes
> or comments from you.
>
>   jon
>
> On Mon, May 11, 2020 at 7:09 AM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Understood, I went ahead and moved the new class to the kll_wrapper.cpp
> file -- I'll leave it to you to decide if it's better as its own file.
>
> Also, while gcc 7.4.0 compiles the code without issue, using gcc 7.5.0
> throws errors regarding the assert calls in kll_sketch_impl.hpp.  I added
> an include of assert.h there and then it compiled without issue.  It's
> possible that other compilers will also complain about that, so maybe this
> is a good update to the main branch.
>
> Michael
> ------------------------------
> *From:* Jon Malkin <jo...@gmail.com>
> *Sent:* Sunday, May 10, 2020 10:47 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> My only comment without having looked at actual code is that the new class
> would be more appropriate in the python wrapper. Maybe even drop it in as
> it's own file, as that would decrease recompile time a bit when debugging
> (that's pybind's suggestion, anyway). Probably not a huge difference with
> how light these wrappers are.
>
> If this is something that becomes widely used, to where we look at pushing
> it into the base library, we'd look at whether we could share any data
> across sketches. But we're far from that point currently. It'd be nice to
> need to consider that.
>
>   jon
>
> On Sun, May 10, 2020, 7:33 PM leerho <le...@gmail.com> wrote:
>
> Michael,  this has been a great interchange and certainly will allow us to
> move forward more quickly.
>
> Thank you for working on this on a Mother's Day Sunday!
>
> I'm sure Alex and Jon may have more questions, when they get a chance to
> look at it starting tomorrow.
>
> Cheers, and be safe and well!
>
> Lee.
>
> On Sun, May 10, 2020 at 6:25 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Re: testing, so far I've just done glorified unit tests for uniform and
> normal distributions of varying sizes.  I plan to do some timing tests vs
> the existing single-sketch Python class to see how it compares for 1, 10,
> and 100 streams.
>
> 1. That makes sense.  One option to allow full Numpy compatibility but
> without requiring a Python user to use Numpy would be to return everything
> as lists, rather than Numpy arrays.  Numpy users could then convert those
> lists into arrays, and non-Numpy users would be unaffected (aside from
> needing the pybind11/numpy.h header).  Alternatively, some flag could be
> set when instantiating the object that would control whether things are
> returned as lists or arrays, though this still requires the numpy.h header
> file.
>
> 2. I didn't change the kll_sketch code, I only defined a new (wrapper)
> class called kll_sketches, which spawns a user-specified number of
> sketches.  Each of those sketches are kll_sketch objects and uses all of
> the existing code for that.  For fast execution in Python, the parallel
> sketches must be spawned in C++, but the existing Python object could only
> spawn a single sketch since it wraps the kll_sketch class.  Perhaps the
> kll_sketches class would be better placed in the python/src/kll_wrapper.cpp
> file?  I suppose you wouldn't need this class if you weren't using Python.
>
> 3. Yes, SerDe is very straight-forward here.  I've marked some stuff as
> todo's, and that is one of them -- the plan is to do like you described and
> call the relevant kll_sketch method on each of the sketches and return that
> to Python in a sensible format.  For deserialization, it would just iterate
> through them and load them into the kll_sketches object.  I don't require
> it for my project, so I didn't bother to wrap that yet -- I'll take a look
> sometime this week after I finish my work for the day, shouldn't take long
> to do.
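>
> (At the Python level I picture the round trip looking roughly like this -- purely a sketch of the plan, assuming the per-sketch serialize()/deserialize() exposed by the existing python wrapper:)
>
>     from datasketches import kll_floats_sketch
>
>     def serialize_all(sketches):
>         # one bytes object per dimension; packaging them together is a separate question
>         return [sk.serialize() for sk in sketches]
>
>     def deserialize_all(blobs):
>         return [kll_floats_sketch.deserialize(b) for b in blobs]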
>
> 4. That makes sense.  Does using Numpy complicate that at all?  My thought
> is that since under the hood everything is using the existing kll_sketch
> class, it would have full compatibility with the rest of the library (once
> SerDe is added in).
>
> Michael
> ------------------------------
> *From:* leerho <le...@gmail.com>
> *Sent:* Sunday, May 10, 2020 8:42 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Thanks for the link to your code.  My colleagues, Jon and Alex, will take
> a closer look this next week.  They wrote this code so they are much closer
> to it than I.
>
> What you have done so far makes sense for you as you want to get this
> working in the NumPy environment as quickly as possible.  As soon as we
> start thinking about incorporating this into our library other concerns
> become important.
>
> 1. Adding API calls is the recommended way to add functionality (like
> NumPy) to a library.  We cannot change API calls in a way that is only
> useful with NumPy, because it would seriously impact other users of the
> library that don't need NumPy.  If both sets of calls cannot simultaneously
> exist in the same sketch API, then we need to consider other alternatives.
>
> 2.  Based on our previous discussions, I didn't envision that you would
> have to change the kll_sketch code itself other than perhaps a "wrapper"
> class that enables vectorized input to a vector of sketches and a
> vectorized get result that creates a vector result from a vector of
> sketches.  This would isolate the changes you need for NumPy from the
> sketch itself.  This is also much easier to support, maintain and debug.
>
> 3. If you don't change the internals of the sketch then SerDe becomes
> pretty straightforward. I don't know if you need a single serialization
> that represents a full vector of sketches,  but if you do, then I would
> just iterate over the individual serdes and figure out how to package it.
> I really don't think you want to have to rewrite this low-level stuff.
>
> 4. Binary compatibility is critically important for us and I think will be
> important for you as well.  There are two dimensions of binary
> compatibility: history and language.  This means that a kll sketch
> serialized from Java can be successfully read by C++ and vice versa.
> Similarly, a kll sketch serialized today will be able to be read many years
> from now.     Another aspect of this would mean being able to collect, say,
> 100 sketches that were not created using the NumPy version, and being able
> to put them together in a NumPy vector, and vice versa.
>
> I hope all of this make sense to you.
>
> Cheers,
>
> Lee.
>
>
>
> On Sun, May 10, 2020 at 4:21 PM leerho <le...@gmail.com> wrote:
>
> Michael,
> This is great!  What testing have you been able to do so far?
>
>
> On Sun, May 10, 2020 at 3:31 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Lee,
>
> Thanks for all of that information, it's quite helpful to get a better
> understanding of things.
>
> I've put the code on Github if you'd like to take a look:
> https://github.com/mdhimes/incubator-datasketches-cpp
>
> Changes are
> - new class in kll/include/kll_sketch.hpp, w/ associated constructor in
> kll/include/kll_sketch_impl.hpp.  This class spawns a specified number of
> sketches.
> - new Python interface functions in python/src/kll_wrapper.cpp
>
> The only new dependency introduced is the pybind11/numpy.h header file.
> The new Numpy-compatible Python classes retain identical functionality to
> the existing classes (with minor changes to method names, e.g.,
> get_min_value --> get_min_values), except that I have not yet implemented
> merging or (de)serialization.  These would be straight-forward to
> implement, if needed.
>
> Re: characterization tests, I'll take a look at those tests you linked to
> and see about running them, time and compute resources permitting.
>
> Michael
> ------------------------------
> *From:* leerho <le...@gmail.com>
> *Sent:* Sunday, May 10, 2020 5:32 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Michael,
>
> Is there a place on GitHub somewhere where I could look at your code so
> far?  The reason I ask, is before you do a PR, we would like to determine
> where a contribution such as this should be placed.
>
> Our library is split up among different repositories, determined by
> language and dependencies.  This keeps the user downloads smaller and more
> focused.   We have two library repos for the core sketch algorithms, one
> for Java and one for C++/Python, where the dependencies are very lean,
> which simplifies integration into other systems.  We have separate repos
> for adaptors, which depend on one of the core repos. On the Java side, we
> have separate repos for adaptors for Apache Hive and Apache Pig, as the
> dependencies for each of these are quite large.  For C++, we have a
> dedicated repo for the adaptors for PostgreSQL.
>
> Some of our adaptors are hosted with the target system.  For example, our
> Druid adaptors were contributed directly into Apache Druid.
>
> I assume your code has dependencies on Python, NumPy and DataSketches-cpp.
> It is not clear to me at the moment whether we should create a separate
> repo for this or have a separate group of directories in our cpp repo.
>
> ****
> We have a separate repo for our characterization code, which is not
> formally "released" as an Apache release.  It exists because we want others
> to be able to reproduce (or challenge) our claims of speed performance or
> accuracy.  It is the one repo where we have all languages and many
> different dependencies.  The coding style is not as rigorous or as well
> documented as our repos that do have formal releases.
>
> Characterization testing is distinctly different from Unit Tests, which
> basically checks all the main code paths and makes sure that the program
> works as it should.  The key metric is code coverage and Unit Tests should
> be fast as it is run on every check-in of new code.  Characterization is
> also different from Integration Testing, which is testing how well the code
> works when integrated into larger systems.
>
> Characterization tests are unique to our kind of library. Because our
> algorithms are probabilistic in nature, in order to verify accuracy or
> speed performance we need to run many thousands of trials to eliminate
> statistical noise in the results.  And when the data is large, this can
> take a long time.  You can peruse our website for many examples as all the
> plots result from various characterization studies.  What appears on the
> website is but a small fraction of all the testing we have done.
>
> There are no "standard" tests as every sketch is different so we have to
> decide what is important to measure for a particular sketch, but the basic
> groups are *speed* and *accuracy*.
>
> For speed there are many possible measurements, but the basic ones are
> update speed, merge speed, Serialization / Deserialization speed, get
> estimate or get result speeds.
>
> For accuracy we want to validate that the sketch is performing within the
> bounds of the theoretical error distribution.  We want to measure this
> accuracy in the context of a stand-alone, purely streaming sketch and also
> in the context of merging many sketches together.
>
> We also try to do these same tests comparing the results against other
> alternatives users might have.  We have performed these same
> characterizations on other publicly available sketches as well as against
> traditional, brute-force approaches to solving the same problem.
>
> For the solution you have developed, we would depend on you to decide what
> properties would be most important to characterize for users of this
> solution.  It should be very similar to what you would write in a paper
> describing this solution;  you want to convince the reader that this is
> very useful and why.
>
> Since the first sketch you have leveraged is the KLL quantiles sketch, I
> would think you would want some characterizations similar to what we did
> for our studies
> <https://datasketches.apache.org/docs/Quantiles/KLLSketch.html>
> comparing our older quantiles sketch and the KLL sketch.
>
> ****
> For the Java characterization tests, we have "standardized" on having
> small configuration files which define the key parameters of the test.
> These are simple text files
> <https://github.com/apache/incubator-datasketches-characterization/tree/master/src/main/resources>
> of key-value pairs.  We don't have any centralized definition of these
> pairs, just that they are human readable and intelligible.  They are
> different for each type of sketch.
>
> For the C++ tests, we don't have a collection of config files yet (this is
> one of our TODOs), but the same kind of parameters are set in the code
> itself.
>
> We will likely want to set up a separate directory for your
> characterization tests.
>
> I hope you find this helpful.
>
> Cheers,
>
> Lee.
>
> On Sun, May 10, 2020 at 10:05 AM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> The code is in a good state now.  It can take individual values, lists, or
> Numpy arrays as input, and it returns back Numpy arrays.  There are some
> additional features, like being able to specify which sketches the user
> wants to, e.g., get quantiles for.
>
> But, I have only done minor testing with uniform and normal
> distributions.  I'd like to put it through more extensive testing (and some
> documentation) before releasing it, and it sounds like your
> characterization tests are the way to go -- it's not science if it's not
> reproducible!  Is there a standard set of tests for this purpose?  If not,
> are there standard tests that have been used for the existing codebase?
>
> Michael
> ------------------------------
> *From:* leerho <le...@gmail.com>
> *Sent:* Saturday, May 9, 2020 7:21 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> This is great.  The first step is to get your project working!  Once you
> think you are ready, it would be really useful if you could do some
> characterization testing in the NumPy environment. Characterization tests
> are what we run to fully understand how a sketch performs over a range of
> parameters and using thousands to millions of trials.  You can see some of
> the accuracy and speed performance plots of various sketches on our
> website.  Sometimes these can take hours to run.  We typically use
> synthetic data to drive our characterization tests to make them
> reproducible.
>
> Real data can also be used and one comparison test I would recommend is
> comparing how long it takes to get approximate results using sketches
> versus how long it would take to get exact results using brute force
> methods.  The bigger the data set is the better :)
>
> We don't have much experience with NumPy so this will be a new environment
> for us.  But before you get too deep into this please get us involved.  We
> have been characterizing these streaming algorithms for a number of years,
> and would like to help you.
>
> Cheers,
>
> Lee.
>
> On Sat, May 9, 2020 at 2:18 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> I'm not quite sure what being a committer entails, but yeah I'm happy to
> contribute.  I can't commit a lot of time to working on it, but with how
> things went for KLL I don't think it will take a lot of time for the other
> sketches if they are formatted in a similar manner.  Getting this library
> integrated into numpy/scipy would be awesome, I'm sure I could get some
> others in my field to begin using it.
>
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Saturday, May 9, 2020 5:06 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>; dev@datasketches.apache.org
> <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> This is just awesome!   Would you be interested in becoming a committer on
> our project?  It is not automatic, but we could work with you to bring you
> up to speed on the other sketches in the library.  If you could help us
> integrate DataSketches into NumPy and possibly SciPy (not sure if this is
> necessary) it would be a very significant contribution and we would
> definitely want you to be part of our community!
>
> Thanks,
>
> Lee.
>
> On Sat, May 9, 2020 at 1:41 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Lee,
>
> Thanks for the notice, I went ahead and subscribed to the list.
>
> As for Jon's email, this is actually what I have currently implemented!
> Once I finish ironing out a couple improvements, I'm going to move some
> code around to follow the existing coding style, put it on Github, and
> submit a pull request.
>
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Saturday, May 9, 2020 4:22 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>
> *Subject:* Fwd: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Hi Michael,
> I don't think you saw this email as I doubt you are subscribed to our
> dev@datasketches.apache.org email list.
>
> We would like to have you as part of our larger community, as others might
> also have suggestions on how to move your project forward.
> You can subscribe by sending an empty email to
> dev-subscribe@datasketches.apache.org.
>
> Lee.
>
> ---------- Forwarded message ---------
> From: *Jon Malkin* <jo...@gmail.com>
> Date: Thu, May 7, 2020 at 4:11 PM
> Subject: Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
> To: <de...@datasketches.apache.org>
> Cc: Lee Rhodes <lr...@verizonmedia.com>, Edo Liberty <
> edo.liberty@gmail.com>, edo@edoliberty.com <ed...@edoliberty.com>
>
>
> We're using pybind11 to get a C++ interface with python (vs raw C). The
> wrappers themselves are quite thin, but they do have examples of calling
> functions defined in the wrapper as opposed to only the sketch object.
>
> I believe the easiest way to do this will be to define a pretty simple C++
> object and create a pybind wrapper for it.  That object would contain a
> std::vector<kll_sketch>.  Then you'd define an update method for your
> custom object that iterates through a numpy array and calls update() on the
> appropriate sketch. You'd also want to define something similar for
> get_quantile() or whatever other methods you need that iterates through
> that vector of sketches and returns the result in a numpy array.
>
> That's a pretty lightweight object. And then you'd use a similar thin
> pybind wrapper around it to make it play nicely with python. Since our C++
> library is just templates, you'd end up with a free-standing library, with
> no requirement that the base datasketches library be involved.
>
>   jon
>
> On Thu, May 7, 2020 at 1:08 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> I would be happy to share whatever I come up with (if anything).  The lack
> of a Numpy/Scipy implementation is what led me to the DataSketches library,
> it would be very useful to myself and others if it were a part of
> Numpy/Scipy.
>
> For what it's worth, passing in a Numpy array and manipulating it from the
> C++ side is quite easy.  On the other hand, figuring out how to spawn m
> sketches and pass the values along to that looks like it'll be more
> challenging, there is a lot of code here and it'll take some time for me to
> familiarize myself with it.
>
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Thursday, May 7, 2020 12:00 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>
> *Cc:* Edo Liberty <ed...@gmail.com>; dev@datasketches.apache.org <
> dev@datasketches.apache.org>; edo@edoliberty.com <ed...@edoliberty.com>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> If you do figure out how to do this, it would be great if you could share
> it with us.  We would like to extend  it to other sketches and submit it as
> an added functionality to NumPy.  I have been looking at the NomPy and
> SciPy libraries and have not found anything close to what we have.
>
> Lee.
>
>
> On Thu, May 7, 2020 at 7:08 AM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Lee, Jon,
>
> Thanks for the information.  I tried to vectorize things this morning and
> ran into that exact problem -- since the offsets can differ, it leads to
> slices of different lengths, which wouldn't be possible to store as a
> single Numpy array.
>
> Lee, your understanding of my problem is spot on.  n vectors of size m,
> where all m elements of each vector are a float (no NaNs or missing
> values).  I am interested in quantiles at rank r for each of the m
> streams.  Only 1 sketch will operate simultaneously, saving/loading the
> sketch is not required (though it would be a nice feature), and sketches
> would not need to be merged (no serialization/deserialization).
>
> Not surprisingly, it looks like your original suggestion of handling this
> on the C++ side is the way to go.  Once I have time to dive into the code,
> my plan is to write something that implements what you described in the
> earlier email.
>
> Thanks,
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Wednesday, May 6, 2020 10:43 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>
> *Cc:* dev@datasketches.apache.org <de...@datasketches.apache.org>; Edo
> Liberty <ed...@gmail.com>; edo@edoliberty.com <ed...@edoliberty.com>
>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Michael,
>
> One of my colleagues, Jon Malkin, pointed out that the vector-KLL will not
> work for another reason and that is for each dimension, choosing whether to
> delete the odd or even values in the compactor must be random and
> independent of the other dimensions.  Otherwise you might get unwanted
> correlation effects between the dimensions.
>
> This is another argument that you should have independent compactors for
> each dimension.  So you might as well stick with individual sketches for
> each dimension.
>
> Lee.
>
> On Wed, May 6, 2020 at 4:39 PM Lee Rhodes <lr...@verizonmedia.com>
> wrote:
>
> Michael,
>
> Allow me to back up for a moment to make sure I understand your problem.
>
> You have a large number of large vectors of the form *V_n = {x_i}:*  *n*
> vectors of size *m, *where *x* is a *number* and *x_i* is the *i*th
> element, or equivalently, the *i*th dimension.
>
> Assumptions:
>
>    - All vectors, *V*, are of the same size *m.*
>    - All elements, *x_i*, are valid numbers of the same type. No missing
>    values, and if you are using *floats*, this means no *NaN*s.
>
> In aggregate, the *n* vectors represent *m* *independent* distributions
> of values.
>
> Your task is to be able to obtain *m* quantiles at rank *r* in a single
> query.
>
> ****
> To do this, using your idea, would require vectorization of the entire
> sketch and not just the compactors.  The inputs are vectors, the result of
> operations such as getQuantile(r), getQuantileUpperBound(r),
> getQuantileLowerBound(r), are also vectors.
>
> This sketch will be a large data structure, which leads to more questions
> ...
>
>    - Do you anticipate having many of these vectorized sketches operating
>    simultaneously?
>    - Is there any requirement to store and later retrieve this sketch?
>    - Or, the nearly equivalent question: Do you require merging of these
>    sketches (across clusters, for example)?  Which also means serialization
>    and deserialization.
>
> I am concerned that this vector-quantiles sketch would be limited in the
> sense that it may not be as widely applicable as it could be.
>
> Our experience with real data is that it is ugly with missing values, NaN,
> nulls, etc.  Which means we would not be able to vectorize the compactor.
> Each dimension *i* would need a separate independent compactor because
> the compaction times will vary depending on missing values or NaNs in the
> data.
>
> Spacewise, I don't think having separate independent sketches for each
> dimension would be much smaller than vectorizing the entire sketch, because
> the internals of the existing sketch are already quite space efficient
> leveraging compact arrays, etc.
>
> As a first step I would favor figuring out how to access the NumPy data
> structure on the C++ side, having individual sketches for each
> dimension, and doing the iterations updating the sketches in C++.   It also
> has the advantage of leveraging code that exists and it would automatically
> be able to leverage any improvements to the sketch code over time.  In
> addition, it could be a prototype of how to integrate other sketches into
> the NumPy ecosystem.
>
> A fully vectorized sketch would be a separate implementation and would not
> be able to take advantage of these points.
>
> Lee.
>
> On Wed, May 6, 2020 at 2:47 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Lee,
>
> I don't think there is a problem with the DataSketches library, just that
> it doesn't support what I am trying to do -- looking in the documentation,
> it only supports streams of ints or floats, and those situations work fine
> for me.  Here's what I did:
> - began with the KLL test .py file:
> https://github.com/apache/incubator-datasketches-cpp/blob/master/python/tests/kll_test.py
> - replaced line 30 with kll.update(np.ones(10) * randn())  to have a Numpy
> array of 10 identical values.
> - ran the code
>
> This leads to the following error, as expected:
> TypeError: update(): incompatible function arguments. The following
> argument types are supported:
>     1. (self: datasketches.kll_floats_sketch, item: float) -> None
>
> Invoked with: <datasketches.kll_floats_sketch object at 0x7f1e128989d0>,
> array([-1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424,
>        -1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424])
>
> It's not coded to support Numpy arrays, therefore it complains.  What I
> would ideally like to have happen in this scenario is it would treat each
> element in the array as a separate stream.  Then, later when getting a
> given quantile, it would give 10 values, one for each stream.  I don't see
> an easy approach to implementing this on the Python side besides a very
> slow iterative approach, and admittedly my C++ is quite rusty so I haven't
> looked into the codebase to see how I might modify things there to support
> this functionality.
>
> Re: the streaming-quantiles code being easily modified, I believe the only
> necessary changes would be changing the Compactor class to be a subclass of
> numpy.ndarray, rather than list, and implementing methods for the
> list-specific methods that are used, like .append().  Then, it isn't
> necessary to loop over the streams since we can make use of Numpy's
> broadcasting, which will handle the looping in its C++ code, as you
> mentioned.  I'll work on this and see if it really is as straight-forward
> as it seems.
>
> If you have any advice on how to use DataSketches for my problem, I'm
> certainly open to that.
>
> Thanks,
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Wednesday, May 6, 2020 4:37 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>; dev@datasketches.apache.org
> <de...@datasketches.apache.org>
> *Cc:* Edo Liberty <ed...@gmail.com>; edo@edoliberty.com <
> edo@edoliberty.com>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Michael,
>
> Thank you for considering the DataSketches library.   I am adding this
> thread to our dev@datasketches.apache.org so that our whole team can
> contribute to finding a solution for you.
>
> WRT the error you experienced, please help us help you by sharing with us
> what the exact error was.
>
> We are about to release a major upgrade to the DataSketches C++/Python
> product in the next few weeks.  We have fixed a number of stability issues
> and bugs, which may solve the problem.  Nonetheless, we want to work with
> you to get your problem solved.
>
> Updating 1e5 sketches in a system is not a problem in Java or C++.   We
> have real-time systems today that generate and process over 1e9 sketches
> every day.  Unfortunately our experience tells us that looping in Python
> code will be 10 to 100 times slower than Java or C++.  This is because the
> code would have to switch from Python to C++ for every vector element.
>
> By comparison, the streaming-quantiles code could be easily modified to
> use Numpy arrays and operate on vectors.
>
>
> I would like to understand more about what you have in mind that would be
> "easily modified".
>
> NumPy achieves its speed performance by doing all of the matrix operations
> in pre-compiled C++ code.  To achieve best performance, we would want to
> read and loop through the NumPy data structure on the C++ side leveraging
> the C++ DataSketches library directly.  I am not sure what would be
> involved to actually accomplish that.
>
> But first we need to get your Python + NumPy code working correctly with
> our library so we can find out what its actual performance is.
>
> Cheers,
>
> Lee.
>
>
>
>
>
> On Wed, May 6, 2020 at 12:10 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Edo, Lee,
>
> Thanks for the prompt response.  I looked at the datasketches library, and
> while it seems to have a lot more features, it looks like it'll be a lot
> more difficult to get it to work for my desired use case.
>
> My problem is that I need quantiles for each element of a vector (length
> on the order of 1e4 -- 1e5), for some finite stream of vectors (on the
> order of 1e6 -- 1e8).  I tried using datasketches's KLL with Numpy arrays,
> but it throws an error, so it doesn't seem like datasketches handles this
> situation currently.
>
> To use datasketches, I think I would need to instantiate 1 object per
> vector element, and I suspect this will slow things down considerably due
> to iterating over the objects when each vector is processed.  By
> comparison, the streaming-quantiles code could be easily modified to use
> Numpy arrays and operate on vectors.  I ran a few unit tests on both codes
> and found equivalent behavior, as expected.
>
> Do you have any recommendation(s) for this situation?  Are there known
> limitations of the streaming-quantiles code that would cause issues for my
> use case?  Are the other methods offered in datasketches 'better' than the
> KLL implemented in streaming-quantiles?  I'm quite out of my area of
> expertise, so I appreciate any advice you can offer, and I will of course
> acknowledge it in the publication.
>
> Best,
> Michael
>
> ------------------------------
> *From:* Edo Liberty <ed...@gmail.com>
> *Sent:* Tuesday, May 5, 2020 8:09 PM
> *To:* Lee Rhodes <lr...@verizonmedia.com>; Michael Himes <
> mhimes@knights.ucf.edu>
> *Cc:* edo@edoliberty.com <ed...@edoliberty.com>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> +Lee
>
> Hi Michael, Thanks for reaching out.
> While you can certainly do that, I recommend using the python-Binded
> datasketches library. It will be more robust, faster, and bug free than my
> code :)
>
> On Tue, May 5, 2020 at 14:11 Michael Himes <mh...@knights.ucf.edu> wrote:
>
> Hi Edo,
>
> I'm currently working on a Python package for machine-learning-accelerated
> exoplanet modeling.  It is free and open source (see here if you're curious
> https://github.com/exosports/HOMER
),
> and it's meant purely for reproducible academic research.
>
> I'm adding some new features to the software, and one of them requires
> computing quantiles for a data set that cannot fit into memory.  After
> searching around for different methods to do this, your KLL method seemed
> to be a good option in terms of speed and space requirements.
>
> Rather than reinvent the wheel and code my own implementation of the
> method from scratch, I was wondering if you'd be willing to allow me to use
> your code?  I don't see a license, so I wanted to make sure you're okay
> with this.  I could implement it as a submodule within my repo, or I could
> only include the kll.py file and add some additional comments pointing to
> your repo and such, whichever you prefer.
>
> Best,
> Michael
>
> --
> From my cell phone.
>
>

Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Posted by Michael Himes <mh...@knights.ucf.edu>.
Jon,

Those changes sound great, as long as the data is being accessed correctly. The pybind docs warn about accessing data through the array_t object since it's not guaranteed to be contiguous in memory.  Typically, they demonstrate accessing it through the buffer, which I followed.  But if this is an unnecessary step, then great.

As for the 2D case, here is my line of thinking.  For 1D, we have a single row with d values.  So for 2D, we'd have n rows with d values, (n x d).  I believe that is how I coded it, but it's possible I flipped the dimensions.
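To make the convention concrete, a tiny pure-NumPy illustration (illustration only; the names here are mine, not the wrapper's): with an (n x d) batch, row i is the i-th input vector and column j holds the values destined for sketch j.

import numpy as np

n, d = 4, 3                     # n input vectors, each with d elements (one per sketch)
batch = np.random.randn(n, d)   # row i is the i-th input vector

for j in range(d):
    stream_j = batch[:, j]      # the n values that would go to sketch j
    # ... feed each value of stream_j to sketch j ...

# NumPy defaults to row-major (C) order, so this layout is contiguous row by row.
print(batch.flags['C_CONTIGUOUS'])   # True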

Michael

________________________________
From: Jon Malkin <jo...@gmail.com>
Sent: Thursday, May 21, 2020 7:17 PM
To: dev@datasketches.apache.org <de...@datasketches.apache.org>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Michael,

I've restructured the object to be an actual C++ object with proper methods. And then I've gotten rid of all the casts to buffer in favor of just using the py::array_t<> that's passed in. That removes casting everything to double, and allows for range checks. Now an attempt to access sketch 7 in a 5-d array doesn't just segfault :)

Looking at pybind docs a bit more, it seems there are no hard guarantees on data layout in memory with numpy arrays -- if you transpose one, walking through with a pointer will return items in the wrong order. So update() ends up using items.at() instead (more on that in a moment). The whole thing is probably also copying values around more than necessary. Anyway, we can look at ways to optimize such things eventually, but for now I'm working on ensuring correctness and at least somewhat graceful failure.
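To see the layout issue in isolation (pure NumPy, nothing to do with the wrapper code itself): a transpose is just a strided view, so the memory order no longer matches the logical order.

import numpy as np

a = np.arange(6, dtype=np.float64).reshape(2, 3)   # row-major by default
b = a.T                                            # transpose is a view with swapped strides

print(a.flags['C_CONTIGUOUS'])   # True
print(b.flags['C_CONTIGUOUS'])   # False: a raw pointer walk over b's memory
                                 # would visit elements in a's order, not b's

c = np.ascontiguousarray(b)      # makes a contiguous copy if pointer access is needed
print(c.flags['C_CONTIGUOUS'])   # True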

Anyway, item input order. If we have 1-d input, we implicitly assume we want d updates, one for each dimension in the object. It seems like the default for numpy is row-major order, which makes sense given C beneath the hood. But for inputting n points at a time, do you expect the matrix to be (d x n) or (n x d)?

  jon

On Tue, May 19, 2020, 5:20 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Re: the template type A, I set that for the Python array data type.  A Python float is 64 bits, so that is a C++ double.  I thought it was necessary to set the py::array_t data type since I think it's a template, but I could be mistaken.
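A quick NumPy check of the width point (illustration only; that the narrowing happens on the way into a 32-bit floats sketch is my assumption, not something verified in the wrapper):

import numpy as np

x = np.random.randn(5)
print(x.dtype)              # float64: NumPy's default, the same width as a Python float
x32 = x.astype(np.float32)  # explicit down-cast to the width a floats sketch stores
print(x32.dtype)            # float32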

Michael

________________________________
From: leerho <le...@gmail.com>>
Sent: Tuesday, May 19, 2020 7:46 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Excellent work!

On Tue, May 19, 2020 at 4:04 PM Jon Malkin <jo...@gmail.com>> wrote:
I also used k=160, so in this case we matched nicely. And the bunches of 2^5 or 2^7 you were testing are exactly what I meant when referring to batched inputs. So that's good news.

I'll take a more careful look through the code -- there was something with update using arrays of templated type A which was always cast to double, for instance. But this is certainly promising.

  jon

On Tue, May 19, 2020 at 3:32 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Great tests (especially with the ordering), Jon!

I did some scaling tests for dimensionality (1, 10, and 100), and this is where I think the Numpy version shows its benefits.  I performed a test similar to your setup:
- each sketch has k = 160 (unsure what you used for this value, if it matters)
- 2^25 draws from a normalized Gaussian distribution (numpy.random.normal)
- get_quantiles(0.5)

d=1    -- 84 s (this is the 123 s case you ran)
d=10   -- 88 s
d=100  --  294 s
d=1000 -- 2298 s (did this one for fun, but there is a lot of variability in runtime)

Note that I did not use a single-value method, just the Numpy version.  Also, I checked the compute cost of the Python loop, and it's about 1 second, so most of that ~80 seconds is the communication between Python and C++.  The scaling relation looks to be better than linear, but there need to be a few more tests here to really determine that.

But, as Lee pointed out, there is non-negligible overhead from crossing the bridge between Python and C++.  It's small, but when doing it 2^25 times it adds up.  The Numpy implementation allows you to cross that bridge much less often, albeit at the cost of some extra time programming that part.  If I set up a queue that holds 2^5 values and then updates it, it's quite a bit better.  Here are the results for the same dimensions as before:

d=1   -- 8 s
d=10  -- 31 s
d=100 -- 257 s

So, even with a small queue of 32 values, we see that a single sketch using kll_sketches is faster than a kll_sketch by a factor of 2-3.  And with the batch set to 2^7 values (this is how I use it in my project):
d=1   -- 4.2 s
d=10  -- 27 s
d=100 -- 251 s

The speed gain doesn't seem to scale with dimensionality, but I think that has more to do with the compute overhead of generating the data since Numpy tends to be faster when working in 1D vs multiple dimensions.  But we can see that it's possible to get runtimes much closer to C++ runtimes than would be expected.
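A rough sketch of that queueing pattern, assuming a vectorized update() that accepts a 2-D (n x d) NumPy array; the class name and constructor here are placeholders, not the actual wrapper API:

import numpy as np

class BufferedUpdater:
    """Accumulate rows and cross the Python/C++ boundary once per batch."""
    def __init__(self, sketches, d, batch_size=128):
        self.sketches = sketches              # placeholder: object with a vectorized update()
        self.batch_size = batch_size
        self.buffer = np.empty((batch_size, d))
        self.count = 0

    def update(self, row):
        self.buffer[self.count] = row         # stage one d-length vector
        self.count += 1
        if self.count == self.batch_size:
            self.flush()

    def flush(self):
        if self.count:
            self.sketches.update(self.buffer[:self.count])   # one boundary crossing
            self.count = 0

With batch_size = 2**7 this is roughly the pattern behind the 4.2 s / 27 s / 251 s numbers above.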

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Tuesday, May 19, 2020 4:58 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Well, one thought was maybe we could always use the vectorized kll in python and make it (relatively) easy to have it work with only 1 dimension. It looks like there's still a non-trivial performance hit from that. But wow.. I realized I could try something simple like reversing the declaration order of single-update vs vector-update in the wrapper class. And that dropped it to 35s!

With that, it may be worth exploring a unified wrapper that handles single items or vectors.

  jon

On Tue, May 19, 2020 at 1:52 PM leerho <le...@gmail.com>> wrote:
We had a similar issue in Java trying to use JNI to access C code.  Every transition across the "boundary" between Java and C took from 10 to 100 microseconds.  This made the JNI option pretty useless from our standpoint.

I don't know python that well, but I could well imagine that there may be a similar issue here in moving data between Python and C++.

That being said, compared to brute-force computation of these types of queries in Python vs using even these (what we consider slow performing) sketches in Python still may be a huge win.

Lee.



On Tue, May 19, 2020 at 1:28 PM Jon Malkin <jo...@gmail.com>> wrote:
I tried comparing the performance of the existing floats sketch vs the new thing with a single dimension. And then I made a second update method that handles a single item rather than creating an array of length 1 each time. Otherwise, the scripts were as identical as possible. I fed in 2^25 gaussian-distributed values and queried for the median to force some computation on the sketch. I think get_quantile(0.5) vs get_quantiles(0.5)[0][0] was the only difference.

Existing kll_floats_sketch: 31s
kll_floatarray_sketches: 123s
with single-item update: 80s

Same test in c++: 1.7s  (I can get it to 1.4s but that's using a worse RNG so this seemed more fair)
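For anyone wanting to reproduce this kind of measurement, a minimal harness along these lines should be close (my reconstruction, not the exact script; it assumes kll_floats_sketch(k) and get_quantile() as used in the library's Python tests):

import time
import numpy as np
from datasketches import kll_floats_sketch

n = 1 << 20                      # use 1 << 25 to match the numbers above
values = np.random.randn(n)

sk = kll_floats_sketch(160)
start = time.perf_counter()
for v in values:
    sk.update(v)                 # one Python -> C++ crossing per item
elapsed = time.perf_counter() - start

print(f"median estimate {sk.get_quantile(0.5):.4f} in {elapsed:.1f} s")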

I didn't try anything with trying to batch updates, even though in theory the new object can support that. This was more a test to see the performance impact of using it for all kll sketches.

At some level, if you're already ok taking the speed hit for python vs C++ then maybe it doesn't matter. But >2x still seems significant to me.

  jon

On Thu, May 14, 2020 at 6:54 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Great, I'll be submitting the pull request shortly.  The codebase I'm working with doesn't have any of the changes made in the past week or so; hopefully that isn't too much of a hassle to merge.

As an aside, my employer encourages us to contribute code to libraries like this, so I'm happy to work on additional features for the Python interface as needed.

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Thursday, May 14, 2020 6:56 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

We've been polishing things up for a release, so that was one of several things that we fixed over the last several days. Thank you for finding it!

Anyway, if you're generally happy with the state of things (and are allowed to under any employment terms), I'd encourage you to create a pull request to merge your changes into the main repo. It doesn't need to be perfect as we can always make changes as part of the PR review or post-merge.

Thanks,
  jon


On Mon, May 11, 2020 at 2:25 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Thanks for taking a look, Jon.

I pushed an update that addresses 2 & 4.

#3 is actually something I had a question about. I've tested passing numpy.nan into the update function, and it doesn't appear to break anything (min, max, etc all still work correctly).  However, the reported number of items per sketch counts the nan entries.  Is this the expected behavior, or should the get_n() method return a number that does not count the nans it has seen?  I expected the latter, so I'm worried that numpy's nan is being treated differently.
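The check is essentially the following (shown against the base single-stream sketch for brevity; a sketch of the test rather than the exact code, and whether get_n() counts the NaN is exactly the open question):

import numpy as np
from datasketches import kll_floats_sketch

sk = kll_floats_sketch(160)
for v in [1.0, 2.0, np.nan, 3.0]:
    sk.update(v)

# If NaNs are ignored, get_n() reports 3; if they are counted, 4.
print(sk.get_n(), sk.get_min_value(), sk.get_max_value())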

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Monday, May 11, 2020 4:32 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

I didn't look in super close detail, but the code overall looks pretty good. Comments are below.

Note that not all of these necessarily need changes or replies. I'm just trying to document things we'll want to think about for keeping the library general-purpose (and we can always make changes after merging, of course).

1. I worry the name kll_sketches is confusingly similar to kll_sketch. Maybe vector_kll_sketches? But if there's a way to extend KLL in the future to operate on an entire vector at a time (vs treating each dimension independently) that'd become confusing. I think an inherently vectorized version would be a very different beast, but I always worry I'm not being imaginative enough. If merging into the Apache codebase, I'd probably wait to see what the file looks like with the renaming before a final decision on moving to its own file.

2. What happens if the input to update() has >2 dimensions? If that'd be invalid, we should explicitly check and complain. If it'll Do The Right Thing by operating on the first 2 dimensions (meaning correct indices) that's fine, but otherwise should probably complain.

3. Can this handle sparse input vectors? Not sure how important that is in general, even if your project doesn't require it. kll_sketch will ignore NaNs, so those appearing would mean the number of items per sketch can already differ.

4. I'd probably eat the very slightly increased space and go with 32 bits for the number of dimensions (aka number of sketches). If trying to look at a distribution of values for some machine learning application, it'd be easy to overflow 65k dimensions for some tasks.

5. I imagine you've realized that it's easiest to do unit tests from python in this case. That's another advantage of having this live in the wrapper.

6. Finally, that assert issue is already obsolete :). Asserts were converted if/throw exceptions late last week. It'll be flagged as a conflict in merging, so no worries for now.

Looking good at this point. And as I said, not all of these need changes or comments from you.

  jon

On Mon, May 11, 2020 at 7:09 AM Michael Himes <mh...@knights.ucf.edu>> wrote:
Understood, I went ahead and moved the new class to the kll_wrapper.cpp file -- I'll leave it to you to decide if it's better as its own file.

Also, while gcc 7.4.0 compiles the code without issue, using gcc 7.5.0 throws errors regarding the assert calls in kll_sketch_impl.hpp.  I added an include of assert.h there and then it compiled without issue.  It's possible that other compilers will also complain about that, so maybe this is a good update to the main branch.

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Sunday, May 10, 2020 10:47 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

My only comment without having looked at actual code is that the new class would be more appropriate in the python wrapper. Maybe even drop it in as its own file, as that would decrease recompile time a bit when debugging (that's pybind's suggestion, anyway). Probably not a huge difference with how light these wrappers are.

If this is something that becomes widely used, to where we look at pushing it into the base library, we'd look at whether we could share any data across sketches. But we're far from that point currently. It'd be nice to eventually need to consider that.

  jon

On Sun, May 10, 2020, 7:33 PM leerho <le...@gmail.com>> wrote:
Michael,  this has been a great interchange and certainly will allow us to move forward more quickly.

Thank you for working on this on a Mother's Day Sunday!

I'm sure Alex and Jon may have more questions, when they get a chance to look at it starting tomorrow.

Cheers, and be safe and well!

Lee.

On Sun, May 10, 2020 at 6:25 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Re: testing, so far I've just done glorified unit tests for uniform and normal distributions of varying sizes.  I plan to do some timing tests vs the existing single-sketch Python class to see how it compares for 1, 10, and 100 streams.

1. That makes sense.  One option to allow full Numpy compatibility but without requiring a Python user to use Numpy would be to return everything as lists, rather than Numpy arrays.  Numpy users could then convert those lists into arrays, and non-Numpy users would be unaffected (aside from needing the pybind11/numpy.h header).  Alternatively, some flag could be set when instantiating the object that would control whether things are returned as lists or arrays, though this still requires the numpy.h header file.

2. I didn't change the kll_sketch code; I only defined a new (wrapper) class called kll_sketches, which spawns a user-specified number of sketches.  Each of those sketches is a kll_sketch object and uses all of the existing code for that.  For fast execution in Python, the parallel sketches must be spawned in C++, but the existing Python object could only spawn a single sketch since it wraps the kll_sketch class.  Perhaps the kll_sketches class would be better placed in the python/src/kll_wrapper.cpp file?  I suppose you wouldn't need this class if you weren't using Python.

3. Yes, SerDe is very straight-forward here.  I've marked some stuff as todo's, and that is one of them -- the plan is to do like you described and call the relevant kll_sketch method on each of the sketches and return that to Python in a sensible format.  For deserialization, it would just iterate through them and load them into the kll_sketches object.  I don't require it for my project, so I didn't bother to wrap that yet -- I'll take a look sometime this week after I finish my work for the day, shouldn't take long to do.

4. That makes sense.  Does using Numpy complicate that at all?  My thought is that since under the hood everything is using the existing kll_sketch class, it would have full compatibility with the rest of the library (once SerDe is added in).

Michael
________________________________
From: leerho <le...@gmail.com>>
Sent: Sunday, May 10, 2020 8:42 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Thanks for the link to your code.  My colleagues, Jon and Alex, will take a closer look this next week.  They wrote this code so they are much closer to it than I.

What you have done so far makes sense for you as you want to get this working in the NumPy environment as quickly as possible.  As soon as we start thinking about incorporating this into our library other concerns become important.

1. Adding API calls is the recommended way to add functionality (like NumPy) to a library.  We cannot change API calls in a way that is only useful with NumPy, because it would seriously impact other users of the library that don't need NumPy.  If both sets of calls cannot simultaneously exist in the same sketch API, then we need to consider other alternatives.

2.  Based on our previous discussions, I didn't envision that you would have to change the kll_sketch code itself other than perhaps a "wrapper" class that enables vectorized input to a vector of sketches and a vectorized get result that creates a vector result from a vector of sketches.  This would isolate the changes you need for NumPy from the sketch itself.  This is also much easier to support, maintain and debug.

3. If you don't change the internals of the sketch then SerDe becomes pretty straightforward. I don't know if you need a single serialization that represents a full vector of sketches,  but if you do, then I would just iterate over the individual serdes and figure out how to package it.  I really don't think you want to have to rewrite this low-level stuff.

4. Binary compatibility is critically important for us and I think will be important for you as well.  There are two dimensions of binary compatibility: history and language.  This means that a kll sketch serialized from Java can be successfully read by C++ and vice versa.  Similarly, a kll sketch serialized today will be able to be read many years from now.  Another aspect of this would mean being able to collect, say, 100 sketches that were not created using the NumPy version, and being able to put them together in a NumPy vector; and vice versa.

I hope all of this make sense to you.

Cheers,

Lee.



On Sun, May 10, 2020 at 4:21 PM leerho <le...@gmail.com>> wrote:
Michael,
This is great!  What testing have you been able to do so far?


On Sun, May 10, 2020 at 3:31 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Lee,

Thanks for all of that information, it's quite helpful to get a better understanding of things.

I've put the code on Github if you'd like to take a look: https://github.com/mdhimes/incubator-datasketches-cpp

Changes are
- new class in kll/include/kll_sketch.hpp, w/ associated constructor in kll/include/kll_sketch_impl.hpp.  This class spawns a specified number of sketches.
- new Python interface functions in python/src/kll_wrapper.cpp

The only new dependency introduced is the pybind11/numpy.h header file.  The new Numpy-compatible Python classes retain identical functionality to the existing classes (with minor changes to method names, e.g., get_min_value --> get_min_values), except that I have not yet implemented merging or (de)serialization.  These would be straight-forward to implement, if needed.

Re: characterization tests, I'll take a look at those tests you linked to and see about running them, time and compute resources permitting.

Michael
________________________________
From: leerho <le...@gmail.com>>
Sent: Sunday, May 10, 2020 5:32 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Michael,

Is there a place on GitHub somewhere where I could look at your code so far?  The reason I ask, is before you do a PR, we would like to determine where a contribution such as this should be placed.

Our library is split up among different repositories, determined by language and dependencies.  This keeps the user downloads smaller and more focused.   We have two library repos for the core sketch algorithms, one for Java and one for C++/Python, where the dependencies are very lean, which simplifies integration into other systems.  We have separate repos for adaptors, which depend on one of the core repos. On the Java side, we have separate repos for adaptors for Apache Hive and Apache Pig, as the dependencies for each of these are quite large.  For C++, we have a dedicated repo for the adaptors for PostgreSQL.

Some of our adaptors are hosted with the target system.  For example, our Druid adaptors were contributed directly into Apache Druid.

I assume your code has dependencies on Python, NumPy and DataSketches-cpp. It is not clear to me at the moment whether we should create a separate repo for this or have a separate group of directories in our cpp repo.

****
We have a separate repo for our characterization code, which is not formally "released" as an Apache release.  It exists because we want others to be able to reproduce (or challenge) our claims of speed performance or accuracy.  It is the one repo where we have all languages and many different dependencies.  The coding style is not as rigorous or as well documented as our repos that do have formal releases.

Characterization testing is distinctly different from Unit Tests, which basically checks all the main code paths and makes sure that the program works as it should.  The key metric is code coverage and Unit Tests should be fast as it is run on every check-in of new code.  Characterization is also different from Integration Testing, which is testing how well the code works when integrated into larger systems.

Characterization tests are unique to our kind of library. Because our algorithms are probabilistic in nature, in order to verify accuracy or speed performance we need to run many thousands of trials to eliminate statistical noise in the results.  And when the data is large, this can take a long time.  You can peruse our website for many examples as all the plots result from various characterization studies.  What appears on the website is but a small fraction of all the testing we have done.

There are no "standard" tests as every sketch is different so we have to decide what is important to measure for a particular sketch, but the basic groups are speed and accuracy.

For speed there are many possible measurements, but the basic ones are update speed, merge speed, Serialization / Deserialization speed, get estimate or get result speeds.

For accuracy we want to validate that the sketch is performing within the bounds of the theoretical error distribution.  We want to measure this accuracy in the context of a stand-alone, purely streaming sketch and also in the context of merging many sketches together.
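As a small, simplified example of what an accuracy characterization can look like in Python (assuming kll_floats_sketch(k) and get_quantile() as above; real studies use far more trials and data): stream synthetic data, ask the sketch for a quantile, and measure how far the returned value's true rank is from the requested rank.

import numpy as np
from datasketches import kll_floats_sketch

def rank_error_trial(n=100_000, k=160, rank=0.5, rng=None):
    rng = rng or np.random.default_rng()
    data = rng.standard_normal(n)
    sk = kll_floats_sketch(k)
    for v in data:
        sk.update(v)
    est = sk.get_quantile(rank)
    true_rank = np.searchsorted(np.sort(data), est) / n   # empirical rank of the estimate
    return abs(true_rank - rank)

errors = [rank_error_trial() for _ in range(100)]          # thousands of trials in a real study
print(f"99th percentile rank error: {np.quantile(errors, 0.99):.4f}")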

We also try to do these same tests comparing the results against other alternatives users might have.  We have performed these same characterizations on other publically available sketches as well as against traditional, brute-force approaches to solving the same problem.

For the solution you have developed, we would depend on you to decide what properties would be most important to characterize for users of this solution.  It should be very similar to what you would write in a paper describing this solution;  you want to convince the reader that this is very useful and why.

Since the first sketch you have leveraged is the KLL quantiles sketch, I would think you would want some characterizations similar to what we did for our studies (https://datasketches.apache.org/docs/Quantiles/KLLSketch.html) comparing our older quantiles sketch and the KLL sketch.

****
For the Java characterization tests, we have "standardized" on having small configuration files which define the key parameters of the test.  These are simple text files (https://github.com/apache/incubator-datasketches-characterization/tree/master/src/main/resources) of key-value pairs.  We don't have any centralized definition of these pairs, just that they are human readable and intelligible.  They are different for each type of sketch.

For the C++ tests, we don't have a collection of config files yet (this is one of our TODOs), but the same kind of parameters are set in the code itself.

We will likely want to set up a separate directory for your characterization tests.

I hope you find this helpful.

Cheers,

Lee.

On Sun, May 10, 2020 at 10:05 AM Michael Himes <mh...@knights.ucf.edu>> wrote:
The code is in a good state now.  It can take individual values, lists, or Numpy arrays as input, and it returns Numpy arrays.  There are some additional features, like being able to specify which sketches the user wants to query, e.g., when getting quantiles.

But, I have only done minor testing with uniform and normal distributions.  I'd like to put it through more extensive testing (and some documentation) before releasing it, and it sounds like your characterization tests are the way to go -- it's not science if it's not reproducible!  Is there a standard set of tests for this purpose?  If not, are there standard tests that have been used for the existing codebase?

Michael
________________________________
From: leerho <le...@gmail.com>>
Sent: Saturday, May 9, 2020 7:21 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

This is great.  The first step is to get your project working!  Once you think you are ready, it would be really useful if you could do some characterization testing in the NumPy environment. Characterization tests are what we run to fully understand how a sketch performs over a range of parameters and using thousands to millions of trials.  You can see some of the accuracy and speed performance plots of various sketches on our website.  Sometimes these can take hours to run.  We typically use synthetic data to drive our characterization tests to make them reproducible.

Real data can also be used and one comparison test I would recommend is comparing how long it takes to get approximate results using sketches versus how long it would take to get exact results using brute force methods.  The bigger the data set is the better :)
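For the brute-force side of such a comparison, the exact answer in Michael's setting is just a column-wise quantile over the full (n x d) data, for example (pure NumPy; this assumes the data fits in memory, which is exactly the limitation the sketches remove):

import numpy as np

data = np.random.randn(100_000, 100)            # n vectors of d elements, held entirely in memory
exact_medians = np.quantile(data, 0.5, axis=0)  # one exact median per dimension/stream
print(exact_medians.shape)                      # (100,)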

We don't have much experience with NumPy so this will be a new environment for us.  But before you get too deep into this please get us involved.  We have been characterizing these streaming algorithms for a number of years, and would like to help you.

Cheers,

Lee.

On Sat, May 9, 2020 at 2:18 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
I'm not quite sure what being a committer entails, but yeah I'm happy to contribute.  I can't commit a lot of time to working on it, but with how things went for KLL I don't think it will take a lot of time for the other sketches if they are formatted in a similar manner.  Getting this library integrated into numpy/scipy would be awesome, I'm sure I could get some others in my field to begin using it.

Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Saturday, May 9, 2020 5:06 PM
To: Michael Himes <mh...@knights.ucf.edu>>; dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

This is just awesome!   Would you be interested in becoming a committer on our project?  It is not automatic, but we could work with you to bring you up to speed on the other sketches in the library.  If you could help us integrate DataSketches into NumPy and possibly SciPy (not sure if this is necessary) it would be a very significant contribution and we would definitely want you to be part of our community!

Thanks,

Lee.

On Sat, May 9, 2020 at 1:41 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Lee,

Thanks for the notice, I went ahead and subscribed to the list.

As for Jon's email, this is actually what I have currently implemented!  Once I finish ironing out a couple improvements, I'm going to move some code around to follow the existing coding style, put it on Github, and submit a pull request.

Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Saturday, May 9, 2020 4:22 PM
To: Michael Himes <mh...@knights.ucf.edu>>
Subject: Fwd: Permission to use KLL streaming-quantiles code in free open-source academic software

Hi Michael,
I don't think you saw this email as I doubt you are subscribed to our dev@datasketches.apache.org email list.

We would like to have you as part of our larger community, as others might also have suggestions on how to move your project forward.
You can subscribe by sending an empty email to dev-subscribe@datasketches.apache.org.

Lee.

---------- Forwarded message ---------
From: Jon Malkin <jo...@gmail.com>>
Date: Thu, May 7, 2020 at 4:11 PM
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software
To: <de...@datasketches.apache.org>>
Cc: Lee Rhodes <lr...@verizonmedia.com>>, Edo Liberty <ed...@gmail.com>>, edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>


We're using pybind11 to get a C++ interface with python (vs raw C). The wrappers themselves are quite thin, but they do have examples of calling functions defined in the wrapper as opposed to only the sketch object.

I believe the easiest way to do this will be to define a pretty simple C++ object and create a pybind wrapper for it.  That object would contain a std::vector<kll_sketch>.  Then you'd define an update method for your custom object that iterates through a numpy array and calls update() on the appropriate sketch. You'd also want to define something similar for get_quantile() or whatever other methods you need that iterates through that vector of sketches and returns the result in a numpy array.

That's a pretty lightweight object. And then you'd use a similar thin pybind wrapper around it to make it play nicely with python. Since our C++ library is just templates, you'd end up with a free-standing library, with no requirement that the base datasketches library be involved.
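To pin down the interface being described, here is a pure-Python stand-in (the real suggestion is the thin C++ object above wrapped with pybind11, so the inner Python loop here is exactly the slow path being avoided; the class and method names are placeholders):

import numpy as np
from datasketches import kll_floats_sketch

class VectorKll:
    """Stand-in: one independent KLL sketch per dimension."""
    def __init__(self, d, k=160):
        self.sketches = [kll_floats_sketch(k) for _ in range(d)]

    def update(self, arr):
        arr = np.atleast_2d(arr)                 # accept a single vector or an (n x d) batch
        for row in arr:
            for sk, v in zip(self.sketches, row):
                sk.update(v)                     # the C++ version does this loop natively

    def get_quantiles(self, ranks):
        # returns a (len(ranks) x d) array, one column per dimension
        return np.array([[sk.get_quantile(r) for sk in self.sketches] for r in ranks])

vk = VectorKll(d=5)
for _ in range(1000):
    vk.update(np.random.randn(5))
print(vk.get_quantiles([0.25, 0.5, 0.75]))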

  jon

On Thu, May 7, 2020 at 1:08 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
I would be happy to share whatever I come up with (if anything).  The lack of a Numpy/Scipy implementation is what led me to the DataSketches library, it would be very useful to myself and others if it were a part of Numpy/Scipy.

For what it's worth, passing in a Numpy array and manipulating it from the C++ side is quite easy.  On the other hand, figuring out how to spawn m sketches and pass the values along to that looks like it'll be more challenging; there is a lot of code here and it'll take some time for me to familiarize myself with it.

Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Thursday, May 7, 2020 12:00 PM
To: Michael Himes <mh...@knights.ucf.edu>>
Cc: Edo Liberty <ed...@gmail.com>>; dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>; edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

If you do figure out how to do this, it would be great if you could share it with us.  We would like to extend it to other sketches and submit it as an added functionality to NumPy.  I have been looking at the NumPy and SciPy libraries and have not found anything close to what we have.

Lee.


On Thu, May 7, 2020 at 7:08 AM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Lee, Jon,

Thanks for the information.  I tried to vectorize things this morning and ran into that exact problem -- since the offsets can differ, it leads to slices of different lengths, which wouldn't be possible to store as a single Numpy array.

Lee, your understanding of my problem is spot on.  n vectors of size m, where all m elements of each vector are a float (no NaNs or missing values).  I am interested in quantiles at rank r for each of the m streams.  Only 1 sketch will operate simultaneously, saving/loading the sketch is not required (though it would be a nice feature), and sketches would not need to be merged (no serialization/deserialization).

Not surprisingly, it looks like your original suggestion of handling this on the C++ side is the way to go.  Once I have time to dive into the code, my plan is to write something that implements what you described in the earlier email.

Thanks,
Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Wednesday, May 6, 2020 10:43 PM
To: Michael Himes <mh...@knights.ucf.edu>>
Cc: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>; Edo Liberty <ed...@gmail.com>>; edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>

Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Michael,

One of my colleagues, Jon Malkin, pointed out that the vector-KLL will not work for another reason and that is for each dimension, choosing whether to delete the odd or even values in the compactor must be random and independent of the other dimensions.  Otherwise you might get unwanted correlation effects between the dimensions.

This is another argument that you should have independent compactors for each dimension.  So you might as well stick with individual sketches for each dimension.

Lee.

On Wed, May 6, 2020 at 4:39 PM Lee Rhodes <lr...@verizonmedia.com>> wrote:
Michael,

Allow me to back up for a moment to make sure I understand your problem.

You have a large number of large vectors of the form V_n = {x_i}:  n vectors of size m, where x is a number and x_i is the ith element, or equivalently, the ith dimension.

Assumptions:

  *   All vectors, V, are of the same size m.
  *   All elements, x_i, are valid numbers of the same type. No missing values, and if you are using floats, this means no NaNs.

In aggregate, the n vectors represent m independent distributions of values.

Your task is to be able to obtain m quantiles at rank r in a single query.

****
To do this, using your idea, would require vectorization of the entire sketch and not just the compactors.  The inputs are vectors, the result of operations such as getQuantile(r), getQuantileUpperBound(r), getQuantileLowerBound(r), are also vectors.

This sketch will be a large data structure, which leads to more questions ...

  *   Do you anticipate having many of these vectorized sketches operating simultaneously?
  *   Is there any requirement to store and later retrieve this sketch?
  *   Or, the nearly equivalent question: Do you require merging of these sketches (across clusters, for example)?  Which also means serialization and deserialization.

I am concerned that this vector-quantiles sketch would be limited in the sense that it may not be as widely applicable as it could be.

Our experience with real data is that it is ugly with missing values, NaN, nulls, etc.  Which means we would not be able to vectorize the compactor.  Each dimension i would need a separate independent compactor because the compaction times will vary depending on missing values or NaNs in the data.

Spacewise, I don't think having separate independent sketches for each dimension would be much smaller than vectorizing the entire sketch, because the internals of the existing sketch are already quite space efficient leveraging compact arrays, etc.

As a first step I would favor figuring out how to access the NumPy data structure on the C++ side, having individual sketches for each dimension, and doing the iterations updating the sketches in C++.   It also has the advantage of leveraging code that exists and it would automatically be able to leverage any improvements to the sketch code over time.  In addition, it could be a prototype of how to integrate other sketches into the NumPy ecosystem.

A fully vectorized sketch would be a separate implementation and would not be able to take advantage of these points.

Lee.

























On Wed, May 6, 2020 at 2:47 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Lee,

I don't think there is a problem with the DataSketches library, just that it doesn't support what I am trying to do -- looking in the documentation, it only supports streams of ints or floats, and those situations work fine for me.  Here's what I did:
- began with the KLL test .py file: https://github.com/apache/incubator-datasketches-cpp/blob/master/python/tests/kll_test.py
- replaced line 30 with kll.update(np.ones(10) * randn())  to have a Numpy array of 10 identical values.
- ran the code

This leads to the following error, as expected:
TypeError: update(): incompatible function arguments. The following argument types are supported:
    1. (self: datasketches.kll_floats_sketch, item: float) -> None

Invoked with: <datasketches.kll_floats_sketch object at 0x7f1e128989d0>, array([-1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424,
       -1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424])
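Condensed, the repro is essentially (k chosen arbitrarily here; the test file sets up the sketch with its own parameters):

import numpy as np
from numpy.random import randn
from datasketches import kll_floats_sketch

kll = kll_floats_sketch(160)
kll.update(np.ones(10) * randn())   # raises the TypeError above: update() expects a single float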

It's not coded to support Numpy arrays, therefore it complains.  What I would ideally like to have happen in this scenario is it would treat each element in the array as a separate stream.  Then, later when getting a given quantile, it would give 10 values, one for each stream.  I don't see an easy approach to implementing this on the Python side besides a very slow iterative approach, and admittedly my C++ is quite rusty so I haven't looked into the codebase to see how I might modify things there to support this functionality.

Re: the streaming-quantiles code being easily modified, I believe the only necessary changes would be changing the Compactor class to be a subclass of numpy.ndarray, rather than list, and implementing methods for the list-specific methods that are used, like .append().  Then, it isn't necessary to loop over the streams since we can make use of Numpy's broadcasting, which will handle the looping in its C++ code, as you mentioned.  I'll work on this and see if it really is as straight-forward as it seems.

If you have any advice on how to use DataSketches for my problem, I'm certainly open to that.

Thanks,
Michael

Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Posted by Jon Malkin <jo...@gmail.com>.
Michael,

I've restructured the object to be an actual C++ object with proper
methods. And then I've gotten rid of all the casts to buffer in favor of
just using the py::array_t<> that's passed in. That removes casting
everything to double, and allows for range checks. Now an attempt to access
sketch 7 in a 5-d array doesn't just segfault :)

Looking at pybind docs a bit more, it seems there are no hard guarantees on
data layout in memory with numpy arrays -- if you transpose one, walking
through with a pointer will return items in the wrong order. So update()
ends up using items.at() instead (more on that in a moment). The whole
thing is probably also copying values around more than necessary. Anyway,
we can look at ways to optimize such things eventually, but for now I'm
working on ensuring correctness and at least somewhat graceful failure.

Anyway, item input order. If we have 1-d input, we implicitly assume we
want d updates, one for each dimension in the object. It seems like the
default for numpy is row-major order, which makes sense given C beneath the
hood. But for inputting n points at a time, do you expect the matrix to be
(d x n) or (n x d)?

  jon

On Tue, May 19, 2020, 5:20 PM Michael Himes <mh...@knights.ucf.edu> wrote:

> Re: the template type A, I set that for the Python array data type.  A
> Python float is 64 bits, so that is a C++ double.  I thought it was
> necessary to set the py::array_t data type since I think it's a template,
> but I could be mistaken.
>
> Michael
>
> ------------------------------
> *From:* leerho <le...@gmail.com>
> *Sent:* Tuesday, May 19, 2020 7:46 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Excellent work!
>
> On Tue, May 19, 2020 at 4:04 PM Jon Malkin <jo...@gmail.com> wrote:
>
> I also used k=160, so in this case we matched nicely. And the bunches of
> 2^5 or 2^7 you were testing is exactly what I meant when referring to
> batched inputs. So that's good news.
>
> I'll take a more careful look through the code -- there was something with
> update using arrays of templated type A which was always cast to double,
> for instance. But this is certainly promising.
>
>   jon
>
> On Tue, May 19, 2020 at 3:32 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Great tests (especially with the ordering), Jon!
>
> I did some scaling tests for dimensionality (1, 10, and 100), and this is
> where I think the Numpy version shows its benefits.  I performed a test
> similar to your setup:
> - each sketch has k = 160 (unsure what you used for this value, if it
> matters)
> - 2^25 draws from a normalized Gaussian distribution (numpy.random.normal)
> - get_quantiles(0.5)
>
> d=1    -- 84 s (this is the 123 s case you ran)
> d=10   -- 88 s
> d=100  --  294 s
> d=1000 -- 2298 s (did this one for fun, but there is a lot of variability
> in runtime)
>
> Note that I did not use a single-value method, just the Numpy version.
> Also, I checked the compute cost of the Python loop, and it's about 1
> second, so most of that ~80 seconds is the communication between Python and
> C++.  The scaling relation looks to be better than linear, but there needs
> to be a few more tests here to really determine that.
>
> But, as Lee pointed out, there is non-negligible overhead from crossing
> the bridge between Python and C++.  It's small, but when doing it 2^25
> times it adds up.  The Numpy implementation allows you to cross that bridge
> much less often, albeit at the cost of some extra time programming that
> part.  If I set up a queue that holds 2^5 values and then updates it, it's
> quite a bit better.  Here are the results for the same dimensions as before:
>
> d=1   -- 8 s
> d=10  -- 31 s
> d=100 -- 257 s
>
> So, even with a small queue of 32 values, we see that a single sketch
> using kll_sketches is faster than a kll_sketch by a factor of 2-3.  And
> with the batch set to 2^7 values (this is how I use it in my project):
> d=1   -- 4.2 s
> d=10  -- 27 s
> d=100 -- 251 s
>
> The speed gain doesn't seem to scale with dimensionality, but I think that
> has more to do with the compute overhead of generating the data since Numpy
> tends to be faster when working in 1D vs multiple dimensions.  But we can
> see that it's possible to get runtimes much closer to C++ runtimes than
> would be expected.
>
> Michael
> ------------------------------
> *From:* Jon Malkin <jo...@gmail.com>
> *Sent:* Tuesday, May 19, 2020 4:58 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Well, one thought was maybe we could always use the vectorized kll in
> python and make it (relatively) easy to have it work with only 1 dimension.
> It looks like there's still a non-trivial performance hit from that. But
> wow.. I realized I could try something simple like reversing the
> declaration order of single-update vs vector-update in the wrapper class.
> And that dropped it to 35s!
>
> With that, it may be worth exploring a unified wrapper that handles single
> items or vectors.
>
>   jon
>
> On Tue, May 19, 2020 at 1:52 PM leerho <le...@gmail.com> wrote:
>
> We had a similar issue in Java trying to use JNI to access C code.  Every
> transition across the "boundary" between Java and C took from 10 to 100
> microseconds.  This made the JNI option pretty useless from our
> standpoint.
>
> I don't know python that well, but I could well imagine that there may be
> a similar issue here in moving data between Python and C++.
>
> That being said, compared to brute-force computation of these types of
> queries in Python vs using even these (what we consider slow performing)
> sketches in Python still may be a huge win.
>
> Lee.
>
>
>
> On Tue, May 19, 2020 at 1:28 PM Jon Malkin <jo...@gmail.com> wrote:
>
> I tried comparing the performance of the existing floats sketch vs the new
> thing with a single dimension. And then I made a second update method that
> handles a single item rather than creating an array of length 1 each time.
> Otherwise, the scripts were as identical as possible. I fed in 2^25
> gaussian-distributed values and queried for the mean to force some
> computation on the sketch. I think get_quantile(0.5) vs
> get_quantiles(0.5)[0][0] was the only difference,
>
> Existing kll_floats_sketch: 31s
> kll_floatarray_sketches: 123s
> with single-item update: 80s
>
> Same test in c++: 1.7s  (I can get it to 1.4s but that's using a worse RNG
> so this seemed more fair)
>
> I didn't try anything with trying to batch updates, even though in theory
> the new object can support that. This was more a test to see the
> performance impact of using it for all kll sketches.
>
> At some level, if you're already ok taking the speed hit for python vs C++
> then maybe it doesn't matter. But >2x still seems significant to me.
>
>   jon
>
> On Thu, May 14, 2020 at 6:54 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Great, I'll be submitting the pull request shortly.  The codebase I'm
> working with doesn't have any of the changes made in the past week or so,
> hopefully that isn't too much of a hassle to merge.
>
> As an aside, my employer encourages us to contribute code to libraries
> like this, so I'm happy to work on additional features for the Python
> interface as needed.
>
> Michael
> ------------------------------
> *From:* Jon Malkin <jo...@gmail.com>
> *Sent:* Thursday, May 14, 2020 6:56 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> We've been polishing things up for a release, so that was one of several
> things that we fixed over the last several days. Thank you for finding it!
>
> Anyway, if you're generally happy with the state of things (and are
> allowed to under any employment terms), I'd encourage you to create a pull
> request to merge your changes into the main repo. It doesn't need to be
> perfect as we can always make changes as part of the PR review or
> post-merge.
>
> Thanks,
>   jon
>
>
> On Mon, May 11, 2020 at 2:25 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Thanks for taking a look, Jon.
>
> I pushed an update that addresses 2 & 4.
>
> #3 is actually something I had a question about. I've tested passing
> numpy.nan into the update function, and it doesn't appear to break anything
> (min, max, etc all still work correctly).  However, the reported number of
> items per sketch counts the nan entries.  Is this the expected behavior, or
> should the get_n() method return a number that does not count the nans it
> has seen?  I expected the latter, so I'm worried that numpy's nan is being
> treated differently.
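>
> For reference, the check I ran was essentially the following (shown with the
> single-value sketch for brevity; my wrapper does the same thing via an array
> containing nan):
>
>     import numpy as np
>     from datasketches import kll_floats_sketch
>
>     sk = kll_floats_sketch(160)
>     for x in [1.0, 2.0, np.nan, 3.0]:
>         sk.update(x)
>     # min, max, and quantiles look right, but does get_n() count the nan?
>     print(sk.get_n(), sk.get_min_value(), sk.get_max_value())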
>
> Michael
> ------------------------------
> *From:* Jon Malkin <jo...@gmail.com>
> *Sent:* Monday, May 11, 2020 4:32 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> I didn't look in super close detail, but the code overall looks pretty
> good. Comments are below.
>
> Note that not all of these necessarily need changes or replies. I'm just
> trying to document things we'll want to think about for keeping the library
> general-purpose (and we can always make changes after merging, of course).
>
> 1. I worry the name kll_sketches is confusingly similar to kll_sketch.
> Maybe vector_kll_sketches? But if there's a way to extend KLL in the future
> to operate on an entire vector at a time (vs treating each dimension
> independently) that'd become confusing. I think an inherently vectorized
> version would be a very different beast, but I always worry I'm not being
> imaginative enough. If merging into the Apache codebase, I'd probably wait
> to see what the file looks like with the renaming before a final decision
> on moving to its own file.
>
> 2. What happens if the input to update() has >2 dimensions? If that'd be
> invalid, we should explicitly check and complain. If it'll Do The Right
> Thing by operating on the first 2 dimensions (meaning correct indices)
> that's fine, but otherwise should probably complain.
>
> 3. Can this handle sparse input vectors? Not sure how important that is in
> general, even if your project doesn't require it. kll_sketch will ignore
> NaNs, so those appearing would mean the number of items per sketch can
> already differ.
>
> 4. I'd probably eat the very slightly increased space and go with 32 bits
> for the number of dimensions (aka number of sketches). If trying to look at
> a distribution of values for some machine learning application, it'd be
> easy to overflow 65k dimensions for some tasks.
>
> 5. I imagine you've realized that it's easiest to do unit tests from
> python in this case. That's another advantage of having this live in the
> wrapper.
>
> 6. Finally, that assert issue is already obsolete :). Asserts were
> converted to if/throw exceptions late last week. It'll be flagged as a
> conflict in merging, so no worries for now.
>
> Looking good at this point. And as I said, not all of these need changes
> or comments from you.
>
>   jon
>
> On Mon, May 11, 2020 at 7:09 AM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Understood, I went ahead and moved the new class to the kll_wrapper.cpp
> file -- I'll leave it to you to decide if it's better as its own file.
>
> Also, while gcc 7.4.0 compiles the code without issue, using gcc 7.5.0
> throws errors regarding the assert calls in kll_sketch_impl.hpp.  I added
> an include of assert.h there and then it compiled without issue.  It's
> possible that other compilers will also complain about that, so maybe this
> is a good update to the main branch.
>
> Michael
> ------------------------------
> *From:* Jon Malkin <jo...@gmail.com>
> *Sent:* Sunday, May 10, 2020 10:47 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> My only comment without having looked at actual code is that the new class
> would be more appropriate in the python wrapper. Maybe even drop it in as
> its own file, as that would decrease recompile time a bit when debugging
> (that's pybind's suggestion, anyway). Probably not a huge difference with
> how light these wrappers are.
>
> If this is something that becomes widely used, to where we look at pushing
> it into the base library, we'd look at whether we could share any data
> across sketches. But we're far from that point currently. It'd be a nice
> problem to have to consider that.
>
>   jon
>
> On Sun, May 10, 2020, 7:33 PM leerho <le...@gmail.com> wrote:
>
> Michael,  this has been a great interchange and certainly will allow us to
> move forward more quickly.
>
> Thank you for working on this on a Mother's Day Sunday!
>
> I'm sure Alex and Jon may have more questions, when they get a chance to
> look at it starting tomorrow.
>
> Cheers, and be safe and well!
>
> Lee.
>
> On Sun, May 10, 2020 at 6:25 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Re: testing, so far I've just done glorified unit tests for uniform and
> normal distributions of varying sizes.  I plan to do some timing tests vs
> the existing single-sketch Python class to see how it compares for 1, 10,
> and 100 streams.
>
> 1. That makes sense.  One option to allow full Numpy compatibility but
> without requiring a Python user to use Numpy would be to return everything
> as lists, rather than Numpy arrays.  Numpy users could then convert those
> lists into arrays, and non-Numpy users would be unaffected (aside from
> needing the pybind11/numpy.h header).  Alternatively, some flag could be
> set when instantiating the object that would control whether things are
> returned as lists or arrays, though this still requires the numpy.h header
> file.
>
> 2. I didn't change the kll_sketch code, I only defined a new (wrapper)
> class called kll_sketches, which spawns a user-specified number of
> sketches.  Each of those sketches is a kll_sketch object and uses all of
> the existing code for that.  For fast execution in Python, the parallel
> sketches must be spawned in C++, but the existing Python object could only
> spawn a single sketch since it wraps the kll_sketch class.  Perhaps the
> kll_sketches class would be better placed in the python/src/kll_wrapper.cpp
> file?  I suppose you wouldn't need this class if you weren't using Python.
>
> 3. Yes, SerDe is very straight-forward here.  I've marked some stuff as
> todo's, and that is one of them -- the plan is to do like you described and
> call the relevant kll_sketch method on each of the sketches and return that
> to Python in a sensible format.  For deserialization, it would just iterate
> through them and load them into the kll_sketches object.  I don't require
> it for my project, so I didn't bother to wrap that yet -- I'll take a look
> sometime this week after I finish my work for the day, shouldn't take long
> to do.
>
> 4. That makes sense.  Does using Numpy complicate that at all?  My thought
> is that since under the hood everything is using the existing kll_sketch
> class, it would have full compatibility with the rest of the library (once
> SerDe is added in).
>
> Michael
> ------------------------------
> *From:* leerho <le...@gmail.com>
> *Sent:* Sunday, May 10, 2020 8:42 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Thanks for the link to your code.  My colleagues, Jon and Alex, will take
> a closer look this next week.  They wrote this code so they are much closer
> to it than I.
>
> What you have done so far makes sense for you as you want to get this
> working in the NumPy environment as quickly as possible.  As soon as we
> start thinking about incorporating this into our library other concerns
> become important.
>
> 1. Adding API calls is the recommended way to add functionality (like
> NumPy) to a library.  We cannot change API calls in a way that is only
> useful with NumPy, because it would seriously impact other users of the
> library that don't need NumPy.  If both sets of calls cannot simultaneously
> exist in the same sketch API, then we need to consider other alternatives.
>
> 2.  Based on our previous discussions, I didn't envision that you would
> have to change the kll_sketch code itself other than perhaps a "wrapper"
> class that enables vectorized input to a vector of sketches and a
> vectorized get result that creates a vector result from a vector of
> sketches.  This would isolate the changes you need for NumPy from the
> sketch itself.  This is also much easier to support, maintain and debug.
>
> 3. If you don't change the internals of the sketch then SerDe becomes
> pretty straightforward. I don't know if you need a single serialization
> that represents a full vector of sketches,  but if you do, then I would
> just iterate over the individual serdes and figure out how to package it.
> I really don't think you want to have to rewrite this low-level stuff.
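>
> As a rough illustration (assuming the Python binding exposes serialize() and
> deserialize() the same way the single sketch does), the packaging could be as
> simple as length-prefixing each sketch's bytes:
>
>     import struct
>     from datasketches import kll_floats_sketch
>
>     def pack(sketches):
>         # a count, then each serialized sketch preceded by its length
>         parts = [struct.pack('<I', len(sketches))]
>         for sk in sketches:
>             blob = sk.serialize()
>             parts.append(struct.pack('<I', len(blob)))
>             parts.append(blob)
>         return b''.join(parts)
>
>     def unpack(data):
>         n, = struct.unpack_from('<I', data, 0)
>         pos, sketches = 4, []
>         for _ in range(n):
>             ln, = struct.unpack_from('<I', data, pos)
>             pos += 4
>             sketches.append(kll_floats_sketch.deserialize(data[pos:pos + ln]))
>             pos += ln
>         return sketches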
>
> 4. Binary compatibility is critically important for us and I think will be
> important for you as well.  There are two dimensions of binary
> compatibility: history and language.  This means that a kll sketch
> serialized from Java can be successfully read by C++ and vice versa.
> Similarly, a kll sketch serialized today will be able to be read many years
> from now.     Another aspect of this would mean being able to collect, say,
> 100 sketches that were not created using the NumPy version, and being able
> to put them together in a NumPy vector; and vice versa.
>
> I hope all of this makes sense to you.
>
> Cheers,
>
> Lee.
>
>
>
> On Sun, May 10, 2020 at 4:21 PM leerho <le...@gmail.com> wrote:
>
> Michael,
> This is great!  What testing have you been able to do so far?
>
>
> On Sun, May 10, 2020 at 3:31 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Lee,
>
> Thanks for all of that information, it's quite helpful to get a better
> understanding of things.
>
> I've put the code on Github if you'd like to take a look:
> https://github.com/mdhimes/incubator-datasketches-cpp
>
> Changes are
> - new class in kll/include/kll_sketch.hpp, w/ associated constructor in
> kll/include/kll_sketch_impl.hpp.  This class spawns a specified number of
> sketches.
> - new Python interface functions in python/src/kll_wrapper.cpp
>
> The only new dependency introduced is the pybind11/numpy.h header file.
> The new Numpy-compatible Python classes retain identical functionality to
> the existing classes (with minor changes to method names, e.g.,
> get_min_value --> get_min_values), except that I have not yet implemented
> merging or (de)serialization.  These would be straight-forward to
> implement, if needed.
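>
> Usage from Python looks roughly like this (the constructor arguments and
> method names are just how I have it set up on my branch, so they may change):
>
>     import numpy as np
>     from datasketches import kll_sketches
>
>     sk = kll_sketches(160, 50)            # k, and the number of sketches/dimensions
>     sk.update(np.random.randn(1000, 50))  # 1000 vectors at once, one column per sketch
>     print(sk.get_min_values())            # per-sketch results come back as Numpy arrays
>     print(sk.get_quantiles(0.5))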
>
> Re: characterization tests, I'll take a look at those tests you linked to
> and see about running them, time and compute resources permitting.
>
> Michael
> ------------------------------
> *From:* leerho <le...@gmail.com>
> *Sent:* Sunday, May 10, 2020 5:32 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Michael,
>
> Is there a place on GitHub somewhere where I could look at your code so
> far?  The reason I ask, is before you do a PR, we would like to determine
> where a contribution such as this should be placed.
>
> Our library is split up among different repositories, determined by
> language and dependencies.  This keeps the user downloads smaller and more
> focused.   We have two library repos for the core sketch algorithms, one
> for Java and one for C++/Python, where the dependencies are very lean,
> which simplifies integration into other systems.  We have separate repos
> for adaptors, which depend on one of the core repos. On the Java side, we
> have separate repos for adaptors for Apache Hive and Apache Pig, as the
> dependencies for each of these are quite large.  For C++, we have a
> dedicated repo for the adaptors for PostgreSQL.
>
> Some of our adaptors are hosted with the target system.  For example, our
> Druid adaptors were contributed directly into Apache Druid.
>
> I assume your code has dependencies on Python, NumPy and DataSketches-cpp.
> It is not clear to me at the moment whether we should create a separate
> repo for this or have a separate group of directories in our cpp repo.
>
> ****
> We have a separate repo for our characterization code, which is not
> formally "released" as an Apache release.  It exists because we want others
> to be able to reproduce (or challenge) our claims of speed performance or
> accuracy.  It is the one repo where we have all languages and many
> different dependencies.  The coding style is not as rigorous or as well
> documented as our repos that do have formal releases.
>
> Characterization testing is distinctly different from Unit Tests, which
> basically check all the main code paths and make sure that the program
> works as it should.  The key metric is code coverage and Unit Tests should
> be fast as they are run on every check-in of new code.  Characterization is
> also different from Integration Testing, which is testing how well the code
> works when integrated into larger systems.
>
> Characterization tests are unique to our kind of library. Because our
> algorithms are probabilistic in nature, in order to verify accuracy or
> speed performance we need to run many thousands of trials to eliminate
> statistical noise in the results.  And when the data is large, this can
> take a long time.  You can peruse our website for many examples as all the
> plots result from various characterization studies.  What appears on the
> website is but a small fraction of all the testing we have done.
>
> There are no "standard" tests as every sketch is different so we have to
> decide what is important to measure for a particular sketch, but the basic
> groups are *speed* and *accuracy*.
>
> For speed there are many possible measurements, but the basic ones are
> update speed, merge speed, Serialization / Deserialization speed, get
> estimate or get result speeds.
>
> For accuracy we want to validate that the sketch is performing within the
> bounds of the theoretical error distribution.  We want to measure this
> accuracy in the context of a stand-alone, purely streaming sketch and also
> in the context of merging many sketches together.
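>
> As a toy example of the kind of accuracy study I mean (many trials, measuring
> the rank error of a single estimate against the raw data; the real studies use
> far more trials and much longer streams):
>
>     import numpy as np
>     from datasketches import kll_floats_sketch
>
>     trials, n, k = 100, 2**16, 160
>     errors = []
>     for _ in range(trials):
>         data = np.random.randn(n).astype(np.float32)
>         sk = kll_floats_sketch(k)
>         for x in data:
>             sk.update(x)
>         est = sk.get_quantile(0.5)
>         true_rank = np.mean(data < est)   # empirical rank of the estimated median
>         errors.append(abs(true_rank - 0.5))
>     print(np.percentile(errors, 99))      # compare against the error bound for this k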
>
> We also try to do these same tests comparing the results against other
> alternatives users might have.  We have performed these same
> characterizations on other publicly available sketches as well as against
> traditional, brute-force approaches to solving the same problem.
>
> For the solution you have developed, we would depend on you to decide what
> properties would be most important to characterize for users of this
> solution.  It should be very similar to what you would write in a paper
> describing this solution;  you want to convince the reader that this is
> very useful and why.
>
> Since the first sketch you have leveraged is the KLL quantiles sketch, I
> would think you would want some characterizations similar to what we did
> for our studies
> <https://datasketches.apache.org/docs/Quantiles/KLLSketch.html>
> comparing our older quantiles sketch and the KLL sketch.
>
> ****
> For the Java characterization tests, we have "standardized" on having
> small configuration files which define the key parameters of the test.
> These are simple text files
> <https://github.com/apache/incubator-datasketches-characterization/tree/master/src/main/resources>
> of key-value pairs.  We don't have any centralized definition of these
> pairs, just that they are human readable and intelligible.  They are
> different for each type of sketch.
>
> For the C++ tests, we don't have a collection of config files yet (this is
> one of our TODOs), but the same kind of parameters are set in the code
> itself.
>
> We will likely want to set up a separate directory for your
> characterization tests.
>
> I hope you find this helpful.
>
> Cheers,
>
> Lee.
>
> On Sun, May 10, 2020 at 10:05 AM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> The code is in a good state now.  It can take individual values, lists, or
> Numpy arrays as input, and it returns Numpy arrays.  There are some
> additional features, like being able to specify which sketches to query
> when, e.g., getting quantiles.
>
> But, I have only done minor testing with uniform and normal
> distributions.  I'd like to put it through more extensive testing (and some
> documentation) before releasing it, and it sounds like your
> characterization tests are the way to go -- it's not science if it's not
> reproducible!  Is there a standard set of tests for this purpose?  If not,
> are there standard tests that have been used for the existing codebase?
>
> Michael
> ------------------------------
> *From:* leerho <le...@gmail.com>
> *Sent:* Saturday, May 9, 2020 7:21 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> This is great.  The first step is to get your project working!  Once you
> think you are ready, it would be really useful if you could do some
> characterization testing in the NumPy environment. Characterization tests
> are what we run to fully understand how a sketch performs over a range of
> parameters and using thousands to millions of trials.  You can see some of
> the accuracy and speed performance plots of various sketches on our
> website.  Sometimes these can take hours to run.  We typically use
> synthetic data to drive our characterization tests to make them
> reproducible.
>
> Real data can also be used and one comparison test I would recommend is
> comparing how long it takes to get approximate results using sketches
> versus how long it would take to get exact results using brute force
> methods.  The bigger the data set is the better :)
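>
> For example, a comparison along these lines (a toy version, of course -- the
> interesting cases are exactly the ones where the raw data no longer fits in
> memory):
>
>     import time
>     import numpy as np
>     from datasketches import kll_floats_sketch
>
>     data = np.random.randn(2**22).astype(np.float32)  # stand-in for a real data set
>
>     start = time.time()
>     exact = np.quantile(data, 0.5)          # brute force: needs all the data at once
>     brute_time = time.time() - start
>
>     sk = kll_floats_sketch(160)
>     start = time.time()
>     for x in data:
>         sk.update(x)                        # streaming: bounded memory
>     approx = sk.get_quantile(0.5)
>     sketch_time = time.time() - start
>
>     print(exact, approx, brute_time, sketch_time)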
>
> We don't have much experience with NumPy so this will be a new environment
> for us.  But before you get too deep into this please get us involved.  We
> have been characterizing these streaming algorithms for a number of years,
> and would like to help you.
>
> Cheers,
>
> Lee.
>
> On Sat, May 9, 2020 at 2:18 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> I'm not quite sure what being a committer entails, but yeah I'm happy to
> contribute.  I can't commit a lot of time to working on it, but with how
> things went for KLL I don't think it will take a lot of time for the other
> sketches if they are formatted in a similar manner.  Getting this library
> integrated into numpy/scipy would be awesome, I'm sure I could get some
> others in my field to begin using it.
>
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Saturday, May 9, 2020 5:06 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>; dev@datasketches.apache.org
> <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> This is just awesome!   Would you be interested in becoming a committer on
> our project?  It is not automatic, but we could work with you to bring you
> up to speed on the other sketches in the library.  If you could help us
> integrate DataSketches into NumPy and possibly SciPy (not sure if this is
> necessary) it would be a very significant contribution and we would
> definitely want you to be part of our community!
>
> Thanks,
>
> Lee.
>
> On Sat, May 9, 2020 at 1:41 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Lee,
>
> Thanks for the notice, I went ahead and subscribed to the list.
>
> As for Jon's email, this is actually what I have currently implemented!
> Once I finish ironing out a couple improvements, I'm going to move some
> code around to follow the existing coding style, put it on Github, and
> submit a pull request.
>
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Saturday, May 9, 2020 4:22 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>
> *Subject:* Fwd: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Hi Michael,
> I don't think you saw this email as I doubt you are subscribed to our
> dev@datasketches.apache.org email list.
>
> We would like to have you as part of our larger community, as others might
> also have suggestions on how to move your project forward.
> You can subscribe by sending an empty email to
> dev-subscribe@datasketches.apache.org.
>
> Lee.
>
> ---------- Forwarded message ---------
> From: *Jon Malkin* <jo...@gmail.com>
> Date: Thu, May 7, 2020 at 4:11 PM
> Subject: Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
> To: <de...@datasketches.apache.org>
> Cc: Lee Rhodes <lr...@verizonmedia.com>, Edo Liberty <
> edo.liberty@gmail.com>, edo@edoliberty.com <ed...@edoliberty.com>
>
>
> We're using pybind11 to get a C++ interface with python (vs raw C). The
> wrappers themselves are quite thin, but they do have examples of calling
> functions defined in the wrapper as opposed to only the sketch object.
>
> I believe the easiest way to do this will be to define a pretty simple C++
> object and create a pybind wrapper for it.  That object would contain a
> std::vector<kll_sketch>.  Then you'd define an update method for your
> custom object that iterates through a numpy array and calls update() on the
> appropriate sketch. You'd also want to define something similar for
> get_quantile() or whatever other methods you need that iterates through
> that vector of sketches and returns the result in a numpy array.
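>
> In pure Python, the idea looks something like the class below (the real
> version would run the update loop inside C++, which is the whole point, but
> the interface would be the same; the name is just for illustration):
>
>     import numpy as np
>     from datasketches import kll_floats_sketch
>
>     class vector_of_kll_floats_sketches:
>         def __init__(self, k, d):
>             self.sketches = [kll_floats_sketch(k) for _ in range(d)]
>
>         def update(self, vec):
>             # the C++ version would do this loop without crossing back into Python
>             for sk, x in zip(self.sketches, np.asarray(vec, dtype=np.float32)):
>                 sk.update(float(x))
>
>         def get_quantile(self, rank):
>             return np.array([sk.get_quantile(rank) for sk in self.sketches])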
>
> That's a pretty lightweight object. And then you'd use a similar thin
> pybind wrapper around it to make it play nicely with python. Since our C++
> library is just templates, you'd end up with a free-standing library, with
> no requirement that the base datasketches library be involved.
>
>   jon
>
> On Thu, May 7, 2020 at 1:08 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> I would be happy to share whatever I come up with (if anything).  The lack
> of a Numpy/Scipy implementation is what led me to the DataSketches library;
> it would be very useful to me and others if it were a part of
> Numpy/Scipy.
>
> For what it's worth, passing in a Numpy array and manipulating it from the
> C++ side is quite easy.  On the other hand, figuring out how to spawn m
> sketches and pass the values along to that looks like it'll be more
> challenging; there is a lot of code here and it'll take some time for me to
> familiarize myself with it.
>
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Thursday, May 7, 2020 12:00 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>
> *Cc:* Edo Liberty <ed...@gmail.com>; dev@datasketches.apache.org <
> dev@datasketches.apache.org>; edo@edoliberty.com <ed...@edoliberty.com>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> If you do figure out how to do this, it would be great if you could share
> it with us.  We would like to extend  it to other sketches and submit it as
> an added functionality to NumPy.  I have been looking at the NumPy and
> SciPy libraries and have not found anything close to what we have.
>
> Lee.
>
>
> On Thu, May 7, 2020 at 7:08 AM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Lee, Jon,
>
> Thanks for the information.  I tried to vectorize things this morning and
> ran into that exact problem -- since the offsets can differ, it leads to
> slices of different lengths, which wouldn't be possible to store as a
> single Numpy array.
>
> Lee, your understanding of my problem is spot on.  n vectors of size m,
> where all m elements of each vector are a float (no NaNs or missing
> values).  I am interested in quantiles at rank r for each of the m
> streams.  Only 1 sketch will operate simultaneously, saving/loading the
> sketch is not required (though it would be a nice feature), and sketches
> would not need to be merged (no serialization/deserialization).
>
> Not surprisingly, it looks like your original suggestion of handling this
> on the C++ side is the way to go.  Once I have time to dive into the code,
> my plan is to write something that implements what you described in the
> earlier email.
>
> Thanks,
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Wednesday, May 6, 2020 10:43 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>
> *Cc:* dev@datasketches.apache.org <de...@datasketches.apache.org>; Edo
> Liberty <ed...@gmail.com>; edo@edoliberty.com <ed...@edoliberty.com>
>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Michael,
>
> One of my colleagues, Jon Malkin, pointed out that the vector-KLL will not
> work for another reason: for each dimension, choosing whether to
> delete the odd or even values in the compactor must be random and
> independent of the other dimensions.  Otherwise you might get unwanted
> correlation effects between the dimensions.
>
> This is another argument that you should have independent compactors for
> each dimension.  So you might as well stick with individual sketches for
> each dimension.
>
> Lee.
>
> On Wed, May 6, 2020 at 4:39 PM Lee Rhodes <lr...@verizonmedia.com>
> wrote:
>
> Michael,
>
> Allow me to back up for a moment to make sure I understand your problem.
>
> You have a large number of large vectors of the form *V_n = {x_i}:*  *n*
> vectors of size *m, *where *x* is a *number* and *x_i* is the *i*th
> element, or equivalently, the *i*th dimension.
>
> Assumptions:
>
>    - All vectors, *V*, are of the same size *m.*
>    - All elements, *x_i*, are valid numbers of the same type. No missing
>    values, and if you are using *floats*, this means no *NaN*s.
>
> In aggregate, the *n* vectors represent *m* *independent* distributions
> of values.
>
> Your task is to be able to obtain *m* quantiles at rank *r* in a single
> query.
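>
> In NumPy terms, the exact (brute force) version of that query would be:
>
>     import numpy as np
>
>     n, m, r = 10000, 5, 0.5
>     V = np.random.randn(n, m)          # n vectors of length m
>     exact = np.quantile(V, r, axis=0)  # one quantile per dimension; the sketch
>     print(exact)                       # would approximate this without storing V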
>
> ****
> To do this, using your idea, would require vectorization of the entire
> sketch and not just the compactors.  The inputs are vectors, and the results
> of operations such as getQuantile(r), getQuantileUpperBound(r), and
> getQuantileLowerBound(r) are also vectors.
>
> This sketch will be a large data structure, which leads to more questions
> ...
>
>    - Do you anticipate having many of these vectorized sketches operating
>    simultaneously?
>    - Is there any requirement to store and later retrieve this sketch?
>    - Or, the nearly equivalent question: Do you require merging of these
>    sketches (across clusters, for example)?  Which also means serialization
>    and deserialization.
>
> I am concerned that this vector-quantiles sketch would be limited in the
> sense that it may not be as widely applicable as it could be.
>
> Our experience with real data is that it is ugly with missing values, NaN,
> nulls, etc.  Which means we would not be able to vectorize the compactor.
> Each dimension *i* would need a separate independent compactor because
> the compaction times will vary depending on missing values or NaNs in the
> data.
>
> Spacewise, I don't think having separate independent sketches for each
> dimension would be much smaller than vectorizing the entire sketch, because
> the internals of the existing sketch are already quite space efficient
> leveraging compact arrays, etc.
>
> As a first step I would favor figuring out how to access the NumPy data
> structure on the C++ side, having individual sketches for each
> dimension, and doing the iterations updating the sketches in C++.   It also
> has the advantage of leveraging code that exists and it would automatically
> be able to leverage any improvements to the sketch code over time.  In
> addition, it could be a prototype of how to integrate other sketches into
> the NumPy ecosystem.
>
> A fully vectorized sketch would be a separate implementation and would not
> be able to take advantage of these points.
>
> Lee.
>
>
> On Wed, May 6, 2020 at 2:47 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Lee,
>
> I don't think there is a problem with the DataSketches library, just that
> it doesn't support what I am trying to do -- looking in the documentation,
> it only supports streams of ints or floats, and those situations work fine
> for me.  Here's what I did:
> - began with the KLL test .py file:
> https://github.com/apache/incubator-datasketches-cpp/blob/master/python/tests/kll_test.py
> - replaced line 30 with kll.update(np.ones(10) * randn())  to have a Numpy
> array of 10 identical values.
> - ran the code
>
> This leads to the following error, as expected:
> TypeError: update(): incompatible function arguments. The following
> argument types are supported:
>     1. (self: datasketches.kll_floats_sketch, item: float) -> None
>
> Invoked with: <datasketches.kll_floats_sketch object at 0x7f1e128989d0>,
> array([-1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424,
>        -1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424])
>
> It's not coded to support Numpy arrays, therefore it complains.  What I
> would ideally like to have happen in this scenario is it would treat each
> element in the array as a separate stream.  Then, later when getting a
> given quantile, it would give 10 values, one for each stream.  I don't see
> an easy approach to implementing this on the Python side besides a very
> slow iterative approach, and admittedly my C++ is quite rusty so I haven't
> looked into the codebase to see how I might modify things there to support
> this functionality.
>
> Re: the streaming-quantiles code being easily modified, I believe the only
> necessary changes would be changing the Compactor class to be a subclass of
> numpy.ndarray, rather than list, and implementing methods for the
> list-specific methods that are used, like .append().  Then, it isn't
> necessary to loop over the streams since we can make use of Numpy's
> broadcasting, which will handle the looping in its C++ code, as you
> mentioned.  I'll work on this and see if it really is as straight-forward
> as it seems.
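>
> Very roughly, the kind of change I have in mind looks like this (illustrative
> only, not the actual kll.py code, and skipping all of the capacity and level
> bookkeeping):
>
>     import numpy as np
>
>     class VectorCompactor:
>         def __init__(self):
>             self.rows = []                              # each entry is a length-m vector
>
>         def append(self, vec):
>             self.rows.append(np.asarray(vec, dtype=float))
>
>         def compact(self):
>             buf = np.sort(np.array(self.rows), axis=0)  # sort each of the m streams
>             keep = buf[np.random.randint(2)::2]         # keep the odd or even rows
>             self.rows = []
>             return keep                                 # these get promoted upward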
>
> If you have any advice on how to use DataSketches for my problem, I'm
> certainly open to that.
>
> Thanks,
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Wednesday, May 6, 2020 4:37 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>; dev@datasketches.apache.org
> <de...@datasketches.apache.org>
> *Cc:* Edo Liberty <ed...@gmail.com>; edo@edoliberty.com <
> edo@edoliberty.com>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Michael,
>
> Thank you for considering the DataSketches library.   I am adding this
> thread to our dev@datasketches.apache.org so that our whole team can
> contribute to finding a solution for you.
>
> WRT the error you experienced, please help us help you by sharing with us
> what the exact error was.
>
> We are about to release a major upgrade to the DataSketches C++/Python
> product in the next few weeks.  We have fixed a number of stability issues
> and bugs, which may solve the problem.  Nonetheless, we want to work with
> you to get your problem solved.
>
> Updating 1e5 sketches in a system is not a problem in Java or C++.   We
> have real-time systems today that generate and process over 1e9 sketches
> every day.  Unfortunately our experience tells us that looping in Python
> code will be 10 to 100 times slower than Java or C++.  This is because the
> code would have to switch from Python to C++ for every vector element.
>
> By comparison, the streaming-quantiles code could be easily modified to
> use Numpy arrays and operate on vectors.
>
>
> I would like to understand more about what you have in mind that would be
> "easily modified".
>
> NumPy achieves its speed performance by doing all of the matrix operations
> in pre-compiled C++ code.  To achieve best performance, we would want to
> read and loop through the NumPy data structure on the C++ side leveraging
> the C++ DataSketches library directly.  I am not sure what would be
> involved to actually accomplish that.
>
> But first we need to get your Python + NumPy code working correctly with
> our library so we can find out what its actual performance is.
>
> Cheers,
>
> Lee.
>
>
>
>
>
> On Wed, May 6, 2020 at 12:10 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Edo, Lee,
>
> Thanks for the prompt response.  I looked at the datasketches library, and
> while it seems to have a lot more features, it looks like it'll be a lot
> more difficult to get it to work for my desired use case.
>
> My problem is that I need quantiles for each element of a vector (length
> on the order of 1e4 -- 1e5), for some finite stream of vectors (on the
> order of 1e6 -- 1e8).  I tried using datasketches's KLL with Numpy arrays,
> but it throws an error, so it doesn't seem like datasketches handles this
> situation currently.
>
> To use datasketches, I think I would need to instantiate 1 object per
> vector element, and I suspect this will slow things down considerably due
> to iterating over the objects when each vector is processed.  By
> comparison, the streaming-quantiles code could be easily modified to use
> Numpy arrays and operate on vectors.  I ran a few unit tests on both codes
> and found equivalent behavior, as expected.
>
> Do you have any recommendation(s) for this situation?  Are there known
> limitations of the streaming-quantiles code that would cause issues for my
> use case?  Are the other methods offered in datasketches 'better' than the
> KLL implemented in streaming-quantiles?  I'm quite out of my area of
> expertise, so I appreciate any advice you can offer, and I will of course
> acknowledge it in the publication.
>
> Best,
> Michael
>
> ------------------------------
> *From:* Edo Liberty <ed...@gmail.com>
> *Sent:* Tuesday, May 5, 2020 8:09 PM
> *To:* Lee Rhodes <lr...@verizonmedia.com>; Michael Himes <
> mhimes@knights.ucf.edu>
> *Cc:* edo@edoliberty.com <ed...@edoliberty.com>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> +Lee
>
> Hi Michael, Thanks for reaching out.
> While you can certainly do that, I recommend using the Python-bound
> datasketches library. It will be more robust, faster, and bug free than my
> code :)
>
> On Tue, May 5, 2020 at 14:11 Michael Himes <mh...@knights.ucf.edu> wrote:
>
> Hi Edo,
>
> I'm currently working on a Python package for machine-learning-accelerated
> exoplanet modeling.  It is free and open source (see here if you're curious
> https://github.com/exosports/HOMER),
> and it's meant purely for reproducible academic research.
>
> I'm adding some new features to the software, and one of them requires
> computing quantiles for a data set that cannot fit into memory.  After
> searching around for different methods to do this, your KLL method seemed
> to be a good option in terms of speed and space requirements.
>
> Rather than reinvent the wheel and code my own implementation of the
> method from scratch, I was wondering if you'd be willing to allow me to use
> your code?  I don't see a license, so I wanted to make sure you're okay
> with this.  I could implement it as a submodule within my repo, or I could
> only include the kll.py file and add some additional comments pointing to
> your repo and such, whichever you prefer.
>
> Best,
> Michael
>
> --
> From my cell phone.
>
>

Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Posted by Michael Himes <mh...@knights.ucf.edu>.
Re: the template type A, I set that for the Python array data type.  A Python float is 64 bits, so that is a C++ double.  I thought it was necessary to set the py::array_t data type since I think it's a template, but I could be mistaken.

Michael

________________________________
From: leerho <le...@gmail.com>
Sent: Tuesday, May 19, 2020 7:46 PM
To: dev@datasketches.apache.org <de...@datasketches.apache.org>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Excellent work!

On Tue, May 19, 2020 at 4:04 PM Jon Malkin <jo...@gmail.com> wrote:
I also used k=160, so in this case we matched nicely. And the bunches of 2^5 or 2^7 you were testing is exactly what I meant when referring to batched inputs. So that's good news.

I'll take a more careful look through the code -- there was something with update using arrays of templated type A which was always cast to double, for instance. But this is certainly promising.

  jon

On Tue, May 19, 2020 at 3:32 PM Michael Himes <mh...@knights.ucf.edu> wrote:
Great tests (especially with the ordering), Jon!

I did some scaling tests for dimensionality (1, 10, and 100), and this is where I think the Numpy version shows its benefits.  I performed a test similar to your setup:
- each sketch has k = 160 (unsure what you used for this value, if it matters)
- 2^25 draws from a normalized Gaussian distribution (numpy.random.normal)
- get_quantiles(0.5)

d=1    -- 84 s (this is the 123 s case you ran)
d=10   -- 88 s
d=100  --  294 s
d=1000 -- 2298 s (did this one for fun, but there is a lot of variability in runtime)

Note that I did not use a single-value method, just the Numpy version.  Also, I checked the compute cost of the Python loop, and it's about 1 second, so most of that ~80 seconds is the communication between Python and C++.  The scaling relation looks to be better than linear, but a few more tests are needed to really determine that.

But, as Lee pointed out, there is non-negligible overhead from crossing the bridge between Python and C++.  It's small, but when doing it 2^25 times it adds up.  The Numpy implementation allows you to cross that bridge much less often, albeit at the cost of some extra time programming that part.  If I set up a queue that holds 2^5 values and then updates the sketches with the whole batch at once, it's quite a bit better.  Here are the results for the same dimensions as before:

d=1   -- 8 s
d=10  -- 31 s
d=100 -- 257 s

So, even with a small queue of 32 values, we see that a single sketch using kll_sketches is faster than a kll_sketch by a factor of 2-3.  And with the batch set to 2^7 values (this is how I use it in my project):
d=1   -- 4.2 s
d=10  -- 27 s
d=100 -- 251 s

The speed gain doesn't seem to scale with dimensionality, but I think that has more to do with the compute overhead of generating the data since Numpy tends to be faster when working in 1D vs multiple dimensions.  But we can see that it's possible to get runtimes much closer to C++ runtimes than would be expected.
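
As a rough sketch of the batching (class and method names as in my kll_sketches wrapper; the batch size and constructor arguments are just what I happen to use):

    import numpy as np
    from datasketches import kll_sketches

    d, batch = 100, 2**7
    sk = kll_sketches(160, d)
    buf = np.empty((batch, d), dtype=np.float32)

    for _ in range(2**25 // batch):
        buf[:] = np.random.normal(size=(batch, d))  # stand-in for the real data stream
        sk.update(buf)                              # one Python -> C++ crossing per 128 vectors
    print(sk.get_quantiles(0.5))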

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Tuesday, May 19, 2020 4:58 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Well, one thought was maybe we could always use the vectorized kll in python and make it (relatively) easy to have it work with only 1 dimension. It looks like there's still a non-trivial performance hit from that. But wow.. I realized I could try something simple like reversing the declaration order of single-update vs vector-update in the wrapper class. And that dropped it to 35s!

With that, it may be worth exploring a unified wrapper that handles single items or vectors.

  jon

On Tue, May 19, 2020 at 1:52 PM leerho <le...@gmail.com>> wrote:
We had a similar issue in Java trying to use JNI to access C code.  Every transition across the "boundary" between Java and C took from 10 to 100 microseconds.  This made the JNI option pretty useless from our standpoint.

I don't know python that well, but I could well imagine that there may be a similar issue here in moving data between Python and C++.

That being said, compared to brute-force computation of these types of queries in Python vs using even these (what we consider slow performing) sketches in Python still may be a huge win.

Lee.



On Tue, May 19, 2020 at 1:28 PM Jon Malkin <jo...@gmail.com>> wrote:
I tried comparing the performance of the existing floats sketch vs the new thing with a single dimension. And then I made a second update method that handles a single item rather than creating an array of length 1 each time. Otherwise, the scripts were as identical as possible. I fed in 2^25 gaussian-distributed values and queried for the mean to force some computation on the sketch. I think get_quantile(0.5) vs get_quantiles(0.5)[0][0] was the only difference,

Existing kll_floats_sketch: 31s
kll_floatarray_sketches: 123s
with single-item update: 80s

Same test in c++: 1.7s  (I can get it to 1.4s but that's using a worse RNG so this seemed more fair)

I didn't try anything with trying to batch updates, even though in theory the new object can support that. This was more a test to see the performance impact of using it for all kll sketches.

At some level, if you're already ok taking the speed hit for python vs C++ then maybe it doesn't matter. But >2x still seems significant to me.

  jon

On Thu, May 14, 2020 at 6:54 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Great, I'll be submitting the pull request shortly.  The codebase I'm working with doesn't have any of the changes made in the past week or so, hopefully that isn't too much of a hassle to merge.

As an aside, my employer encourages us to contribute code to libraries like this, so I'm happy to work on additional features for the Python interface as needed.

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Thursday, May 14, 2020 6:56 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

We've been polishing things up for a release, so that was one of several things that we fixed over the last several days. Thank you for finding it!

Anyway, if you're generally happy with the state of things (and are allowed to under any employment terms), I'd encourage you to create pull request to merge your changes into the main repo. It doesn't need to be perfect as we can always make changes as part of the PR review or post-merge.

Thanks,
  jon


On Mon, May 11, 2020 at 2:25 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Thanks for taking a look, Jon.

I pushed an update that address 2 & 4.

#3 is actually something I had a question about. I've tested passing numpy.nan into the update function, and it doesn't appear to break anything (min, max, etc all still work correctly).  However, the reported number of items per sketch counts the nan entries.  Is this the expected behavior, or should the get_n() method return a number that does not count the nans it has seen?  I expected the latter, so I'm worried that numpy's nan is being treated differently.

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Monday, May 11, 2020 4:32 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

I didn't look in super close detail, but the code overall looks pretty good. Comments are below.

Note that not all of these necessarily need changes or replies. I'm just trying to document things we'll want to think about for keeping the library general-purpose (and we can always make changes after merging, of course).

1. I worry the name kll_sketches is confusingly similar to kll_sketch. Maybe vector_kll_sketches? But if there's a way to extend KLL in the future to operate on an entire vector at a time (vs treating each dimension independently) that'd become confusing. I think an inherently vectorized version would be a very different beast, but I always worry I'm not being imaginative enough. If merging into the Apache codebase, I'd probably wait to see what the file looks like with the renaming before a final decision on moving to its own file.

2. What happens if the input to update() has >2 dimensions? If that'd be invalid, we should explicitly check and complain. If it'll Do The Right Thing by operating on the first 2 dimensions (meaning correct indices) that's fine, but otherwise should probably complain.

3. Can this handle sparse input vectors? Not sure how important that is in general, even if your project doesn't require it. kll_sketch will ignore NaNs, so those appearing would mean the number of items per sketch can already differ.

4. I'd probably eat the very slightly increased space and go with 32 bits for the number of dimensions (aka number of sketches). If trying to look at a distribution of values for some machine learning application, it'd be easy to overflow 65k dimensions for some tasks.

5. I imagine you've realized that it's easiest to do unit tests from python in this case. That's another advantage of having this live in the wrapper.

6. Finally, that assert issue is already obsolete :). Asserts were converted if/throw exceptions late last week. It'll be flagged as a conflict in merging, so no worries for now.

Looking good at this point. And as I said, not all of these need changes or comments from you.

  jon

On Mon, May 11, 2020 at 7:09 AM Michael Himes <mh...@knights.ucf.edu>> wrote:
Understood, I went ahead and moved the new class to the kll_wrapper.cpp file -- I'll leave it to you to decide if it's better as its own file.

Also, while gcc 7.4.0 compiles the code without issue, using gcc 7.5.0 throws errors regarding the assert calls in kll_sketch_impl.hpp.  I added an include of assert.h there and then it compiled without issue.  It's possible that other compilers will also complain about that, so maybe this is a good update to the main branch.

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Sunday, May 10, 2020 10:47 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

My only comment without having looked at actual code is that the new class would be more appropriate in the python wrapper. Maybe even drop it in as it's own file, as that would decrease recompile time a bit when debugging (that's pybind's suggestion, anyway). Probably not a huge difference with how light these wrappers are.

If this is something that becomes widely used, to where we look at pushing it into the base library, we'd look at whether we could share any data across sketches. But we're far from that point currently. It'd be nice to need to consider that.

  jon

On Sun, May 10, 2020, 7:33 PM leerho <le...@gmail.com>> wrote:
Michael,  this has been a great interchange and certainly will allow us to move forward more quickly.

Thank you for working on this on a Mother's Day Sunday!

I'm sure Alex and Jon may have more questions, when they get a chance to look at it starting tomorrow.

Cheers, and be safe and well!

Lee.

On Sun, May 10, 2020 at 6:25 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Re: testing, so far I've just done glorified unit tests for uniform and normal distributions of varying sizes.  I plan to do some timing tests vs the existing single-sketch Python class to see how it compares for 1, 10, and 100 streams.

1. That makes sense.  One option to allow full Numpy compatibility but without requiring a Python user to use Numpy would be to return everything as lists, rather than Numpy arrays.  Numpy users could then convert those lists into arrays, and non-Numpy users would be unaffected (aside from needing the pybind11/numpy.h header).  Alternatively, some flag could be set when instantiating the object that would control whether things are returned as lists or arrays, though this still requires the numpy.h header file.

2. I didn't change the kll_sketch code, I only defined a new (wrapper) class called kll_sketches, which spawns a user-specified number of sketches.  Each of those sketches are kll_sketch objects and uses all of the existing code for that.  For fast execution in Python, the parallel sketches must be spawned in C++, but the existing Python object could only spawn a single sketch since it wraps the kll_sketch class.  Perhaps the kll_sketches class would be better placed in the python/src/kll_wrapper.cpp file?  I suppose you wouldn't need this class if you weren't using Python.

3. Yes, SerDe is very straight-forward here.  I've marked some stuff as todo's, and that is one of them -- the plan is to do like you described and call the relevant kll_sketch method on each of the sketches and return that to Python in a sensible format.  For deserialization, it would just iterate through them and load them into the kll_sketches object.  I don't require it for my project, so I didn't bother to wrap that yet -- I'll take a look sometime this week after I finish my work for the day, shouldn't take long to do.

4. That makes sense.  Does using Numpy complicate that at all?  My thought is that since under the hood everything is using the existing kll_sketch class, it would have full compatibility with the rest of the library (once SerDe is added in).

Michael
________________________________
From: leerho <le...@gmail.com>>
Sent: Sunday, May 10, 2020 8:42 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Thanks for the link to your code.  My colleagues, Jon and Alex, will take a closer look this next week.  They wrote this code so they are much closer to it than I.

What you have done so far makes sense for you as you want to get this working in the NumPy environment as quickly as possible.  As soon as we start thinking about incorporating this into our library other concerns become important.

1. Adding API calls is the recommended way to add functionality (like NumPy) to a library.  We cannot change API calls in a way that is only useful with NumPy, because it would seriously impact other users of the library that don't need NumPy.  If both sets of calls cannot simultaneously exist in the same sketch API, then we need to consider other alternatives.

2.  Based on our previous discussions, I didn't envision that you would have to change the kll_sketch code itself other than perhaps a "wrapper" class that enables vectorized input to a vector of sketches and a vectorized get result that creates a vector result from a vector of sketches.  This would isolate the changes you need for NumPy from the sketch itself.  This is also much easier to support, maintain and debug.

3. If you don't change the internals of the sketch then SerDe becomes pretty straightforward. I don't know if you need a single serialization that represents a full vector of sketches,  but if you do, then I would just iterate over the individual serdes and figure out how to package it.  I really don't think you want to have to rewrite this low-level stuff.

4. Binary compatibility is critically important for us and I think will be important for you as well.  There are two dimensions of binary compatibility: history and language.  This means that a kll sketch serialized from Java, can be successfully read by C++ and visa versa.  Similarly, a kll sketch serialized today will be able to be read many years from now.     Another aspect of this would mean being able to collect, say, 100 sketches that were not created using the NumPy version, and being able to put them together in a NumPy vector; and visa versa.

I hope all of this make sense to you.

Cheers,

Lee.



On Sun, May 10, 2020 at 4:21 PM leerho <le...@gmail.com>> wrote:
Michael,
This is great!  What testing have you been able to do so far?


On Sun, May 10, 2020 at 3:31 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Lee,

Thanks for all of that information, it's quite helpful to get a better understanding of things.

I've put the code on Github if you'd like to take a look: https://github.com/mdhimes/incubator-datasketches-cpp<https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fmdhimes%2Fincubator-datasketches-cpp&data=02%7C01%7Cmhimes%40knights.ucf.edu%7C724003c123964738f5c608d7fc4eec77%7C5b16e18278b3412c919668342689eeb7%7C0%7C0%7C637255288231864264&sdata=SJp244g4tj6d%2B1Z0jS1uldEs7VUuKjyNJrO8MpGZtkE%3D&reserved=0>

Changes are
- new class in kll/include/kll_sketch.hpp, w/ associated constructor in kll/include/kll_sketch_impl.hpp.  This class spawns a specified number of sketches.
- new Python interface functions in python/src/kll_wrapper.cpp

The only new dependency introduced is the pybind11/numpy.h header file.  The new Numpy-compatible Python classes retain identical functionality to the existing classes (with minor changes to method names, e.g., get_min_value --> get_min_values), except that I have not yet implemented merging or (de)serialization.  These would be straight-forward to implement, if needed.

Re: characterization tests, I'll take a look at those tests you linked to and see about running them, time and compute resources permitting.

Michael
________________________________
From: leerho <le...@gmail.com>>
Sent: Sunday, May 10, 2020 5:32 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Michael,

Is there a place on GitHub somewhere where I could look at your code so far?  The reason I ask, is before you do a PR, we would like to determine where a contribution such as this should be placed.

Our library is split up among different repositories, determined by language and dependencies.  This keeps the user downloads smaller and more focused.   We have two library repos for the core sketch algorithms, one for Java and one for C++/Python, where the dependencies are very lean, which simplifies integration into other systems.  We have separate repos for adaptors, which depend on one of the core repos. On the Java side, we have separate repos for adaptors for Apache Hive and Apache Pig, as the dependencies for each of these are quite large.  For C++, we have a dedicated repo for the adaptors for PostgreSQL.

Some of our adaptors are hosted with the target system.  For example, our Druid adaptors were contributed directly into Apache Druid.

I assume your code has dependencies on Python, NumPy and DataSketches-cpp. It is not clear to me at the moment whether we should create a separate repo for this or have a separate group of directories in our cpp repo.

****
We have a separate repo for our characterization code, which is not formally "released" as an Apache release.  It exists because we want others to be able to reproduce (or challenge) our claims of speed performance or accuracy.  It is the one repo where we have all languages and many different dependencies.  The coding style is not as rigorous or as well documented as our repos that do have formal releases.

Characterization testing is distinctly different from Unit Tests, which basically check all the main code paths and make sure that the program works as it should.  The key metric is code coverage, and Unit Tests should be fast since they run on every check-in of new code.  Characterization is also different from Integration Testing, which is testing how well the code works when integrated into larger systems.

Characterization tests are unique to our kind of library. Because our algorithms are probabilistic in nature, in order to verify accuracy or speed performance we need to run many thousands of trials to eliminate statistical noise in the results.  And when the data is large, this can take a long time.  You can peruse our website for many examples as all the plots result from various characterization studies.  What appears on the website is but a small fraction of all the testing we have done.

There are no "standard" tests as every sketch is different so we have to decide what is important to measure for a particular sketch, but the basic groups are speed and accuracy.

For speed there are many possible measurements, but the basic ones are update speed, merge speed, Serialization / Deserialization speed, get estimate or get result speeds.

For accuracy we want to validate that the sketch is performing within the bounds of the theoretical error distribution.  We want to measure this accuracy in the context of a stand-alone, purely streaming sketch and also in the context of merging many sketches together.
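
As a rough, self-contained illustration (not one of the project's actual characterization suites), a single accuracy trial in Python could compare the sketch's quantile estimates against exact ranks computed from the sorted data; a real characterization would repeat this over many trials and stream lengths and aggregate the rank-error distribution.

import numpy as np
from datasketches import kll_floats_sketch

def rank_errors(n=100000, k=200, seed=0):
    # one trial: stream n random values, then compare estimated vs. exact normalized ranks
    rng = np.random.default_rng(seed)
    data = rng.standard_normal(n).astype(np.float32)
    sk = kll_floats_sketch(k)
    for x in data:
        sk.update(float(x))
    data.sort()
    errors = []
    for q in (0.01, 0.25, 0.5, 0.75, 0.99):
        est = sk.get_quantile(q)                     # the sketch's estimate of the q-quantile
        true_rank = np.searchsorted(data, est) / n   # exact rank of that estimate in the sorted data
        errors.append(abs(true_rank - q))            # rank error; should stay within the sketch's bounds
    return errors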

We also try to do these same tests comparing the results against other alternatives users might have.  We have performed these same characterizations on other publicly available sketches as well as against traditional, brute-force approaches to solving the same problem.

For the solution you have developed, we would depend on you to decide what properties would be most important to characterize for users of this solution.  It should be very similar to what you would write in a paper describing this solution;  you want to convince the reader that this is very useful and why.

Since the first sketch you have leveraged is the KLL quantiles sketch, I would think you would want some characterizations similar to what we did for our studies (https://datasketches.apache.org/docs/Quantiles/KLLSketch.html) comparing our older quantiles sketch and the KLL sketch.

****
For the Java characterization tests, we have "standardized" on having small configuration files which define the key parameters of the test.  These are simple text files (https://github.com/apache/incubator-datasketches-characterization/tree/master/src/main/resources) of key-value pairs.  We don't have any centralized definition of these pairs, just that they are human readable and intelligible.  They are different for each type of sketch.

For the C++ tests, we don't have a collection of config files yet (this is one of our TODOs), but the same kind of parameters are set in the code itself.

We will likely want to set up a separate directory for your characterization tests.

I hope you find this helpful.

Cheers,

Lee.

On Sun, May 10, 2020 at 10:05 AM Michael Himes <mh...@knights.ucf.edu>> wrote:
The code is in a good state now.  It can take individual values, lists, or Numpy arrays as input, and it returns back Numpy arrays.  There are some additional features, like being able to specify which sketches the user wants to, e.g., get quantiles for.
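
For illustration, a hypothetical session with the wrapper class described above might look like the following; the class name comes from this thread, but the constructor signature, argument order, and return shapes are assumptions and may not match the actual branch.

import numpy as np
from datasketches import kll_sketches   # class name used in this thread; final packaging may differ

k, d = 160, 100
s = kll_sketches(k, d)                  # one KLL sketch per dimension (argument order assumed)

for _ in range(1000):
    s.update(np.random.randn(d))        # a single call updates all d sketches

medians = s.get_quantiles(0.5)          # per-dimension estimates, returned as a numpy array
mins = s.get_min_values()               # per-dimension minimums (renamed from get_min_value)
# the branch also allows asking for results from only a subset of the sketches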

But, I have only done minor testing with uniform and normal distributions.  I'd like to put it through more extensive testing (and some documentation) before releasing it, and it sounds like your characterization tests are the way to go -- it's not science if it's not reproducible!  Is there a standard set of tests for this purpose?  If not, are there standard tests that have been used for the existing codebase?

Michael
________________________________
From: leerho <le...@gmail.com>>
Sent: Saturday, May 9, 2020 7:21 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

This is great.  The first step is to get your project working!  Once you think you are ready, it would be really useful if you could do some characterization testing in the NumPy environment. Characterization tests are what we run to fully understand how a sketch performs over a range of parameters and using thousands to millions of trials.  You can see some of the accuracy and speed performance plots of various sketches on our website.  Sometimes these can take hours to run.  We typically use synthetic data to drive our characterization tests to make them reproducible.

Real data can also be used and one comparison test I would recommend is comparing how long it takes to get approximate results using sketches versus how long it would take to get exact results using brute force methods.  The bigger the data set is the better :)
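
As one sketch of that kind of comparison (assuming the existing kll_floats_sketch Python binding), the script below times an exact NumPy quantile against a streamed KLL estimate.  The absolute numbers will depend heavily on the Python-loop overhead discussed elsewhere in this thread, so treat it as the shape of the test rather than a benchmark.

import time
import numpy as np
from datasketches import kll_floats_sketch

n = 2**20                                   # kept small here; the bigger the data set, the better
data = np.random.default_rng(1).standard_normal(n).astype(np.float32)

t0 = time.time()
exact = np.quantile(data, 0.99)             # brute force: needs the full data set in memory
t_exact = time.time() - t0

sk = kll_floats_sketch(200)
t0 = time.time()
for x in data:                              # streaming: one pass, bounded memory
    sk.update(float(x))
approx = sk.get_quantile(0.99)
t_sketch = time.time() - t0

print(exact, approx, t_exact, t_sketch)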

We don't have much experience with NumPy so this will be a new environment for us.  But before you get too deep into this please get us involved.  We have been characterizing these streaming algorithms for a number of years, and would like to help you.

Cheers,

Lee.

On Sat, May 9, 2020 at 2:18 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
I'm not quite sure what being a committer entails, but yeah I'm happy to contribute.  I can't commit a lot of time to working on it, but with how things went for KLL I don't think it will take a lot of time for the other sketches if they are formatted in a similar manner.  Getting this library integrated into numpy/scipy would be awesome; I'm sure I could get some others in my field to begin using it.

Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Saturday, May 9, 2020 5:06 PM
To: Michael Himes <mh...@knights.ucf.edu>>; dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

This is just awesome!   Would you be interested in becoming a committer on our project?  It is not automatic, but we could work with you to bring you up to speed on the other sketches in the library.  If you could help us integrate DataSketches into NumPy and possibly SciPy (not sure if this is necessary) it would be a very significant contribution and we would definitely want you to be part of our community!

Thanks,

Lee.

On Sat, May 9, 2020 at 1:41 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Lee,

Thanks for the notice, I went ahead and subscribed to the list.

As for Jon's email, this is actually what I have currently implemented!  Once I finish ironing out a couple improvements, I'm going to move some code around to follow the existing coding style, put it on Github, and submit a pull request.

Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Saturday, May 9, 2020 4:22 PM
To: Michael Himes <mh...@knights.ucf.edu>>
Subject: Fwd: Permission to use KLL streaming-quantiles code in free open-source academic software

Hi Michael,
I don't think you saw this email as I doubt you are subscribed to our dev@datasketches.apache.org<ma...@datasketches.apache.org> email list.

We would like to have you as part of our larger community, as others might also have suggestions on how to move your project forward.
You can subscribe by sending an empty email to dev-subscribe@datasketches.apache.org<ma...@datasketches.apache.org>.

Lee.

---------- Forwarded message ---------
From: Jon Malkin <jo...@gmail.com>>
Date: Thu, May 7, 2020 at 4:11 PM
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software
To: <de...@datasketches.apache.org>>
Cc: Lee Rhodes <lr...@verizonmedia.com>>, Edo Liberty <ed...@gmail.com>>, edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>


We're using pybind11 to get a C++ interface with python (vs raw C). The wrappers themselves are quite thin, but they do have examples of calling functions defined in the wrapper as opposed to only the sketch object.

I believe the easiest way to do this will be to define a pretty simple C++ object and create a pybind wrapper for it.  That object would contain a std::vector<kll_sketch>.  Then you'd define an update method for your custom object that iterates through a numpy array and calls update() on the appropriate sketch. You'd also want to define something similar for get_quantile() or whatever other methods you need that iterates through that vector of sketches and returns the result in a numpy array.

That's a pretty lightweight object. And then you'd use a similar thin pybind wrapper around it to make it play nicely with python. Since our C++ library is just templates, you'd end up with a free-standing library, with no requirement that the base datasketches library be involved.
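
For concreteness, here is a pure-Python analogue of the object Jon describes; the real implementation would live in C++ behind the pybind11 wrapper (which is what makes it fast), and the class and method names here are only illustrative.

import numpy as np
from datasketches import kll_floats_sketch

class VectorKll:
    """Python analogue of the proposed C++ object holding a std::vector<kll_sketch>."""
    def __init__(self, k, num_dims):
        self._sketches = [kll_floats_sketch(k) for _ in range(num_dims)]

    def update(self, vec):
        # vec: a 1-D numpy array with one value per dimension
        for sk, x in zip(self._sketches, vec):
            sk.update(float(x))

    def get_quantiles(self, rank):
        # one estimate per dimension, collected into a numpy array
        return np.array([sk.get_quantile(rank) for sk in self._sketches])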

  jon

On Thu, May 7, 2020 at 1:08 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
I would be happy to share whatever I come up with (if anything).  The lack of a Numpy/Scipy implementation is what led me to the DataSketches library; it would be very useful to me and others if it were a part of Numpy/Scipy.

For what it's worth, passing in a Numpy array and manipulating it from the C++ side is quite easy.  On the other hand, figuring out how to spawn m sketches and pass the values along to them looks like it'll be more challenging; there is a lot of code here, and it'll take some time for me to familiarize myself with it.

Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Thursday, May 7, 2020 12:00 PM
To: Michael Himes <mh...@knights.ucf.edu>>
Cc: Edo Liberty <ed...@gmail.com>>; dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>; edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

If you do figure out how to do this, it would be great if you could share it with us.  We would like to extend it to other sketches and submit it as added functionality to NumPy.  I have been looking at the NumPy and SciPy libraries and have not found anything close to what we have.

Lee.


On Thu, May 7, 2020 at 7:08 AM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Lee, Jon,

Thanks for the information.  I tried to vectorize things this morning and ran into that exact problem -- since the offsets can differ, it leads to slices of different lengths, which wouldn't be possible to store as a single Numpy array.

Lee, your understanding of my problem is spot on.  n vectors of size m, where all m elements of each vector are a float (no NaNs or missing values).  I am interested in quantiles at rank r for each of the m streams.  Only 1 sketch will operate simultaneously, saving/loading the sketch is not required (though it would be a nice feature), and sketches would not need to be merged (no serialization/deserialization).

Not surprisingly, it looks like your original suggestion of handling this on the C++ side is the way to go.  Once I have time to dive into the code, my plan is to write something that implements what you described in the earlier email.

Thanks,
Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Wednesday, May 6, 2020 10:43 PM
To: Michael Himes <mh...@knights.ucf.edu>>
Cc: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>; Edo Liberty <ed...@gmail.com>>; edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>

Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Michael,

One of my colleagues, Jon Malkin, pointed out that the vector-KLL will not work for another reason: for each dimension, the choice of whether to delete the odd or even values in the compactor must be random and independent of the other dimensions.  Otherwise you might get unwanted correlation effects between the dimensions.

This is another argument that you should have independent compactors for each dimension.  So you might as well stick with individual sketches for each dimension.

Lee.

On Wed, May 6, 2020 at 4:39 PM Lee Rhodes <lr...@verizonmedia.com>> wrote:
Michael,

Allow me to back up for a moment to make sure I understand your problem.

You have a large number of large vectors of the form V_n = {x_i}:  n vectors of size m, where x is a number and x_i is the ith element, or equivalently, the ith dimension.

Assumptions:

  *   All vectors, V, are of the same size m.
  *   All elements, x_i, are valid numbers of the same type. No missing values, and if you are using floats, this means no NaNs.

In aggregate, the n vectors represent m independent distributions of values.

Your task is to be able to obtain m quantiles at rank r in a single query.

****
To do this, using your idea, would require vectorization of the entire sketch and not just the compactors.  The inputs are vectors, and the results of operations such as getQuantile(r), getQuantileUpperBound(r), and getQuantileLowerBound(r) are also vectors.

This sketch will be a large data structure, which leads to more questions ...

  *   Do you anticipate having many of these vectorized sketches operating simultaneously?
  *   Is there any requirement to store and later retrieve this sketch?
  *   Or, the nearly equivalent question: Do you require merging of these sketches (across clusters, for example)?  Which also means serialization and deserialization.

I am concerned that this vector-quantiles sketch would be limited in the sense that it may not be as widely applicable as it could be.

Our experience with real data is that it is ugly, with missing values, NaNs, nulls, etc., which means we would not be able to vectorize the compactor.  Each dimension i would need a separate independent compactor because the compaction times will vary depending on missing values or NaNs in the data.

Spacewise, I don't think having separate independent sketches for each dimension would be much smaller than vectorizing the entire sketch, because the internals of the existing sketch are already quite space efficient leveraging compact arrays, etc.

As a first step I would favor figuring out how to access the NumPy data structure on the C++ side, having individual sketches for each dimension, and doing the iterations updating the sketches in C++.  This also has the advantage of reusing code that already exists, and it would automatically pick up any improvements to the sketch code over time.  In addition, it could be a prototype of how to integrate other sketches into the NumPy ecosystem.

A fully vectorized sketch would be a separate implementation and would not be able to take advantage of these points.

Lee.

On Wed, May 6, 2020 at 2:47 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Lee,

I don't think there is a problem with the DataSketches library, just that it doesn't support what I am trying to do -- looking in the documentation, it only supports streams of ints or floats, and those situations work fine for me.  Here's what I did:
- began with the KLL test .py file: https://github.com/apache/incubator-datasketches-cpp/blob/master/python/tests/kll_test.py
- replaced line 30 with kll.update(np.ones(10) * randn())  to have a Numpy array of 10 identical values.
- ran the code

This leads to the following error, as expected:
TypeError: update(): incompatible function arguments. The following argument types are supported:
    1. (self: datasketches.kll_floats_sketch, item: float) -> None

Invoked with: <datasketches.kll_floats_sketch object at 0x7f1e128989d0>, array([-1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424,
       -1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424])

It's not coded to support Numpy arrays, so it complains.  What I would ideally like to have happen in this scenario is for it to treat each element in the array as a separate stream.  Then, later when getting a given quantile, it would give 10 values, one for each stream.  I don't see an easy approach to implementing this on the Python side besides a very slow iterative approach, and admittedly my C++ is quite rusty, so I haven't looked into the codebase to see how I might modify things there to support this functionality.

Re: the streaming-quantiles code being easily modified, I believe the only necessary changes would be making the Compactor class a subclass of numpy.ndarray rather than list, and implementing the list-specific methods that are used, like .append().  Then it isn't necessary to loop over the streams, since we can make use of Numpy's broadcasting, which will handle the looping in its C++ code, as you mentioned.  I'll work on this and see if it really is as straightforward as it seems.
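
To show roughly what the broadcasting idea looks like, here is a toy, stand-alone compaction step over an (items x dimensions) buffer.  It is not the actual streaming-quantiles Compactor, and it already includes the independent per-dimension odd/even choice that Lee's reply above notes is required.

import numpy as np

def compact(buffer, rng):
    # buffer: shape (num_items, m), num_items assumed even; one column per stream/dimension
    buffer = np.sort(buffer, axis=0)                     # sort each dimension independently
    offsets = rng.integers(0, 2, size=buffer.shape[1])   # independent odd/even choice per dimension
    rows = offsets[None, :] + 2 * np.arange(buffer.shape[0] // 2)[:, None]
    return np.take_along_axis(buffer, rows, axis=0)      # keep every other item in each column

rng = np.random.default_rng(0)
kept = compact(rng.standard_normal((8, 5)), rng)         # 8 items x 5 dimensions -> 4 x 5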

If you have any advice on how to use DataSketches for my problem, I'm certainly open to that.

Thanks,
Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Wednesday, May 6, 2020 4:37 PM
To: Michael Himes <mh...@knights.ucf.edu>>; dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Cc: Edo Liberty <ed...@gmail.com>>; edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Michael,

Thank you for considering the DataSketches library.   I am adding this thread to our dev@datasketches.apache.org<ma...@datasketches.apache.org> so that our whole team can contribute to finding a solution for you.

WRT the error you experienced, please help us help you by sharing with us what the exact error was.

We are about to release a major upgrade to the DataSketches C++/Python product in the next few weeks.  We have fixed a number of stability issues and bugs, which may solve the problem.  Nonetheless, we want to work with you to get your problem solved.

Updating 1e5 sketches in a system is not a problem in Java or C++.   We have real-time systems today that generate and process over 1e9 sketches every day.  Unfortunately our experience tells us that looping in Python code will be 10 to 100 times slower than Java or C++.  This is because the code would have to switch from Python to C++ for every vector element.

By comparison, the streaming-quantiles code could be easily modified to use Numpy arrays and operate on vectors.

I would like to understand more about what you have in mind that would be "easily modified".

NumPy achieves its speed performance by doing all of the matrix operations in pre-compiled C++ code.  To achieve best performance, we would want to read and loop through the NumPy data structure on the C++ side leveraging the C++ DataSketches library directly.  I am not sure what would be involved to actually accomplish that.

But first we need to get your Python + NumPy code working correctly with our library so we can find out what its actual performance is.

Cheers,

Lee.





On Wed, May 6, 2020 at 12:10 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Edo, Lee,

Thanks for the prompt response.  I looked at the datasketches library, and while it seems to have a lot more features, it looks like it'll be a lot more difficult to get it to work for my desired use case.

My problem is that I need quantiles for each element of a vector (length on the order of 1e4 -- 1e5), for some finite stream of vectors (on the order of 1e6 -- 1e8).  I tried using datasketches's KLL with Numpy arrays, but it throws an error, so it doesn't seem like datasketches handles this situation currently.

To use datasketches, I think I would need to instantiate 1 object per vector element, and I suspect this will slow things down considerably due to iterating over the objects when each vector is processed.  By comparison, the streaming-quantiles code could be easily modified to use Numpy arrays and operate on vectors.  I ran a few unit tests on both codes and found equivalent behavior, as expected.

Do you have any recommendation(s) for this situation?  Are there known limitations of the streaming-quantiles code that would cause issues for my use case?  Are the other methods offered in datasketches 'better' than the KLL implemented in streaming-quantiles?  I'm quite out of my area of expertise, so I appreciate any advice you can offer, and I will of course acknowledge it in the publication.

Best,
Michael

________________________________
From: Edo Liberty <ed...@gmail.com>>
Sent: Tuesday, May 5, 2020 8:09 PM
To: Lee Rhodes <lr...@verizonmedia.com>>; Michael Himes <mh...@knights.ucf.edu>>
Cc: edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

+Lee

Hi Michael, Thanks for reaching out.
While you can certainly do that, I recommend using the Python-bound datasketches library. It will be more robust, faster, and bug free than my code :)

On Tue, May 5, 2020 at 14:11 Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Edo,

I'm currently working on a Python package for machine-learning-accelerated exoplanet modeling.  It is free and open source (see here if you're curious: https://github.com/exosports/HOMER), and it's meant purely for reproducible academic research.

I'm adding some new features to the software, and one of them requires computing quantiles for a data set that cannot fit into memory.  After searching around for different methods to do this, your KLL method seemed to be a good option in terms of speed and space requirements.

Rather than reinvent the wheel and code my own implementation of the method from scratch, I was wondering if you'd be willing to allow me to use your code?  I don't see a license, so I wanted to make sure you're okay with this.  I could implement it as a submodule within my repo, or I could only include the kll.py file and add some additional comments pointing to your repo and such, whichever you prefer.

Best,
Michael
--
From my cell phone.

Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Posted by leerho <le...@gmail.com>.
Excellent work!

On Tue, May 19, 2020 at 4:04 PM Jon Malkin <jo...@gmail.com> wrote:

> I also used k=160, so in this case we matched nicely. And the bunches of
> 2^5 or 2^7 you were testing are exactly what I meant when referring to
> batched inputs. So that's good news.
>
> I'll take a more careful look through the code -- there was something with
> update using arrays of templated type A which was always cast to double,
> for instance. But this is certainly promising.
>
>   jon
>
> On Tue, May 19, 2020 at 3:32 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
>> Great tests (especially with the ordering), Jon!
>>
>> I did some scaling tests for dimensionality (1, 10, and 100), and this is
>> where I think the Numpy version shows its benefits.  I performed a test
>> similar to your setup:
>> - each sketch has k = 160 (unsure what you used for this value, if it
>> matters)
>> - 2^25 draws from a normalized Gaussian distribution (numpy.random.normal)
>> - get_quantiles(0.5)
>>
>> d=1    -- 84 s (this is the 123 s case you ran)
>> d=10   -- 88 s
>> d=100  --  294 s
>> d=1000 -- 2298 s (did this one for fun, but there is a lot of variability
>> in runtime)
>>
>> Note that I did not use a single-value method, just the Numpy version.
>> Also, I checked the compute cost of the Python loop, and it's about 1
>> second, so most of that ~80 seconds is the communication between Python and
>> C++.  The scaling relation looks to be better than linear, but there needs
>> to be a few more tests here to really determine that.
>>
>> But, as Lee pointed out, there is non-negligible overhead from crossing
>> the bridge between Python and C++.  It's small, but when doing it 2^25
>> times it adds up.  The Numpy implementation allows you to cross that bridge
>> much less often, albeit at the cost of some extra time programming that
>> part.  If I set up a queue that holds 2^5 values and then updates it, it's
>> quite a bit better.  Here are the results for the same dimensions as before:
>>
>> d=1   -- 8 s
>> d=10  -- 31 s
>> d=100 -- 257 s
>>
>> So, even with a small queue of 32 values, we see that a single sketch
>> using kll_sketches is faster than a kll_sketch by a factor of 2-3.  And
>> with the batch set to 2^7 values (this is how I use it in my project):
>> d=1   -- 4.2 s
>> d=10  -- 27 s
>> d=100 -- 251 s
>>
>> The speed gain doesn't seem to scale with dimensionality, but I think
>> that has more to do with the compute overhead of generating the data since
>> Numpy tends to be faster when working in 1D vs multiple dimensions.  But we
>> can see that it's possible to get runtimes much closer to C++ runtimes than
>> would be expected.
>>
>> Michael
>> ------------------------------
>> *From:* Jon Malkin <jo...@gmail.com>
>> *Sent:* Tuesday, May 19, 2020 4:58 PM
>> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> Well, one thought was maybe we could always use the vectorized kll in
>> python and make it (relatively) easy to have it work with only 1 dimension.
>> It looks like there's still a non-trivial performance hit from that. But
>> wow.. I realized I could try something simple like reversing the
>> declaration order of single-update vs vector-update in the wrapper class.
>> And that dropped it to 35s!
>>
>> With that, it may be worth exploring a unified wrapper that handles
>> single items or vectors.
>>
>>   jon
>>
>> On Tue, May 19, 2020 at 1:52 PM leerho <le...@gmail.com> wrote:
>>
>> We had a similar issue in Java trying to use JNI to access C code.  Every
>> transition across the "boundary" between Java and C took from 10 to 100
>> microseconds.  This made the JNI option pretty useless from our
>> standpoint.
>>
>> I don't know python that well, but I could well imagine that there may be
>> a similar issue here in moving data between Python and C++.
>>
>> That being said, compared to brute-force computation of these types of
>> queries in Python vs using even these (what we consider slow performing)
>> sketches in Python still may be a huge win.
>>
>> Lee.
>>
>>
>>
>> On Tue, May 19, 2020 at 1:28 PM Jon Malkin <jo...@gmail.com> wrote:
>>
>> I tried comparing the performance of the existing floats sketch vs the
>> new thing with a single dimension. And then I made a second update method
>> that handles a single item rather than creating an array of length 1 each
>> time. Otherwise, the scripts were as identical as possible. I fed in 2^25
>> gaussian-distributed values and queried for the mean to force some
>> computation on the sketch. I think get_quantile(0.5) vs
>> get_quantiles(0.5)[0][0] was the only difference,
>>
>> Existing kll_floats_sketch: 31s
>> kll_floatarray_sketches: 123s
>> with single-item update: 80s
>>
>> Same test in c++: 1.7s  (I can get it to 1.4s but that's using a worse
>> RNG so this seemed more fair)
>>
>> I didn't try anything with trying to batch updates, even though in theory
>> the new object can support that. This was more a test to see the
>> performance impact of using it for all kll sketches.
>>
>> At some level, if you're already ok taking the speed hit for python vs
>> C++ then maybe it doesn't matter. But >2x still seems significant to me.
>>
>>   jon
>>
>> On Thu, May 14, 2020 at 6:54 PM Michael Himes <mh...@knights.ucf.edu>
>> wrote:
>>
>> Great, I'll be submitting the pull request shortly.  The codebase I'm
>> working with doesn't have any of the changes made in the past week or so,
>> hopefully that isn't too much of a hassle to merge.
>>
>> As an aside, my employer encourages us to contribute code to libraries
>> like this, so I'm happy to work on additional features for the Python
>> interface as needed.
>>
>> Michael
>> ------------------------------
>> *From:* Jon Malkin <jo...@gmail.com>
>> *Sent:* Thursday, May 14, 2020 6:56 PM
>> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> We've been polishing things up for a release, so that was one of several
>> things that we fixed over the last several days. Thank you for finding it!
>>
>> Anyway, if you're generally happy with the state of things (and are
>> allowed to under any employment terms), I'd encourage you to create pull
>> request to merge your changes into the main repo. It doesn't need to be
>> perfect as we can always make changes as part of the PR review or
>> post-merge.
>>
>> Thanks,
>>   jon
>>
>>
>> On Mon, May 11, 2020 at 2:25 PM Michael Himes <mh...@knights.ucf.edu>
>> wrote:
>>
>> Thanks for taking a look, Jon.
>>
>> I pushed an update that address 2 & 4.
>>
>> #3 is actually something I had a question about. I've tested passing
>> numpy.nan into the update function, and it doesn't appear to break anything
>> (min, max, etc all still work correctly).  However, the reported number of
>> items per sketch counts the nan entries.  Is this the expected behavior, or
>> should the get_n() method return a number that does not count the nans it
>> has seen?  I expected the latter, so I'm worried that numpy's nan is being
>> treated differently.
>>
>> Michael
>> ------------------------------
>> *From:* Jon Malkin <jo...@gmail.com>
>> *Sent:* Monday, May 11, 2020 4:32 PM
>> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> I didn't look in super close detail, but the code overall looks pretty
>> good. Comments are below.
>>
>> Note that not all of these necessarily need changes or replies. I'm just
>> trying to document things we'll want to think about for keeping the library
>> general-purpose (and we can always make changes after merging, of course).
>>
>> 1. I worry the name kll_sketches is confusingly similar to kll_sketch.
>> Maybe vector_kll_sketches? But if there's a way to extend KLL in the future
>> to operate on an entire vector at a time (vs treating each dimension
>> independently) that'd become confusing. I think an inherently vectorized
>> version would be a very different beast, but I always worry I'm not being
>> imaginative enough. If merging into the Apache codebase, I'd probably wait
>> to see what the file looks like with the renaming before a final decision
>> on moving to its own file.
>>
>> 2. What happens if the input to update() has >2 dimensions? If that'd be
>> invalid, we should explicitly check and complain. If it'll Do The Right
>> Thing by operating on the first 2 dimensions (meaning correct indices)
>> that's fine, but otherwise should probably complain.
>>
>> 3. Can this handle sparse input vectors? Not sure how important that is
>> in general, even if your project doesn't require it. kll_sketch will ignore
>> NaNs, so those appearing would mean the number of items per sketch can
>> already differ.
>>
>> 4. I'd probably eat the very slightly increased space and go with 32 bits
>> for the number of dimensions (aka number of sketches). If trying to look at
>> a distribution of values for some machine learning application, it'd be
>> easy to overflow 65k dimensions for some tasks.
>>
>> 5. I imagine you've realized that it's easiest to do unit tests from
>> python in this case. That's another advantage of having this live in the
>> wrapper.
>>
>> 6. Finally, that assert issue is already obsolete :). Asserts were
>> converted to if/throw exceptions late last week. It'll be flagged as a
>> conflict in merging, so no worries for now.
>>
>> Looking good at this point. And as I said, not all of these need changes
>> or comments from you.
>>
>>   jon
>>
>> On Mon, May 11, 2020 at 7:09 AM Michael Himes <mh...@knights.ucf.edu>
>> wrote:
>>
>> Understood, I went ahead and moved the new class to the kll_wrapper.cpp
>> file -- I'll leave it to you to decide if it's better as its own file.
>>
>> Also, while gcc 7.4.0 compiles the code without issue, using gcc 7.5.0
>> throws errors regarding the assert calls in kll_sketch_impl.hpp.  I added
>> an include of assert.h there and then it compiled without issue.  It's
>> possible that other compilers will also complain about that, so maybe this
>> is a good update to the main branch.
>>
>> Michael
>> ------------------------------
>> *From:* Jon Malkin <jo...@gmail.com>
>> *Sent:* Sunday, May 10, 2020 10:47 PM
>> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> My only comment without having looked at actual code is that the new
>> class would be more appropriate in the python wrapper. Maybe even drop it
>> in as its own file, as that would decrease recompile time a bit when
>> debugging (that's pybind's suggestion, anyway). Probably not a huge
>> difference with how light these wrappers are.
>>
>> If this is something that becomes widely used, to where we look at
>> pushing it into the base library, we'd look at whether we could share any
>> data across sketches. But we're far from that point currently. It'd be nice
>> to need to consider that.
>>
>>   jon
>>
>> On Sun, May 10, 2020, 7:33 PM leerho <le...@gmail.com> wrote:
>>
>> Michael,  this has been a great interchange and certainly will allow us
>> to move forward more quickly.
>>
>> Thank you for working on this on a Mother's Day Sunday!
>>
>> I'm sure Alex and Jon may have more questions, when they get a chance to
>> look at it starting tomorrow.
>>
>> Cheers, and be safe and well!
>>
>> Lee.
>>
>> On Sun, May 10, 2020 at 6:25 PM Michael Himes <mh...@knights.ucf.edu>
>> wrote:
>>
>> Re: testing, so far I've just done glorified unit tests for uniform and
>> normal distributions of varying sizes.  I plan to do some timing tests vs
>> the existing single-sketch Python class to see how it compares for 1, 10,
>> and 100 streams.
>>
>> 1. That makes sense.  One option to allow full Numpy compatibility but
>> without requiring a Python user to use Numpy would be to return everything
>> as lists, rather than Numpy arrays.  Numpy users could then convert those
>> lists into arrays, and non-Numpy users would be unaffected (aside from
>> needing the pybind11/numpy.h header).  Alternatively, some flag could be
>> set when instantiating the object that would control whether things are
>> returned as lists or arrays, though this still requires the numpy.h header
>> file.
>>
>> 2. I didn't change the kll_sketch code, I only defined a new (wrapper)
>> class called kll_sketches, which spawns a user-specified number of
>> sketches.  Each of those sketches are kll_sketch objects and uses all of
>> the existing code for that.  For fast execution in Python, the parallel
>> sketches must be spawned in C++, but the existing Python object could only
>> spawn a single sketch since it wraps the kll_sketch class.  Perhaps the
>> kll_sketches class would be better placed in the python/src/kll_wrapper.cpp
>> file?  I suppose you wouldn't need this class if you weren't using Python.
>>
>> 3. Yes, SerDe is very straight-forward here.  I've marked some stuff as
>> todo's, and that is one of them -- the plan is to do like you described and
>> call the relevant kll_sketch method on each of the sketches and return that
>> to Python in a sensible format.  For deserialization, it would just iterate
>> through them and load them into the kll_sketches object.  I don't require
>> it for my project, so I didn't bother to wrap that yet -- I'll take a look
>> sometime this week after I finish my work for the day, shouldn't take long
>> to do.
>>
>> 4. That makes sense.  Does using Numpy complicate that at all?  My
>> thought is that since under the hood everything is using the existing
>> kll_sketch class, it would have full compatibility with the rest of the
>> library (once SerDe is added in).
>>
>> Michael
>> ------------------------------
>> *From:* leerho <le...@gmail.com>
>> *Sent:* Sunday, May 10, 2020 8:42 PM
>> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> Thanks for the link to your code.  My colleagues, Jon and Alex, will take
>> a closer look this next week.  They wrote this code so they are much closer
>> to it than I.
>>
>> What you have done so far makes sense for you as you want to get this
>> working in the NumPy environment as quickly as possible.  As soon as we
>> start thinking about incorporating this into our library other concerns
>> become important.
>>
>> 1. Adding API calls is the recommended way to add functionality (like
>> NumPy) to a library.  We cannot change API calls in a way that is only
>> useful with NumPy, because it would seriously impact other users of the
>> library that don't need NumPy.  If both sets of calls cannot simultaneously
>> exist in the same sketch API, then we need to consider other alternatives.
>>
>> 2.  Based on our previous discussions, I didn't envision that you would
>> have to change the kll_sketch code itself other than perhaps a "wrapper"
>> class that enables vectorized input to a vector of sketches and a
>> vectorized get result that creates a vector result from a vector of
>> sketches.  This would isolate the changes you need for NumPy from the
>> sketch itself.  This is also much easier to support, maintain and debug.
>>
>> 3. If you don't change the internals of the sketch then SerDe becomes
>> pretty straightforward. I don't know if you need a single serialization
>> that represents a full vector of sketches,  but if you do, then I would
>> just iterate over the individual serdes and figure out how to package it.
>> I really don't think you want to have to rewrite this low-level stuff.
>>
>> 4. Binary compatibility is critically important for us and I think will
>> be important for you as well.  There are two dimensions of binary
>> compatibility: history and language.  This means that a kll sketch
>> serialized from Java can be successfully read by C++ and vice versa.
>> Similarly, a kll sketch serialized today will be able to be read many years
>> from now.     Another aspect of this would mean being able to collect, say,
>> 100 sketches that were not created using the NumPy version, and being able
>> to put them together in a NumPy vector; and vice versa.
>>
>> I hope all of this makes sense to you.
>>
>> Cheers,
>>
>> Lee.
>>
>>
>>
>> On Sun, May 10, 2020 at 4:21 PM leerho <le...@gmail.com> wrote:
>>
>> Michael,
>> This is great!  What testing have you been able to do so far?
>>
>>
>> On Sun, May 10, 2020 at 3:31 PM Michael Himes <mh...@knights.ucf.edu>
>> wrote:
>>
>> Lee,
>>
>> Thanks for all of that information, it's quite helpful to get a better
>> understanding of things.
>>
>> I've put the code on Github if you'd like to take a look:
>> https://github.com/mdhimes/incubator-datasketches-cpp
>>
>> Changes are
>> - new class in kll/include/kll_sketch.hpp, w/ associated constructor in
>> kll/include/kll_sketch_impl.hpp.  This class spawns a specified number of
>> sketches.
>> - new Python interface functions in python/src/kll_wrapper.cpp
>>
>> The only new dependency introduced is the pybind11/numpy.h header file.
>> The new Numpy-compatible Python classes retain identical functionality to
>> the existing classes (with minor changes to method names, e.g.,
>> get_min_value --> get_min_values), except that I have not yet implemented
>> merging or (de)serialization.  These would be straight-forward to
>> implement, if needed.
>>
>> Re: characterization tests, I'll take a look at those tests you linked to
>> and see about running them, time and compute resources permitting.
>>
>> Michael
>> ------------------------------
>> *From:* leerho <le...@gmail.com>
>> *Sent:* Sunday, May 10, 2020 5:32 PM
>> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> Michael,
>>
>> Is there a place on GitHub somewhere where I could look at your code so
>> far?  The reason I ask, is before you do a PR, we would like to determine
>> where a contribution such as this should be placed.
>>
>> Our library is split up among different repositories, determined by
>> language and dependencies.  This keeps the user downloads smaller and more
>> focused.   We have two library repos for the core sketch algorithms, one
>> for Java and one for C++/Python, where the dependencies are very lean,
>> which simplifies integration into other systems.  We have separate repos
>> for adaptors, which depend on one of the core repos. On the Java side, we
>> have separate repos for adaptors for Apache Hive and Apache Pig, as the
>> dependencies for each of these are quite large.  For C++, we have a
>> dedicated repo for the adaptors for PostgreSQL.
>>
>> Some of our adaptors are hosted with the target system.  For example, our
>> Druid adaptors were contributed directly into Apache Druid.
>>
>> I assume your code has dependencies on Python, NumPy and
>> DataSketches-cpp. It is not clear to me at the moment whether we should
>> create a separate repo for this or have a separate group of directories in
>> our cpp repo.
>>
>> ****
>> We have a separate repo for our characterization code, which is not
>> formally "released" as an Apache release.  It exists because we want others
>> to be able to reproduce (or challenge) our claims of speed performance or
>> accuracy.  It is the one repo where we have all languages and many
>> different dependencies.  The coding style is not as rigorous or as well
>> documented as our repos that do have formal releases.
>>
>> Characterization testing is distinctly different from Unit Tests, which
>> basically checks all the main code paths and makes sure that the program
>> works as it should.  The key metric is code coverage and Unit Tests should
>> be fast as it is run on every check-in of new code.  Characterization is
>> also different from Integration Testing, which is testing how well the code
>> works when integrated into larger systems.
>>
>> Characterization tests are unique to our kind of library. Because our
>> algorithms are probabilistic in nature, in order to verify accuracy or
>> speed performance we need to run many thousands of trials to eliminate
>> statistical noise in the results.  And when the data is large, this can
>> take a long time.  You can peruse our website for many examples as all the
>> plots result from various characterization studies.  What appears on the
>> website is but a small fraction of all the testing we have done.
>>
>> There are no "standard" tests as every sketch is different so we have to
>> decide what is important to measure for a particular sketch, but the basic
>> groups are *speed* and *accuracy*.
>>
>> For speed there are many possible measurements, but the basic ones are
>> update speed, merge speed, Serialization / Deserialization speed, get
>> estimate or get result speeds.
>>
>> For accuracy we want to validate that the sketch is performing within the
>> bounds of the theoretical error distribution.  We want to measure this
>> accuracy in the context of a stand-alone, purely streaming sketch and also
>> in the context of merging many sketches together.
>>
>> We also try to do these same tests comparing the results against other
>> alternatives users might have.  We have performed these same
>> characterizations on other publicly available sketches as well as against
>> traditional, brute-force approaches to solving the same problem.
>>
>> For the solution you have developed, we would depend on you to decide
>> what properties would be most important to characterize for users of this
>> solution.  It should be very similar to what you would write in a paper
>> describing this solution;  you want to convince the reader that this is
>> very useful and why.
>>
>> Since the first sketch you have leveraged is the KLL quantiles sketch, I
>> would think you would want some characterizations similar to what we did
>> for our studies
>> comparing our older quantiles sketch and the KLL sketch.
>>
>> ****
>> For the Java characterization tests, we have "standardized" on having
>> small configuration files which define the key parameters of the test.
>> These are simple text files
>> of key-value pairs.  We don't have any centralized definition of these
>> pairs, just that they are human readable and intelligible.  They are
>> different for each type of sketch.
>>
>> For the C++ tests, we don't have a collection of config files yet (this
>> is one of our TODOs), but the same kind of parameters are set in the code
>> itself.
>>
>> We will likely want to set up a separate directory for your
>> characterization tests.
>>
>> I hope you find this helpful.
>>
>> Cheers,
>>
>> Lee.
>>
>> On Sun, May 10, 2020 at 10:05 AM Michael Himes <mh...@knights.ucf.edu>
>> wrote:
>>
>> The code is in a good state now.  It can take individual values, lists,
>> or Numpy arrays as input, and it returns back Numpy arrays.  There are some
>> additional features, like being able to specify which sketches the user
>> wants to, e.g., get quantiles for.
>>
>> But, I have only done minor testing with uniform and normal
>> distributions.  I'd like to put it through more extensive testing (and some
>> documentation) before releasing it, and it sounds like your
>> characterization tests are the way to go -- it's not science if it's not
>> reproducible!  Is there a standard set of tests for this purpose?  If not,
>> are there standard tests that have been used for the existing codebase?
>>
>> Michael
>> ------------------------------
>> *From:* leerho <le...@gmail.com>
>> *Sent:* Saturday, May 9, 2020 7:21 PM
>> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> This is great.  The first step is to get your project working!  Once you
>> think you are ready, it would be really useful if you could do some
>> characterization testing in the NumPy environment. Characterization tests
>> are what we run to fully understand how a sketch performs over a range of
>> parameters and using thousands to millions of trials.  You can see some of
>> the accuracy and speed performance plots of various sketches on our
>> website.  Sometimes these can take hours to run.  We typically use
>> synthetic data to drive our characterization tests to make them
>> reproducible.
>>
>> Real data can also be used and one comparison test I would recommend is
>> comparing how long it takes to get approximate results using sketches
>> versus how long it would take to get exact results using brute force
>> methods.  The bigger the data set is the better :)
>>
>> We don't have much experience with NumPy so this will be a new
>> environment for us.  But before you get too deep into this please get us
>> involved.  We have been characterizing these streaming algorithms for a
>> number of years, and would like to help you.
>>
>> Cheers,
>>
>> Lee.
>>
>> On Sat, May 9, 2020 at 2:18 PM Michael Himes <mh...@knights.ucf.edu>
>> wrote:
>>
>> I'm not quite sure what being a committer entails, but yeah I'm happy to
>> contribute.  I can't commit a lot of time to working on it, but with how
>> things went for KLL I don't think it will take a lot of time for the other
>> sketches if they are formatted in a similar manner.  Getting this library
>> integrated into numpy/scipy would be awesome, I'm sure I could get some
>> others in my field to begin using it.
>>
>> Michael
>> ------------------------------
>> *From:* Lee Rhodes <lr...@verizonmedia.com>
>> *Sent:* Saturday, May 9, 2020 5:06 PM
>> *To:* Michael Himes <mh...@knights.ucf.edu>; dev@datasketches.apache.org
>> <de...@datasketches.apache.org>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> This is just awesome!   Would you be interested in becoming a committer
>> on our project?  It is not automatic, but we could work with you to bring
>> you up to speed on the other sketches in the library.  If you could help us
>> integrate DataSketches into NumPy and possibly SciPy (not sure if this is
>> necessary) it would be a very significant contribution and we would
>> definitely want you to be part of our community!
>>
>> Thanks,
>>
>> Lee.
>>
>> On Sat, May 9, 2020 at 1:41 PM Michael Himes <mh...@knights.ucf.edu>
>> wrote:
>>
>> Hi Lee,
>>
>> Thanks for the notice, I went ahead and subscribed to the list.
>>
>> As for Jon's email, this is actually what I have currently implemented!
>> Once I finish ironing out a couple improvements, I'm going to move some
>> code around to follow the existing coding style, put it on Github, and
>> submit a pull request.
>>
>> Michael
>> ------------------------------
>> *From:* Lee Rhodes <lr...@verizonmedia.com>
>> *Sent:* Saturday, May 9, 2020 4:22 PM
>> *To:* Michael Himes <mh...@knights.ucf.edu>
>> *Subject:* Fwd: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> Hi Michael,
>> I don't think you saw this email as I doubt you are subscribed to our
>> dev@datasketches.apache.org email list.
>>
>> We would like to have you as part of our larger community, as others
>> might also have suggestions on how to move your project forward.
>> You can subscribe by sending an empty email to
>> dev-subscribe@datasketches.apache.org.
>>
>> Lee.
>>
>> ---------- Forwarded message ---------
>> From: *Jon Malkin* <jo...@gmail.com>
>> Date: Thu, May 7, 2020 at 4:11 PM
>> Subject: Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>> To: <de...@datasketches.apache.org>
>> Cc: Lee Rhodes <lr...@verizonmedia.com>, Edo Liberty <
>> edo.liberty@gmail.com>, edo@edoliberty.com <ed...@edoliberty.com>
>>
>>
>> We're using pybind11 to get a C++ interface with python (vs raw C). The
>> wrappers themselves are quite thin, but they do have examples of calling
>> functions defined in the wrapper as opposed to only the sketch object.
>>
>> I believe the easiest way to do this will be to define a pretty simple
>> C++ object and create a pybind wrapper for it.  That object would contain a
>> std::vector<kll_sketch>.  Then you'd define an update method for your
>> custom object that iterates through a numpy array and calls update() on the
>> appropriate sketch. You'd also want to define something similar for
>> get_quantile() or whatever other methods you need that iterates through
>> that vector of sketches and returns the result in a numpy array.
>>
>> That's a pretty lightweight object. And then you'd use a similar thin
>> pybind wrapper around it to make it play nicely with python. Since our C++
>> library is just templates, you'd end up with a free-standing library, with
>> no requirement that the base datasketches library be involved.
>>
>>   jon
>>
>> On Thu, May 7, 2020 at 1:08 PM Michael Himes <mh...@knights.ucf.edu>
>> wrote:
>>
>> I would be happy to share whatever I come up with (if anything).  The
>> lack of a Numpy/Scipy implementation is what led me to the DataSketches
>> library, it would be very useful to myself and others if it were a part of
>> Numpy/Scipy.
>>
>> For what it's worth, passing in a Numpy array and manipulating it from
>> the C++ side is quite easy.  On the other hand, figuring out how to spawn m
>> sketches and pass the values along to that looks like it'll be more
>> challenging, there is a lot of code here and it'll take some time for me to
>> familiarize myself with it.
>>
>> Michael
>> ------------------------------
>> *From:* Lee Rhodes <lr...@verizonmedia.com>
>> *Sent:* Thursday, May 7, 2020 12:00 PM
>> *To:* Michael Himes <mh...@knights.ucf.edu>
>> *Cc:* Edo Liberty <ed...@gmail.com>; dev@datasketches.apache.org <
>> dev@datasketches.apache.org>; edo@edoliberty.com <ed...@edoliberty.com>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> If you do figure out how to do this, it would be great if you could share
>> it with us.  We would like to extend  it to other sketches and submit it as
>> an added functionality to NumPy.  I have been looking at the NumPy and
>> SciPy libraries and have not found anything close to what we have.
>>
>> Lee.
>>
>>
>> On Thu, May 7, 2020 at 7:08 AM Michael Himes <mh...@knights.ucf.edu>
>> wrote:
>>
>> Hi Lee, Jon,
>>
>> Thanks for the information.  I tried to vectorize things this morning and
>> ran into that exact problem -- since the offsets can differ, it leads to
>> slices of different lengths, which wouldn't be possible to store as a
>> single Numpy array.
>>
>> Lee, your understanding of my problem is spot on.  n vectors of size m,
>> where all m elements of each vector are a float (no NaNs or missing
>> values).  I am interested in quantiles at rank r for each of the m
>> streams.  Only 1 sketch will operate simultaneously, saving/loading the
>> sketch is not required (though it would be a nice feature), and sketches
>> would not need to be merged (no serialization/deserialization).
>>
>> Not surprisingly, it looks like your original suggestion of handling this
>> on the C++ side is the way to go.  Once I have time to dive into the code,
>> my plan is to write something that implements what you described in the
>> earlier email.
>>
>> Thanks,
>> Michael
>> ------------------------------
>> *From:* Lee Rhodes <lr...@verizonmedia.com>
>> *Sent:* Wednesday, May 6, 2020 10:43 PM
>> *To:* Michael Himes <mh...@knights.ucf.edu>
>> *Cc:* dev@datasketches.apache.org <de...@datasketches.apache.org>; Edo
>> Liberty <ed...@gmail.com>; edo@edoliberty.com <ed...@edoliberty.com>
>>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> Michael,
>>
>> One of my colleagues, Jon Malkin, pointed out that the vector-KLL will
>> not work for another reason and that is for each dimension, choosing
>> whether to delete the odd or even values in the compactor must be random
>> and independent of the other dimensions.  Otherwise you might get unwanted
>> correlation effects between the dimensions.
>>
>> This is another argument that you should have independent compactors for
>> each dimension.  So you might as well stick with individual sketches for
>> each dimension.
>>
>> Lee.
>>
>> On Wed, May 6, 2020 at 4:39 PM Lee Rhodes <lr...@verizonmedia.com>
>> wrote:
>>
>> Michael,
>>
>> Allow me to back up for a moment to make sure I understand your problem.
>>
>> You have a large number of large vectors of the form *V_n = {x_i}:*  *n*
>> vectors of size *m, *where *x* is a *number* and *x_i* is the *i*th
>> element, or equivalently, the *i*th dimension.
>>
>> Assumptions:
>>
>>    - All vectors, *V*, are of the same size *m.*
>>    - All elements, *x_i*, are valid numbers of the same type. No missing
>>    values, and if you are using *floats*, this means no *NaN*s.
>>
>> In aggregate, the *n* vectors represent *m* *independent* distributions
>> of values.
>>
>> Your task is to be able to obtain *m* quantiles at rank *r* in a single
>> query.
>>
>> ****
>> To do this, using your idea, would require vectorization of the entire
>> sketch and not just the compactors.  The inputs are vectors, the result of
>> operations such as getQuantile(r), getQuantileUpperBound(r),
>> getQuantileLowerBound(r), are also vectors.
>>
>> This sketch will be a large data structure, which leads to more questions
>> ...
>>
>>    - Do you anticipate having many of these vectorized sketches
>>    operating simultaneously?
>>    - Is there any requirement to store and later retrieve this sketch?
>>    - Or, the nearly equivalent question: Do you require merging of these
>>    sketches (across clusters, for example)?  Which also means serialization
>>    and deserialization.
>>
>> I am concerned that this vector-quantiles sketch would be limited in the
>> sense that it may not be as widely applicable as it could be.
>>
>> Our experience with real data is that it is ugly with missing values,
>> NaN, nulls, etc.  Which means we would not be able to vectorize the
>> compactor.  Each dimension *i* would need a separate independent
>> compactor because the compaction times will vary depending on missing
>> values or NaNs in the data.
>>
>> Spacewise, I don't think having separate independent sketches for each
>> dimension would be much smaller than vectorizing the entire sketch, because
>> the internals of the existing sketch are already quite space efficient
>> leveraging compact arrays, etc.
>>
>> As a first step I would favor figuring out how to access the NumPy data
>> structure on the C++ side, having individual sketches for each
>> dimension, and doing the iterations updating the sketches in C++.   It also
>> has the advantage of leveraging code that exists and it would automatically
>> be able to leverage any improvements to the sketch code over time.  In
>> addition, it could be a prototype of how to integrate other sketches into
>> the NumPy ecosystem.
>>
>> A fully vectorized sketch would be a separate implementation and would
>> not be able to take advantage of these points.
>>
>> Lee.
>>
>> On Wed, May 6, 2020 at 2:47 PM Michael Himes <mh...@knights.ucf.edu>
>> wrote:
>>
>> Hi Lee,
>>
>> I don't think there is a problem with the DataSketches library, just that
>> it doesn't support what I am trying to do -- looking in the documentation,
>> it only supports streams of ints or floats, and those situations work fine
>> for me.  Here's what I did:
>> - began with the KLL test .py file:
>> https://github.com/apache/incubator-datasketches-cpp/blob/master/python/tests/kll_test.py
>> - replaced line 30 with kll.update(np.ones(10) * randn())  to have a
>> Numpy array of 10 identical values.
>> - ran the code
>>
>> This leads to the following error, as expected:
>> TypeError: update(): incompatible function arguments. The following
>> argument types are supported:
>>     1. (self: datasketches.kll_floats_sketch, item: float) -> None
>>
>> Invoked with: <datasketches.kll_floats_sketch object at 0x7f1e128989d0>,
>> array([-1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424,
>>        -1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424])
>>
>> It's not coded to support Numpy arrays, therefore it complains.  What I
>> would ideally like to have happen in this scenario is it would treat each
>> element in the array as a separate stream.  Then, later when getting a
>> given quantile, it would give 10 values, one for each stream.  I don't see
>> an easy approach to implementing this on the Python side besides a very
>> slow iterative approach, and admittedly my C++ is quite rusty so I haven't
>> looked into the codebase to see how I might modify things there to support
>> this functionality.
>>
>> Re: the streaming-quantiles code being easily modified, I believe the
>> only necessary changes would be changing the Compactor class to be a
>> subclass of numpy.ndarray, rather than list, and implementing methods for
>> the list-specific methods that are used, like .append().  Then, it isn't
>> necessary to loop over the streams since we can make use of Numpy's
>> broadcasting, which will handle the looping in its C++ code, as you
>> mentioned.  I'll work on this and see if it really is as straight-forward
>> as it seems.
>>
>> If you have any advice on how to use DataSketches for my problem, I'm
>> certainly open to that.
>>
>> Thanks,
>> Michael
>> ------------------------------
>> *From:* Lee Rhodes <lr...@verizonmedia.com>
>> *Sent:* Wednesday, May 6, 2020 4:37 PM
>> *To:* Michael Himes <mh...@knights.ucf.edu>; dev@datasketches.apache.org
>> <de...@datasketches.apache.org>
>> *Cc:* Edo Liberty <ed...@gmail.com>; edo@edoliberty.com <
>> edo@edoliberty.com>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> Michael,
>>
>> Thank you for considering the DataSketches library.   I am adding this
>> thread to our dev@datasketches.apache.org so that our whole team can
>> contribute to finding a solution for you.
>>
>> WRT the error you experienced, please help us help you by sharing with us
>> what the exact error was.
>>
>> We are about to release a major upgrade to the DataSketches C++/Python
>> product in the next few weeks.  We have fixed a number of stability issues
>> and bugs, which may solve the problem.  Nonetheless, we want to work with
>> you to get your problem solved.
>>
>> Updating 1e5 sketches in a system is not a problem in Java or C++.   We
>> have real-time systems today that generate and process over 1e9 sketches
>> every day.  Unfortunately our experience tells us that looping in Python
>> code will be 10 to 100 times slower than Java or C++.  This is because the
>> code would have to switch from Python to C++ for every vector element.
>>
>> By comparison, the streaming-quantiles code could be easily modified to
>> use Numpy arrays and operate on vectors.
>>
>>
>> I would like to understand more about what you have in mind that would be
>> "easily modified".
>>
>> NumPy achieves its speed performance by doing all of the matrix
>> operations in pre-compiled C++ code.  To achieve best performance, we would
>> want to read and loop through the NumPy data structure on the C++ side
>> leveraging the C++ DataSketches library directly.  I am not sure what would
>> be involved to actually accomplish that.
>>
>> But first we need to get your Python + NumPy code working correctly with
>> our library so we can find out what its actual performance is.
>>
>> Cheers,
>>
>> Lee.
>>
>>
>>
>>
>>
>> On Wed, May 6, 2020 at 12:10 PM Michael Himes <mh...@knights.ucf.edu>
>> wrote:
>>
>> Hi Edo, Lee,
>>
>> Thanks for the prompt response.  I looked at the datasketches library,
>> and while it seems to have a lot more features, it looks like it'll be a
>> lot more difficult to get it to work for my desired use case.
>>
>> My problem is that I need quantiles for each element of a vector (length
>> on the order of 1e4 -- 1e5), for some finite stream of vectors (on the
>> order of 1e6 -- 1e8).  I tried using datasketches's KLL with Numpy arrays,
>> but it throws an error, so it doesn't seem like datasketches handles this
>> situation currently.
>>
>> To use datasketches, I think I would need to instantiate 1 object per
>> vector element, and I suspect this will slow things down considerably due
>> to iterating over the objects when each vector is processed.  By
>> comparison, the streaming-quantiles code could be easily modified to use
>> Numpy arrays and operate on vectors.  I ran a few unit tests on both codes
>> and found equivalent behavior, as expected.
>>
>> Do you have any recommendation(s) for this situation?  Are there known
>> limitations of the streaming-quantiles code that would cause issues for my
>> use case?  Are the other methods offered in datasketches 'better' than the
>> KLL implemented in streaming-quantiles?  I'm quite out of my area of
>> expertise, so I appreciate any advice you can offer, and I will of course
>> acknowledge it in the publication.
>>
>> Best,
>> Michael
>>
>> ------------------------------
>> *From:* Edo Liberty <ed...@gmail.com>
>> *Sent:* Tuesday, May 5, 2020 8:09 PM
>> *To:* Lee Rhodes <lr...@verizonmedia.com>; Michael Himes <
>> mhimes@knights.ucf.edu>
>> *Cc:* edo@edoliberty.com <ed...@edoliberty.com>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> +Lee
>>
>> Hi Michael, Thanks for reaching out.
>> While you can certainly do that, I recommend using the python-Binded
>> datasketches library. It will be more robust, faster, and bug free than my
>> code :)
>>
>> On Tue, May 5, 2020 at 14:11 Michael Himes <mh...@knights.ucf.edu>
>> wrote:
>>
>> Hi Edo,
>>
>> I'm currently working on a Python package for
>> machine-learning-accelerated exoplanet modeling.  It is free and open
>> source (see here if you're curious https://github.com/exosports/HOMER
>> and it's meant purely for reproducible academic research.
>>
>> I'm adding some new features to the software, and one of them requires
>> computing quantiles for a data set that cannot fit into memory.  After
>> searching around for different methods to do this, your KLL method seemed
>> to be a good option in terms of speed and space requirements.
>>
>> Rather than reinvent the wheel and code my own implementation of the
>> method from scratch, I was wondering if you'd be willing to allow me to use
>> your code?  I don't see a license, so I wanted to make sure you're okay
>> with this.  I could implement it as a submodule within my repo, or I could
>> only include the kll.py file and add some additional comments pointing to
>> your repo and such, whichever you prefer.
>>
>> Best,
>> Michael
>>
>> --
>> From my cell phone.
>>
>>

Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Posted by Jon Malkin <jo...@gmail.com>.
I also used k=160, so in this case we matched nicely. And the bunches of
2^5 or 2^7 you were testing are exactly what I meant when referring to
batched inputs. So that's good news.

I'll take a more careful look through the code -- there was something with
update using arrays of templated type A which was always cast to double,
for instance. But this is certainly promising.

  jon

On Tue, May 19, 2020 at 3:32 PM Michael Himes <mh...@knights.ucf.edu>
wrote:

> Great tests (especially with the ordering), Jon!
>
> I did some scaling tests for dimensionality (1, 10, and 100), and this is
> where I think the Numpy version shows its benefits.  I performed a test
> similar to your setup:
> - each sketch has k = 160 (unsure what you used for this value, if it
> matters)
> - 2^25 draws from a normalized Gaussian distribution (numpy.random.normal)
> - get_quantiles(0.5)
>
> d=1    -- 84 s (this is the 123 s case you ran)
> d=10   -- 88 s
> d=100  --  294 s
> d=1000 -- 2298 s (did this one for fun, but there is a lot of variability
> in runtime)
>
> Note that I did not use a single-value method, just the Numpy version.
> Also, I checked the compute cost of the Python loop, and it's about 1
> second, so most of that ~80 seconds is the communication between Python and
> C++.  The scaling relation looks to be better than linear, but there needs
> to be a few more tests here to really determine that.
>
> But, as Lee pointed out, there is non-negligible overhead from crossing
> the bridge between Python and C++.  It's small, but when doing it 2^25
> times it adds up.  The Numpy implementation allows you to cross that bridge
> much less often, albeit at the cost of some extra time programming that
> part.  If I set up a queue that holds 2^5 values and then updates it, it's
> quite a bit better.  Here are the results for the same dimensions as before:
>
> d=1   -- 8 s
> d=10  -- 31 s
> d=100 -- 257 s
>
> So, even with a small queue of 32 values, we see that a single sketch
> using kll_sketches is faster than a kll_sketch by a factor of 2-3.  And
> with the batch set to 2^7 values (this is how I use it in my project):
> d=1   -- 4.2 s
> d=10  -- 27 s
> d=100 -- 251 s
>
> The speed gain doesn't seem to scale with dimensionality, but I think that
> has more to do with the compute overhead of generating the data since Numpy
> tends to be faster when working in 1D vs multiple dimensions.  But we can
> see that it's possible to get runtimes much closer to C++ runtimes than
> would be expected.
>
> Michael
> ------------------------------
> *From:* Jon Malkin <jo...@gmail.com>
> *Sent:* Tuesday, May 19, 2020 4:58 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Well, one thought was maybe we could always use the vectorized kll in
> python and make it (relatively) easy to have it work with only 1 dimension.
> It looks like there's still a non-trivial performance hit from that. But
> wow.. I realized I could try something simple like reversing the
> declaration order of single-update vs vector-update in the wrapper class.
> And that dropped it to 35s!
>
> With that, it may be worth exploring a unified wrapper that handles single
> items or vectors.
>
>   jon
>
> On Tue, May 19, 2020 at 1:52 PM leerho <le...@gmail.com> wrote:
>
> We had a similar issue in Java trying to use JNI to access C code.  Every
> transition across the "boundary" between Java and C took from 10 to 100
> microseconds.  This made the JNI option pretty useless from our
> standpoint.
>
> I don't know python that well, but I could well imagine that there may be
> a similar issue here in moving data between Python and C++.
>
> That being said, compared to brute-force computation of these types of
> queries in Python vs using even these (what we consider slow performing)
> sketches in Python still may be a huge win.
>
> Lee.
>
>
>
> On Tue, May 19, 2020 at 1:28 PM Jon Malkin <jo...@gmail.com> wrote:
>
> I tried comparing the performance of the existing floats sketch vs the new
> thing with a single dimension. And then I made a second update method that
> handles a single item rather than creating an array of length 1 each time.
> Otherwise, the scripts were as identical as possible. I fed in 2^25
> gaussian-distributed values and queried for the mean to force some
> computation on the sketch. I think get_quantile(0.5) vs
> get_quantiles(0.5)[0][0] was the only difference.
>
> Existing kll_floats_sketch: 31s
> kll_floatarray_sketches: 123s
> with single-item update: 80s
>
> Same test in c++: 1.7s  (I can get it to 1.4s but that's using a worse RNG
> so this seemed more fair)
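For reference, a rough harness for the single-dimension baseline above could be as simple as the sketch below, assuming the datasketches Python package built from the C++ repo (kll_floats_sketch with update() and get_quantile(), and k passed to the constructor):

import time
import numpy as np
from datasketches import kll_floats_sketch

n = 2 ** 25
values = np.random.normal(size=n)    # gaussian-distributed input

sk = kll_floats_sketch(160)          # k = 160, matching the tests in this thread
start = time.time()
for v in values:
    sk.update(v)                     # one Python-to-C++ crossing per item
elapsed = time.time() - start

print(f"single-item updates: {elapsed:.1f} s, median estimate {sk.get_quantile(0.5):.3f}")

The vectorized timing swaps kll_floats_sketch for the custom wrapper class and feeds it whole arrays instead of looping in Python.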
>
> I didn't try anything with trying to batch updates, even though in theory
> the new object can support that. This was more a test to see the
> performance impact of using it for all kll sketches.
>
> At some level, if you're already ok taking the speed hit for python vs C++
> then maybe it doesn't matter. But >2x still seems significant to me.
>
>   jon
>
> On Thu, May 14, 2020 at 6:54 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Great, I'll be submitting the pull request shortly.  The codebase I'm
> working with doesn't have any of the changes made in the past week or so,
> hopefully that isn't too much of a hassle to merge.
>
> As an aside, my employer encourages us to contribute code to libraries
> like this, so I'm happy to work on additional features for the Python
> interface as needed.
>
> Michael
> ------------------------------
> *From:* Jon Malkin <jo...@gmail.com>
> *Sent:* Thursday, May 14, 2020 6:56 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> We've been polishing things up for a release, so that was one of several
> things that we fixed over the last several days. Thank you for finding it!
>
> Anyway, if you're generally happy with the state of things (and are
> allowed to under any employment terms), I'd encourage you to create pull
> request to merge your changes into the main repo. It doesn't need to be
> perfect as we can always make changes as part of the PR review or
> post-merge.
>
> Thanks,
>   jon
>
>
> On Mon, May 11, 2020 at 2:25 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Thanks for taking a look, Jon.
>
> I pushed an update that address 2 & 4.
>
> #3 is actually something I had a question about. I've tested passing
> numpy.nan into the update function, and it doesn't appear to break anything
> (min, max, etc all still work correctly).  However, the reported number of
> items per sketch counts the nan entries.  Is this the expected behavior, or
> should the get_n() method return a number that does not count the nans it
> has seen?  I expected the latter, so I'm worried that numpy's nan is being
> treated differently.
>
> Michael
> ------------------------------
> *From:* Jon Malkin <jo...@gmail.com>
> *Sent:* Monday, May 11, 2020 4:32 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> I didn't look in super close detail, but the code overall looks pretty
> good. Comments are below.
>
> Note that not all of these necessarily need changes or replies. I'm just
> trying to document things we'll want to think about for keeping the library
> general-purpose (and we can always make changes after merging, of course).
>
> 1. I worry the name kll_sketches is confusingly similar to kll_sketch.
> Maybe vector_kll_sketches? But if there's a way to extend KLL in the future
> to operate on an entire vector at a time (vs treating each dimension
> independently) that'd become confusing. I think an inherently vectorized
> version would be a very different beast, but I always worry I'm not being
> imaginative enough. If merging into the Apache codebase, I'd probably wait
> to see what the file looks like with the renaming before a final decision
> on moving to its own file.
>
> 2. What happens if the input to update() has >2 dimensions? If that'd be
> invalid, we should explicitly check and complain. If it'll Do The Right
> Thing by operating on the first 2 dimensions (meaning correct indices)
> that's fine, but otherwise should probably complain.
>
> 3. Can this handle sparse input vectors? Not sure how important that is in
> general, even if your project doesn't require it. kll_sketch will ignore
> NaNs, so those appearing would mean the number of items per sketch can
> already differ.
>
> 4. I'd probably eat the very slightly increased space and go with 32 bits
> for the number of dimensions (aka number of sketches). If trying to look at
> a distribution of values for some machine learning application, it'd be
> easy to overflow 65k dimensions for some tasks.
>
> 5. I imagine you've realized that it's easiest to do unit tests from
> python in this case. That's another advantage of having this live in the
> wrapper.
>
> 6. Finally, that assert issue is already obsolete :). Asserts were
> converted if/throw exceptions late last week. It'll be flagged as a
> conflict in merging, so no worries for now.
>
> Looking good at this point. And as I said, not all of these need changes
> or comments from you.
>
>   jon
>
> On Mon, May 11, 2020 at 7:09 AM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Understood, I went ahead and moved the new class to the kll_wrapper.cpp
> file -- I'll leave it to you to decide if it's better as its own file.
>
> Also, while gcc 7.4.0 compiles the code without issue, using gcc 7.5.0
> throws errors regarding the assert calls in kll_sketch_impl.hpp.  I added
> an include of assert.h there and then it compiled without issue.  It's
> possible that other compilers will also complain about that, so maybe this
> is a good update to the main branch.
>
> Michael
> ------------------------------
> *From:* Jon Malkin <jo...@gmail.com>
> *Sent:* Sunday, May 10, 2020 10:47 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> My only comment without having looked at actual code is that the new class
> would be more appropriate in the python wrapper. Maybe even drop it in as
> it's own file, as that would decrease recompile time a bit when debugging
> (that's pybind's suggestion, anyway). Probably not a huge difference with
> how light these wrappers are.
>
> If this is something that becomes widely used, to where we look at pushing
> it into the base library, we'd look at whether we could share any data
> across sketches. But we're far from that point currently. It'd be nice to
> reach the point where we need to consider that.
>
>   jon
>
> On Sun, May 10, 2020, 7:33 PM leerho <le...@gmail.com> wrote:
>
> Michael,  this has been a great interchange and certainly will allow us to
> move forward more quickly.
>
> Thank you for working on this on a Mother's Day Sunday!
>
> I'm sure Alex and Jon may have more questions, when they get a chance to
> look at it starting tomorrow.
>
> Cheers, and be safe and well!
>
> Lee.
>
> On Sun, May 10, 2020 at 6:25 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Re: testing, so far I've just done glorified unit tests for uniform and
> normal distributions of varying sizes.  I plan to do some timing tests vs
> the existing single-sketch Python class to see how it compares for 1, 10,
> and 100 streams.
>
> 1. That makes sense.  One option to allow full Numpy compatibility but
> without requiring a Python user to use Numpy would be to return everything
> as lists, rather than Numpy arrays.  Numpy users could then convert those
> lists into arrays, and non-Numpy users would be unaffected (aside from
> needing the pybind11/numpy.h header).  Alternatively, some flag could be
> set when instantiating the object that would control whether things are
> returned as lists or arrays, though this still requires the numpy.h header
> file.
>
> 2. I didn't change the kll_sketch code, I only defined a new (wrapper)
> class called kll_sketches, which spawns a user-specified number of
> sketches.  Each of those sketches are kll_sketch objects and uses all of
> the existing code for that.  For fast execution in Python, the parallel
> sketches must be spawned in C++, but the existing Python object could only
> spawn a single sketch since it wraps the kll_sketch class.  Perhaps the
> kll_sketches class would be better placed in the python/src/kll_wrapper.cpp
> file?  I suppose you wouldn't need this class if you weren't using Python.
>
> 3. Yes, SerDe is very straight-forward here.  I've marked some stuff as
> todo's, and that is one of them -- the plan is to do like you described and
> call the relevant kll_sketch method on each of the sketches and return that
> to Python in a sensible format.  For deserialization, it would just iterate
> through them and load them into the kll_sketches object.  I don't require
> it for my project, so I didn't bother to wrap that yet -- I'll take a look
> sometime this week after I finish my work for the day, shouldn't take long
> to do.
>
> 4. That makes sense.  Does using Numpy complicate that at all?  My thought
> is that since under the hood everything is using the existing kll_sketch
> class, it would have full compatibility with the rest of the library (once
> SerDe is added in).
>
> Michael
> ------------------------------
> *From:* leerho <le...@gmail.com>
> *Sent:* Sunday, May 10, 2020 8:42 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Thanks for the link to your code.  My colleagues, Jon and Alex, will take
> a closer look this next week.  They wrote this code so they are much closer
> to it than I.
>
> What you have done so far makes sense for you as you want to get this
> working in the NumPy environment as quickly as possible.  As soon as we
> start thinking about incorporating this into our library other concerns
> become important.
>
> 1. Adding API calls is the recommended way to add functionality (like
> NumPy) to a library.  We cannot change API calls in a way that is only
> useful with NumPy, because it would seriously impact other users of the
> library that don't need NumPy.  If both sets of calls cannot simultaneously
> exist in the same sketch API, then we need to consider other alternatives.
>
> 2.  Based on our previous discussions, I didn't envision that you would
> have to change the kll_sketch code itself other than perhaps a "wrapper"
> class that enables vectorized input to a vector of sketches and a
> vectorized get result that creates a vector result from a vector of
> sketches.  This would isolate the changes you need for NumPy from the
> sketch itself.  This is also much easier to support, maintain and debug.
>
> 3. If you don't change the internals of the sketch then SerDe becomes
> pretty straightforward. I don't know if you need a single serialization
> that represents a full vector of sketches,  but if you do, then I would
> just iterate over the individual serdes and figure out how to package it.
> I really don't think you want to have to rewrite this low-level stuff.
>
> 4. Binary compatibility is critically important for us and I think will be
> important for you as well.  There are two dimensions of binary
> compatibility: history and language.  This means that a kll sketch
> serialized from Java, can be successfully read by C++ and visa versa.
> Similarly, a kll sketch serialized today will be able to be read many years
> from now.     Another aspect of this would mean being able to collect, say,
> 100 sketches that were not created using the NumPy version, and being able
> to put them together in a NumPy vector; and visa versa.
>
> I hope all of this make sense to you.
>
> Cheers,
>
> Lee.
>
>
>
> On Sun, May 10, 2020 at 4:21 PM leerho <le...@gmail.com> wrote:
>
> Michael,
> This is great!  What testing have you been able to do so far?
>
>
> On Sun, May 10, 2020 at 3:31 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Lee,
>
> Thanks for all of that information, it's quite helpful to get a better
> understanding of things.
>
> I've put the code on Github if you'd like to take a look:
> https://github.com/mdhimes/incubator-datasketches-cpp
>
> Changes are
> - new class in kll/include/kll_sketch.hpp, w/ associated constructor in
> kll/include/kll_sketch_impl.hpp.  This class spawns a specified number of
> sketches.
> - new Python interface functions in python/src/kll_wrapper.cpp
>
> The only new dependency introduced is the pybind11/numpy.h header file.
> The new Numpy-compatible Python classes retain identical functionality to
> the existing classes (with minor changes to method names, e.g.,
> get_min_value --> get_min_values), except that I have not yet implemented
> merging or (de)serialization.  These would be straight-forward to
> implement, if needed.
>
> Re: characterization tests, I'll take a look at those tests you linked to
> and see about running them, time and compute resources permitting.
>
> Michael
> ------------------------------
> *From:* leerho <le...@gmail.com>
> *Sent:* Sunday, May 10, 2020 5:32 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Michael,
>
> Is there a place on GitHub somewhere where I could look at your code so
> far?  The reason I ask, is before you do a PR, we would like to determine
> where a contribution such as this should be placed.
>
> Our library is split up among different repositories, determined by
> language and dependencies.  This keeps the user downloads smaller and more
> focused.   We have two library repos for the core sketch algorithms, one
> for Java and one for C++/Python, where the dependencies are very lean,
> which simplifies integration into other systems.  We have separate repos
> for adaptors, which depend on one of the core repos. On the Java side, we
> have separate repos for adaptors for Apache Hive and Apache Pig, as the
> dependencies for each of these are quite large.  For C++, we have a
> dedicated repo for the adaptors for PostgreSQL.
>
> Some of our adaptors are hosted with the target system.  For example, our
> Druid adaptors were contributed directly into Apache Druid.
>
> I assume your code has dependencies on Python, NumPy and DataSketches-cpp.
> It is not clear to me at the moment whether we should create a separate
> repo for this or have a separate group of directories in our cpp repo.
>
> ****
> We have a separate repo for our characterization code, which is not
> formally "released" as an Apache release.  It exists because we want others
> to be able to reproduce (or challenge) our claims of speed performance or
> accuracy.  It is the one repo where we have all languages and many
> different dependencies.  The coding style is not as rigorous or as well
> documented as our repos that do have formal releases.
>
> Characterization testing is distinctly different from Unit Tests, which
> basically checks all the main code paths and makes sure that the program
> works as it should.  The key metric is code coverage and Unit Tests should
> be fast as it is run on every check-in of new code.  Characterization is
> also different from Integration Testing, which is testing how well the code
> works when integrated into larger systems.
>
> Characterization tests are unique to our kind of library. Because our
> algorithms are probabilistic in nature, in order to verify accuracy or
> speed performance we need to run many thousands of trials to eliminate
> statistical noise in the results.  And when the data is large, this can
> take a long time.  You can peruse our website for many examples as all the
> plots result from various characterization studies.  What appears on the
> website is but a small fraction of all the testing we have done.
>
> There are no "standard" tests as every sketch is different so we have to
> decide what is important to measure for a particular sketch, but the basic
> groups are *speed* and *accuracy*.
>
> For speed there are many possible measurements, but the basic ones are
> update speed, merge speed, Serialization / Deserialization speed, get
> estimate or get result speeds.
>
> For accuracy we want to validate that the sketch is performing within the
> bounds of the theoretical error distribution.  We want to measure this
> accuracy in the context of a stand-alone, purely streaming sketch and also
> in the context of merging many sketches together.
>
> We also try to do these same tests comparing the results against other
> alternatives users might have.  We have performed these same
> characterizations on other publically available sketches as well as against
> traditional, brute-force approaches to solving the same problem.
>
> For the solution you have developed, we would depend on you to decide what
> properties would be most important to characterize for users of this
> solution.  It should be very similar to what you would write in a paper
> describing this solution;  you want to convince the reader that this is
> very useful and why.
>
> Since the first sketch you have leveraged is the KLL quantiles sketch, I
> would think you would want some characterizations similar to what we did
> for our studies
> <https://datasketches.apache.org/docs/Quantiles/KLLSketch.html>
> comparing our older quantiles sketch and the KLL sketch.
>
> ****
> For the Java characterization tests, we have "standardized" on having
> small configuration files which define the key parameters of the test.
> These are simple text files
> <https://github.com/apache/incubator-datasketches-characterization/tree/master/src/main/resources>
> of key-value pairs.  We don't have any centralized definition of these
> pairs, just that they are human readable and intelligible.  They are
> different for each type of sketch.
>
> For the C++ tests, we don't have a collection of config files yet (this is
> one of our TODOs), but the same kind of parameters are set in the code
> itself.
>
> We will likely want to set up a separate directory for your
> characterization tests.
>
> I hope you find this helpful.
>
> Cheers,
>
> Lee.
>
> On Sun, May 10, 2020 at 10:05 AM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> The code is in a good state now.  It can take individual values, lists, or
> Numpy arrays as input, and it returns back Numpy arrays.  There are some
> additional features, like being able to specify which sketches the user
> wants to, e.g., get quantiles for.
>
> But, I have only done minor testing with uniform and normal
> distributions.  I'd like to put it through more extensive testing (and some
> documentation) before releasing it, and it sounds like your
> characterization tests are the way to go -- it's not science if it's not
> reproducible!  Is there a standard set of tests for this purpose?  If not,
> are there standard tests that have been used for the existing codebase?
>
> Michael
> ------------------------------
> *From:* leerho <le...@gmail.com>
> *Sent:* Saturday, May 9, 2020 7:21 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> This is great.  The first step is to get your project working!  Once you
> think you are ready, it would be really useful if you could do some
> characterization testing in the NumPy environment. Characterization tests
> are what we run to fully understand how a sketch performs over a range of
> parameters and using thousands to millions of trials.  You can see some of
> the accuracy and speed performance plots of various sketches on our
> website.  Sometimes these can take hours to run.  We typically use
> synthetic data to drive our characterization tests to make them
> reproducible.
>
> Real data can also be used and one comparison test I would recommend is
> comparing how long it takes to get approximate results using sketches
> versus how long it would take to get exact results using brute force
> methods.  The bigger the data set is the better :)
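One simple form of that comparison, on synthetic data, is sketched below. It only illustrates the methodology (exact NumPy quantile over the full array versus a streaming KLL estimate); for data that fits comfortably in memory the brute-force call will usually win on wall time, and the sketch's advantage appears once the data no longer fits or arrives as a stream:

import time
import numpy as np
from datasketches import kll_floats_sketch

data = np.random.normal(size=2 ** 25)     # stand-in for a large real data set

t0 = time.time()
exact = np.quantile(data, 0.5)            # brute force: needs all data in memory
t_exact = time.time() - t0

sk = kll_floats_sketch(200)               # k controls the accuracy/space trade-off
t0 = time.time()
for v in data:
    sk.update(v)                          # streaming: bounded memory
approx = sk.get_quantile(0.5)
t_sketch = time.time() - t0

print(f"exact {exact:.4f} in {t_exact:.2f} s; KLL {approx:.4f} in {t_sketch:.2f} s")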
>
> We don't have much experience with NumPy so this will be a new environment
> for us.  But before you get too deep into this please get us involved.  We
> have been characterizing these streaming algorithms for a number of years,
> and would like to help you.
>
> Cheers,
>
> Lee.
>
> On Sat, May 9, 2020 at 2:18 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> I'm not quite sure what being a committer entails, but yeah I'm happy to
> contribute.  I can't commit a lot of time to working on it, but with how
> things went for KLL I don't think it will take a lot of time for the other
> sketches if they are formatted in a similar manner.  Getting this library
> integrated into numpy/scipy would be awesome, I'm sure I could get some
> others in my field to begin using it.
>
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Saturday, May 9, 2020 5:06 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>; dev@datasketches.apache.org
> <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> This is just awesome!   Would you be interested in becoming a committer on
> our project?  It is not automatic, but we could work with you to bring you
> up to speed on the other sketches in the library.  If you could help us
> integrate DataSketches into NumPy and possibly SciPy (not sure if this is
> necessary) it would be a very significant contribution and we would
> definitely want you to be part of our community!
>
> Thanks,
>
> Lee.
>
> On Sat, May 9, 2020 at 1:41 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Lee,
>
> Thanks for the notice, I went ahead and subscribed to the list.
>
> As for Jon's email, this is actually what I have currently implemented!
> Once I finish ironing out a couple improvements, I'm going to move some
> code around to follow the existing coding style, put it on Github, and
> submit a pull request.
>
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Saturday, May 9, 2020 4:22 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>
> *Subject:* Fwd: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Hi Michael,
> I don't think you saw this email as I doubt you are subscribed to our
> dev@datasketches.apache.org email list.
>
> We would like to have you as part of our larger community, as others might
> also have suggestions on how to move your project forward.
> You can subscribe by sending an empty email to
> dev-subscribe@datasketches.apache.org.
>
> Lee.
>
> ---------- Forwarded message ---------
> From: *Jon Malkin* <jo...@gmail.com>
> Date: Thu, May 7, 2020 at 4:11 PM
> Subject: Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
> To: <de...@datasketches.apache.org>
> Cc: Lee Rhodes <lr...@verizonmedia.com>, Edo Liberty <
> edo.liberty@gmail.com>, edo@edoliberty.com <ed...@edoliberty.com>
>
>
> We're using pybind11 to get a C++ interface with python (vs raw C). The
> wrappers themselves are quite thin, but they do have examples of calling
> functions defined in the wrapper as opposed to only the sketch object.
>
> I believe the easiest way to do this will be to define a pretty simple C++
> object and create a pybind wrapper for it.  That object would contain a
> std::vector<kll_sketch>.  Then you'd define an update method for your
> custom object that iterates through a numpy array and calls update() on the
> appropriate sketch. You'd also want to define something similar for
> get_quantile() or whatever other methods you need that iterates through
> that vector of sketches and returns the result in a numpy array.
>
> That's a pretty lightweight object. And then you'd use a similar thin
> pybind wrapper around it to make it play nicely with python. Since our C++
> library is just templates, you'd end up with a free-standing library, with
> no requirement that the base datasketches library be involved.
>
>   jon
>
> On Thu, May 7, 2020 at 1:08 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> I would be happy to share whatever I come up with (if anything).  The lack
> of a Numpy/Scipy implementation is what led me to the DataSketches library,
> it would be very useful to myself and others if it were a part of
> Numpy/Scipy.
>
> For what it's worth, passing in a Numpy array and manipulating it from the
> C++ side is quite easy.  On the other hand, figuring out how to spawn m
> sketches and pass the values along to that looks like it'll be more
> challenging, there is a lot of code here and it'll take some time for me to
> familiarize myself with it.
>
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Thursday, May 7, 2020 12:00 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>
> *Cc:* Edo Liberty <ed...@gmail.com>; dev@datasketches.apache.org <
> dev@datasketches.apache.org>; edo@edoliberty.com <ed...@edoliberty.com>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> If you do figure out how to do this, it would be great if you could share
> it with us.  We would like to extend it to other sketches and submit it as
> an added functionality to NumPy.  I have been looking at the NumPy and
> SciPy libraries and have not found anything close to what we have.
>
> Lee.
>
>
> On Thu, May 7, 2020 at 7:08 AM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Lee, Jon,
>
> Thanks for the information.  I tried to vectorize things this morning and
> ran into that exact problem -- since the offsets can differ, it leads to
> slices of different lengths, which wouldn't be possible to store as a
> single Numpy array.
>
> Lee, your understanding of my problem is spot on.  n vectors of size m,
> where all m elements of each vector are a float (no NaNs or missing
> values).  I am interested in quantiles at rank r for each of the m
> streams.  Only 1 sketch will operate simultaneously, saving/loading the
> sketch is not required (though it would be a nice feature), and sketches
> would not need to be merged (no serialization/deserialization).
>
> Not surprisingly, it looks like your original suggestion of handling this
> on the C++ side is the way to go.  Once I have time to dive into the code,
> my plan is to write something that implements what you described in the
> earlier email.
>
> Thanks,
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Wednesday, May 6, 2020 10:43 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>
> *Cc:* dev@datasketches.apache.org <de...@datasketches.apache.org>; Edo
> Liberty <ed...@gmail.com>; edo@edoliberty.com <ed...@edoliberty.com>
>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Michael,
>
> One of my colleagues, Jon Malkin, pointed out that the vector-KLL will not
> work for another reason and that is for each dimension, choosing whether to
> delete the odd or even values in the compactor must be random and
> independent of the other dimensions.  Otherwise you might get unwanted
> correlation effects between the dimensions.
>
> This is another argument that you should have independent compactors for
> each dimension.  So you might as well stick with individual sketches for
> each dimension.
>
> Lee.
>
> On Wed, May 6, 2020 at 4:39 PM Lee Rhodes <lr...@verizonmedia.com>
> wrote:
>
> Michael,
>
> Allow me to back up for a moment to make sure I understand your problem.
>
> You have a large number of large vectors of the form *V_n = {x_i}:*  *n*
> vectors of size *m, *where *x* is a *number* and *x_i* is the *i*th
> element, or equivalently, the *i*th dimension.
>
> Assumptions:
>
>    - All vectors, *V*, are of the same size *m.*
>    - All elements, *x_i*, are valid numbers of the same type. No missing
>    values, and if you are using *floats*, this means no *NaN*s.
>
> In aggregate, the *n* vectors represent *m* *independent* distributions
> of values.
>
> Your task is to be able to obtain *m* quantiles at rank *r* in a single
> query.
>
> ****
> To do this, using your idea, would require vectorization of the entire
> sketch and not just the compactors.  The inputs are vectors, the result of
> operations such as getQuantile(r), getQuantileUpperBound(r),
> getQuantileLowerBound(r), are also vectors.
>
> This sketch will be a large data structure, which leads to more questions
> ...
>
>    - Do you anticipate having many of these vectorized sketches operating
>    simultaneously?
>    - Is there any requirement to store and later retrieve this sketch?
>    - Or, the nearly equivalent question: Do you require merging of these
>    sketches (across clusters, for example)?  Which also means serialization
>    and deserialization.
>
> I am concerned that this vector-quantiles sketch would be limited in the
> sense that it may not be as widely applicable as it could be.
>
> Our experience with real data is that it is ugly with missing values, NaN,
> nulls, etc.  Which means we would not be able to vectorize the compactor.
> Each dimension *i* would need a separate independent compactor because
> the compaction times will vary depending on missing values or NaNs in the
> data.
>
> Spacewise, I don't think having separate independent sketches for each
> dimension would be much smaller than vectorizing the entire sketch, because
> the internals of the existing sketch are already quite space efficient
> leveraging compact arrays, etc.
>
> As a first step I would favor figuring out how to access the NumPy data
> structure on the C++ side, having individual sketches for each
> dimension, and doing the iterations updating the sketches in C++.   It also
> has the advantage of leveraging code that exists and it would automatically
> be able to leverage any improvements to the sketch code over time.  In
> addition, it could be a prototype of how to integrate other sketches into
> the NumPy ecosystem.
>
> A fully vectorized sketch would be a separate implementation and would not
> be able to take advantage of these points.
>
> Lee.
>
> On Wed, May 6, 2020 at 2:47 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Lee,
>
> I don't think there is a problem with the DataSketches library, just that
> it doesn't support what I am trying to do -- looking in the documentation,
> it only supports streams of ints or floats, and those situations work fine
> for me.  Here's what I did:
> - began with the KLL test .py file:
> https://github.com/apache/incubator-datasketches-cpp/blob/master/python/tests/kll_test.py
> - replaced line 30 with kll.update(np.ones(10) * randn())  to have a Numpy
> array of 10 identical values.
> - ran the code
>
> This leads to the following error, as expected:
> TypeError: update(): incompatible function arguments. The following
> argument types are supported:
>     1. (self: datasketches.kll_floats_sketch, item: float) -> None
>
> Invoked with: <datasketches.kll_floats_sketch object at 0x7f1e128989d0>,
> array([-1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424,
>        -1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424])
>
> It's not coded to support Numpy arrays, therefore it complains.  What I
> would ideally like to have happen in this scenario is it would treat each
> element in the array as a separate stream.  Then, later when getting a
> given quantile, it would give 10 values, one for each stream.  I don't see
> an easy approach to implementing this on the Python side besides a very
> slow iterative approach, and admittedly my C++ is quite rusty so I haven't
> looked into the codebase to see how I might modify things there to support
> this functionality.
>
> Re: the streaming-quantiles code being easily modified, I believe the only
> necessary changes would be changing the Compactor class to be a subclass of
> numpy.ndarray, rather than list, and implementing methods for the
> list-specific methods that are used, like .append().  Then, it isn't
> necessary to loop over the streams since we can make use of Numpy's
> broadcasting, which will handle the looping in its C++ code, as you
> mentioned.  I'll work on this and see if it really is as straight-forward
> as it seems.
>
> If you have any advice on how to use DataSketches for my problem, I'm
> certainly open to that.
>
> Thanks,
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Wednesday, May 6, 2020 4:37 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>; dev@datasketches.apache.org
> <de...@datasketches.apache.org>
> *Cc:* Edo Liberty <ed...@gmail.com>; edo@edoliberty.com <
> edo@edoliberty.com>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Michael,
>
> Thank you for considering the DataSketches library.   I am adding this
> thread to our dev@datasketches.apache.org so that our whole team can
> contribute to finding a solution for you.
>
> WRT the error you experienced, please help us help you by sharing with us
> what the exact error was.
>
> We are about to release a major upgrade to the DataSketches C++/Python
> product in the next few weeks.  We have fixed a number of stability issues
> and bugs, which may solve the problem.  Nonetheless, we want to work with
> you to get your problem solved.
>
> Updating 1e5 sketches in a system is not a problem in Java or C++.   We
> have real-time systems today that generate and process over 1e9 sketches
> every day.  Unfortunately our experience tells us that looping in Python
> code will be 10 to 100 times slower than Java or C++.  This is because the
> code would have to switch from Python to C++ for every vector element.
>
> By comparison, the streaming-quantiles code could be easily modified to
> use Numpy arrays and operate on vectors.
>
>
> I would like to understand more about what you have in mind that would be
> "easily modified".
>
> NumPy achieves its speed performance by doing all of the matrix operations
> in pre-compiled C++ code.  To achieve best performance, we would want to
> read and loop through the NumPy data structure on the C++ side leveraging
> the C++ DataSketches library directly.  I am not sure what would be
> involved to actually accomplish that.
>
> But first we need to get your Python + NumPy code working correctly with
> our library so we can find out what its actual performance is.
>
> Cheers,
>
> Lee.
>
>
>
>
>
> On Wed, May 6, 2020 at 12:10 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Edo, Lee,
>
> Thanks for the prompt response.  I looked at the datasketches library, and
> while it seems to have a lot more features, it looks like it'll be a lot
> more difficult to get it to work for my desired use case.
>
> My problem is that I need quantiles for each element of a vector (length
> on the order of 1e4 -- 1e5), for some finite stream of vectors (on the
> order of 1e6 -- 1e8).  I tried using datasketches's KLL with Numpy arrays,
> but it throws an error, so it doesn't seem like datasketches handles this
> situation currently.
>
> To use datasketches, I think I would need to instantiate 1 object per
> vector element, and I suspect this will slow things down considerably due
> to iterating over the objects when each vector is processed.  By
> comparison, the streaming-quantiles code could be easily modified to use
> Numpy arrays and operate on vectors.  I ran a few unit tests on both codes
> and found equivalent behavior, as expected.
>
> Do you have any recommendation(s) for this situation?  Are there known
> limitations of the streaming-quantiles code that would cause issues for my
> use case?  Are the other methods offered in datasketches 'better' than the
> KLL implemented in streaming-quantiles?  I'm quite out of my area of
> expertise, so I appreciate any advice you can offer, and I will of course
> acknowledge it in the publication.
>
> Best,
> Michael
>
> ------------------------------
> *From:* Edo Liberty <ed...@gmail.com>
> *Sent:* Tuesday, May 5, 2020 8:09 PM
> *To:* Lee Rhodes <lr...@verizonmedia.com>; Michael Himes <
> mhimes@knights.ucf.edu>
> *Cc:* edo@edoliberty.com <ed...@edoliberty.com>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> +Lee
>
> Hi Michael, Thanks for reaching out.
> While you can certainly do that, I recommend using the python-Binded
> datasketches library. It will be more robust, faster, and bug free than my
> code :)
>
> On Tue, May 5, 2020 at 14:11 Michael Himes <mh...@knights.ucf.edu> wrote:
>
> Hi Edo,
>
> I'm currently working on a Python package for machine-learning-accelerated
> exoplanet modeling.  It is free and open source (see here if you're curious
> https://github.com/exosports/HOMER
> and it's meant purely for reproducible academic research.
>
> I'm adding some new features to the software, and one of them requires
> computing quantiles for a data set that cannot fit into memory.  After
> searching around for different methods to do this, your KLL method seemed
> to be a good option in terms of speed and space requirements.
>
> Rather than reinvent the wheel and code my own implementation of the
> method from scratch, I was wondering if you'd be willing to allow me to use
> your code?  I don't see a license, so I wanted to make sure you're okay
> with this.  I could implement it as a submodule within my repo, or I could
> only include the kll.py file and add some additional comments pointing to
> your repo and such, whichever you prefer.
>
> Best,
> Michael
>
> --
> From my cell phone.
>
>

Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Posted by Michael Himes <mh...@knights.ucf.edu>.
Great tests (especially with the ordering), Jon!

I did some scaling tests for dimensionality (1, 10, and 100), and this is where I think the Numpy version shows its benefits.  I performed a test similar to your setup:
- each sketch has k = 160 (unsure what you used for this value, if it matters)
- 2^25 draws from a normalized Gaussian distribution (numpy.random.normal)
- get_quantiles(0.5)

d=1    -- 84 s (this is the 123 s case you ran)
d=10   -- 88 s
d=100  --  294 s
d=1000 -- 2298 s (did this one for fun, but there is a lot of variability in runtime)

Note that I did not use a single-value method, just the Numpy version.  Also, I checked the compute cost of the Python loop, and it's about 1 second, so most of that ~80 seconds is the communication between Python and C++.  The scaling relation looks to be better than linear, but there need to be a few more tests here to really determine that.

But, as Lee pointed out, there is non-negligible overhead from crossing the bridge between Python and C++.  It's small, but when doing it 2^25 times it adds up.  The Numpy implementation allows you to cross that bridge much less often, albeit at the cost of some extra time programming that part.  If I set up a queue that holds 2^5 values and then updates it, it's quite a bit better.  Here are the results for the same dimensions as before:

d=1   -- 8 s
d=10  -- 31 s
d=100 -- 257 s

So, even with a small queue of 32 values, we see that kll_sketches with a single stream is faster than a plain kll_sketch by a factor of 2-3.  And with the batch set to 2^7 values (this is how I use it in my project):
d=1   -- 4.2 s
d=10  -- 27 s
d=100 -- 251 s

The speed gain doesn't seem to scale with dimensionality, but I think that has more to do with the compute overhead of generating the data since Numpy tends to be faster when working in 1D vs multiple dimensions.  But we can see that it's possible to get runtimes much closer to C++ runtimes than would be expected.
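
For concreteness, the batching looks roughly like the sketch below.  The class name kll_floatarray_sketches, its constructor, and the vector update() signature follow the work-in-progress wrapper discussed in this thread, so treat them as placeholders rather than a released API; the point is just to fill a (batch x d) buffer in Python and cross into C++ once per batch instead of once per element.

    import numpy as np
    from datasketches import kll_floatarray_sketches  # class name assumed from this thread, not a released API

    d, k, batch = 100, 160, 2**7
    sketches = kll_floatarray_sketches(k, d)      # one KLL sketch per dimension (constructor shape assumed)
    buf = np.empty((batch, d), dtype=np.float32)

    filled = 0
    for _ in range(2**15):                        # stand-in for the real stream of vectors
        buf[filled] = np.random.normal(size=d)
        filled += 1
        if filled == batch:
            sketches.update(buf)                  # one Python -> C++ crossing per batch, not per element
            filled = 0
    if filled:
        sketches.update(buf[:filled])             # flush the partial batch

    medians = sketches.get_quantiles(0.5)         # assumed to return one estimate per dimension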

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>
Sent: Tuesday, May 19, 2020 4:58 PM
To: dev@datasketches.apache.org <de...@datasketches.apache.org>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Well, one thought was maybe we could always use the vectorized kll in python and make it (relatively) easy to have it work with only 1 dimension. It looks like there's still a non-trivial performance hit from that. But wow.. I realized I could try something simple like reversing the declaration order of single-update vs vector-update in the wrapper class. And that dropped it to 35s!

With that, it may be worth exploring a unified wrapper that handles single items or vectors.

  jon

On Tue, May 19, 2020 at 1:52 PM leerho <le...@gmail.com>> wrote:
We had a similar issue in Java trying to use JNI to access C code.  Every transition across the "boundary" between Java and C took from 10 to 100 microseconds.  This made the JNI option pretty useless from our standpoint.

I don't know python that well, but I could well imagine that there may be a similar issue here in moving data between Python and C++.

That being said, compared to brute-force computation of these types of queries in Python vs using even these (what we consider slow performing) sketches in Python still may be a huge win.

Lee.



On Tue, May 19, 2020 at 1:28 PM Jon Malkin <jo...@gmail.com>> wrote:
I tried comparing the performance of the existing floats sketch vs the new thing with a single dimension. And then I made a second update method that handles a single item rather than creating an array of length 1 each time. Otherwise, the scripts were as identical as possible. I fed in 2^25 gaussian-distributed values and queried for the median to force some computation on the sketch. I think get_quantile(0.5) vs get_quantiles(0.5)[0][0] was the only difference.

Existing kll_floats_sketch: 31s
kll_floatarray_sketches: 123s
with single-item update: 80s

Same test in c++: 1.7s  (I can get it to 1.4s but that's using a worse RNG so this seemed more fair)
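
A minimal version of the single-sketch side of that comparison looks roughly like this (k and the RNG here are illustrative, not necessarily what produced the numbers above; only the released kll_floats_sketch calls are assumed):

    import time
    import numpy as np
    from datasketches import kll_floats_sketch

    values = np.random.standard_normal(2**25).astype(np.float32)

    sk = kll_floats_sketch(160)                # k chosen for illustration
    t0 = time.time()
    for v in values:
        sk.update(v)                           # one Python -> C++ crossing per item
    print('update time: %.1f s' % (time.time() - t0))
    print('median estimate:', sk.get_quantile(0.5))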

I didn't try anything with trying to batch updates, even though in theory the new object can support that. This was more a test to see the performance impact of using it for all kll sketches.

At some level, if you're already ok taking the speed hit for python vs C++ then maybe it doesn't matter. But >2x still seems significant to me.

  jon

On Thu, May 14, 2020 at 6:54 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Great, I'll be submitting the pull request shortly.  The codebase I'm working with doesn't have any of the changes made in the past week or so; hopefully that isn't too much of a hassle to merge.

As an aside, my employer encourages us to contribute code to libraries like this, so I'm happy to work on additional features for the Python interface as needed.

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Thursday, May 14, 2020 6:56 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

We've been polishing things up for a release, so that was one of several things that we fixed over the last several days. Thank you for finding it!

Anyway, if you're generally happy with the state of things (and are allowed to under any employment terms), I'd encourage you to create a pull request to merge your changes into the main repo. It doesn't need to be perfect as we can always make changes as part of the PR review or post-merge.

Thanks,
  jon


On Mon, May 11, 2020 at 2:25 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Thanks for taking a look, Jon.

I pushed an update that addresses 2 & 4.

#3 is actually something I had a question about. I've tested passing numpy.nan into the update function, and it doesn't appear to break anything (min, max, etc all still work correctly).  However, the reported number of items per sketch counts the nan entries.  Is this the expected behavior, or should the get_n() method return a number that does not count the nans it has seen?  I expected the latter, so I'm worried that numpy's nan is being treated differently.

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Monday, May 11, 2020 4:32 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

I didn't look in super close detail, but the code overall looks pretty good. Comments are below.

Note that not all of these necessarily need changes or replies. I'm just trying to document things we'll want to think about for keeping the library general-purpose (and we can always make changes after merging, of course).

1. I worry the name kll_sketches is confusingly similar to kll_sketch. Maybe vector_kll_sketches? But if there's a way to extend KLL in the future to operate on an entire vector at a time (vs treating each dimension independently) that'd become confusing. I think an inherently vectorized version would be a very different beast, but I always worry I'm not being imaginative enough. If merging into the Apache codebase, I'd probably wait to see what the file looks like with the renaming before a final decision on moving to its own file.

2. What happens if the input to update() has >2 dimensions? If that'd be invalid, we should explicitly check and complain. If it'll Do The Right Thing by operating on the first 2 dimensions (meaning correct indices) that's fine, but otherwise should probably complain.

3. Can this handle sparse input vectors? Not sure how important that is in general, even if your project doesn't require it. kll_sketch will ignore NaNs, so those appearing would mean the number of items per sketch can already differ.

4. I'd probably eat the very slightly increased space and go with 32 bits for the number of dimensions (aka number of sketches). If trying to look at a distribution of values for some machine learning application, it'd be easy to overflow 65k dimensions for some tasks.

5. I imagine you've realized that it's easiest to do unit tests from python in this case. That's another advantage of having this live in the wrapper.

6. Finally, that assert issue is already obsolete :). Asserts were converted if/throw exceptions late last week. It'll be flagged as a conflict in merging, so no worries for now.

Looking good at this point. And as I said, not all of these need changes or comments from you.

  jon

On Mon, May 11, 2020 at 7:09 AM Michael Himes <mh...@knights.ucf.edu>> wrote:
Understood, I went ahead and moved the new class to the kll_wrapper.cpp file -- I'll leave it to you to decide if it's better as its own file.

Also, while gcc 7.4.0 compiles the code without issue, using gcc 7.5.0 throws errors regarding the assert calls in kll_sketch_impl.hpp.  I added an include of assert.h there and then it compiled without issue.  It's possible that other compilers will also complain about that, so maybe this is a good update to the main branch.

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Sunday, May 10, 2020 10:47 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

My only comment without having looked at actual code is that the new class would be more appropriate in the python wrapper. Maybe even drop it in as its own file, as that would decrease recompile time a bit when debugging (that's pybind's suggestion, anyway). Probably not a huge difference with how light these wrappers are.

If this is something that becomes widely used, to where we look at pushing it into the base library, we'd look at whether we could share any data across sketches. But we're far from that point currently. It'd be nice to need to consider that.

  jon

On Sun, May 10, 2020, 7:33 PM leerho <le...@gmail.com>> wrote:
Michael,  this has been a great interchange and certainly will allow us to move forward more quickly.

Thank you for working on this on a Mother's Day Sunday!

I'm sure Alex and Jon may have more questions, when they get a chance to look at it starting tomorrow.

Cheers, and be safe and well!

Lee.

On Sun, May 10, 2020 at 6:25 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Re: testing, so far I've just done glorified unit tests for uniform and normal distributions of varying sizes.  I plan to do some timing tests vs the existing single-sketch Python class to see how it compares for 1, 10, and 100 streams.

1. That makes sense.  One option to allow full Numpy compatibility but without requiring a Python user to use Numpy would be to return everything as lists, rather than Numpy arrays.  Numpy users could then convert those lists into arrays, and non-Numpy users would be unaffected (aside from needing the pybind11/numpy.h header).  Alternatively, some flag could be set when instantiating the object that would control whether things are returned as lists or arrays, though this still requires the numpy.h header file.

2. I didn't change the kll_sketch code; I only defined a new (wrapper) class called kll_sketches, which spawns a user-specified number of sketches.  Each of those sketches is a kll_sketch object and uses all of the existing code for that.  For fast execution in Python, the parallel sketches must be spawned in C++, but the existing Python object could only spawn a single sketch since it wraps the kll_sketch class.  Perhaps the kll_sketches class would be better placed in the python/src/kll_wrapper.cpp file?  I suppose you wouldn't need this class if you weren't using Python.

3. Yes, SerDe is very straight-forward here.  I've marked some stuff as todo's, and that is one of them -- the plan is to do like you described and call the relevant kll_sketch method on each of the sketches and return that to Python in a sensible format.  For deserialization, it would just iterate through them and load them into the kll_sketches object.  I don't require it for my project, so I didn't bother to wrap that yet -- I'll take a look sometime this week after I finish my work for the day, shouldn't take long to do.

4. That makes sense.  Does using Numpy complicate that at all?  My thought is that since under the hood everything is using the existing kll_sketch class, it would have full compatibility with the rest of the library (once SerDe is added in).

Michael
________________________________
From: leerho <le...@gmail.com>>
Sent: Sunday, May 10, 2020 8:42 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Thanks for the link to your code.  My colleagues, Jon and Alex, will take a closer look this next week.  They wrote this code so they are much closer to it than I.

What you have done so far makes sense for you as you want to get this working in the NumPy environment as quickly as possible.  As soon as we start thinking about incorporating this into our library other concerns become important.

1. Adding API calls is the recommended way to add functionality (like NumPy) to a library.  We cannot change API calls in a way that is only useful with NumPy, because it would seriously impact other users of the library that don't need NumPy.  If both sets of calls cannot simultaneously exist in the same sketch API, then we need to consider other alternatives.

2.  Based on our previous discussions, I didn't envision that you would have to change the kll_sketch code itself other than perhaps a "wrapper" class that enables vectorized input to a vector of sketches and a vectorized get result that creates a vector result from a vector of sketches.  This would isolate the changes you need for NumPy from the sketch itself.  This is also much easier to support, maintain and debug.

3. If you don't change the internals of the sketch then SerDe becomes pretty straightforward. I don't know if you need a single serialization that represents a full vector of sketches, but if you do, then I would just iterate over the individual serdes and figure out how to package it.  I really don't think you want to have to rewrite this low-level stuff.  (A small sketch of that per-sketch packaging idea follows after this list.)

4. Binary compatibility is critically important for us and I think will be important for you as well.  There are two dimensions of binary compatibility: history and language.  This means that a kll sketch serialized from Java can be successfully read by C++ and vice versa.  Similarly, a kll sketch serialized today will be able to be read many years from now.  Another aspect of this would mean being able to collect, say, 100 sketches that were not created using the NumPy version, and being able to put them together in a NumPy vector; and vice versa.
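
To make the SerDe point concrete, a minimal per-sketch packaging scheme might look like the sketch below.  The length-prefixed framing is just an illustration, not a DataSketches serialization format; it only assumes the kll_floats_sketch serialize() / deserialize() calls from the Python wrapper.

    import struct
    from datasketches import kll_floats_sketch

    def serialize_vector(sketches):
        # Length-prefix each sketch's bytes so the blob can be split apart again.
        parts = [struct.pack('<I', len(sketches))]
        for sk in sketches:
            raw = sk.serialize()
            parts.append(struct.pack('<I', len(raw)))
            parts.append(raw)
        return b''.join(parts)

    def deserialize_vector(blob):
        count = struct.unpack_from('<I', blob, 0)[0]
        pos, sketches = 4, []
        for _ in range(count):
            size = struct.unpack_from('<I', blob, pos)[0]
            pos += 4
            sketches.append(kll_floats_sketch.deserialize(blob[pos:pos + size]))
            pos += size
        return sketches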

I hope all of this makes sense to you.

Cheers,

Lee.



On Sun, May 10, 2020 at 4:21 PM leerho <le...@gmail.com>> wrote:
Michael,
This is great!  What testing have you been able to do so far?


On Sun, May 10, 2020 at 3:31 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Lee,

Thanks for all of that information, it's quite helpful to get a better understanding of things.

I've put the code on Github if you'd like to take a look: https://github.com/mdhimes/incubator-datasketches-cpp

Changes are
- new class in kll/include/kll_sketch.hpp, w/ associated constructor in kll/include/kll_sketch_impl.hpp.  This class spawns a specified number of sketches.
- new Python interface functions in python/src/kll_wrapper.cpp

The only new dependency introduced is the pybind11/numpy.h header file.  The new Numpy-compatible Python classes retain identical functionality to the existing classes (with minor changes to method names, e.g., get_min_value --> get_min_values), except that I have not yet implemented merging or (de)serialization.  These would be straight-forward to implement, if needed.

Re: characterization tests, I'll take a look at those tests you linked to and see about running them, time and compute resources permitting.

Michael
________________________________
From: leerho <le...@gmail.com>>
Sent: Sunday, May 10, 2020 5:32 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Michael,

Is there a place on GitHub somewhere where I could look at your code so far?  The reason I ask, is before you do a PR, we would like to determine where a contribution such as this should be placed.

Our library is split up among different repositories, determined by language and dependencies.  This keeps the user downloads smaller and more focused.   We have two library repos for the core sketch algorithms, one for Java and one for C++/Python, where the dependencies are very lean, which simplifies integration into other systems.  We have separate repos for adaptors, which depend on one of the core repos. On the Java side, we have separate repos for adaptors for Apache Hive and Apache Pig, as the dependencies for each of these are quite large.  For C++, we have a dedicated repo for the adaptors for PostgreSQL.

Some of our adaptors are hosted with the target system.  For example, our Druid adaptors were contributed directly into Apache Druid.

I assume your code has dependencies on Python, NumPy and DataSketches-cpp. It is not clear to me at the moment whether we should create a separate repo for this or have a separate group of directories in our cpp repo.

****
We have a separate repo for our characterization code, which is not formally "released" as an Apache release.  It exists because we want others to be able to reproduce (or challenge) our claims of speed performance or accuracy.  It is the one repo where we have all languages and many different dependencies.  The coding style is not as rigorous or as well documented as our repos that do have formal releases.

Characterization testing is distinctly different from Unit Tests, which basically check all the main code paths and make sure that the program works as it should.  The key metric is code coverage, and Unit Tests should be fast as they are run on every check-in of new code.  Characterization is also different from Integration Testing, which is testing how well the code works when integrated into larger systems.

Characterization tests are unique to our kind of library. Because our algorithms are probabilistic in nature, in order to verify accuracy or speed performance we need to run many thousands of trials to eliminate statistical noise in the results.  And when the data is large, this can take a long time.  You can peruse our website for many examples as all the plots result from various characterization studies.  What appears on the website is but a small fraction of all the testing we have done.

There are no "standard" tests as every sketch is different so we have to decide what is important to measure for a particular sketch, but the basic groups are speed and accuracy.

For speed there are many possible measurements, but the basic ones are update speed, merge speed, Serialization / Deserialization speed, get estimate or get result speeds.

For accuracy we want to validate that the sketch is performing within the bounds of the theoretical error distribution.  We want to measure this accuracy in the context of a stand-alone, purely streaming sketch and also in the context of merging many sketches together.
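
As a rough illustration, a single accuracy trial amounts to something like the following (a real characterization study runs thousands of such trials and checks the observed errors against the theoretical rank-error bounds; this just compares one sketch's estimates to the exact quantiles of the same data):

    import numpy as np
    from datasketches import kll_floats_sketch

    data = np.random.standard_normal(2**20).astype(np.float32)

    sk = kll_floats_sketch(200)
    for v in data:
        sk.update(v)

    for r in (0.01, 0.25, 0.5, 0.75, 0.99):
        est = sk.get_quantile(r)
        exact = np.quantile(data, r)
        print('rank %.2f  sketch %+.4f  exact %+.4f' % (r, est, exact))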

We also try to do these same tests comparing the results against other alternatives users might have.  We have performed these same characterizations on other publicly available sketches as well as against traditional, brute-force approaches to solving the same problem.

For the solution you have developed, we would depend on you to decide what properties would be most important to characterize for users of this solution.  It should be very similar to what you would write in a paper describing this solution;  you want to convince the reader that this is very useful and why.

Since the first sketch you have leveraged is the KLL quantiles sketch, I would think you would want some characterizations similar to what we did for our studies (https://datasketches.apache.org/docs/Quantiles/KLLSketch.html) comparing our older quantiles sketch and the KLL sketch.

****
For the Java characterization tests, we have "standardized" on having small configuration files which define the key parameters of the test.  These are simple text files (https://github.com/apache/incubator-datasketches-characterization/tree/master/src/main/resources) of key-value pairs.  We don't have any centralized definition of these pairs, just that they are human readable and intelligible.  They are different for each type of sketch.

For the C++ tests, we don't have a collection of config files yet (this is one of our TODOs), but the same kind of parameters are set in the code itself.

We will likely want to set up a separate directory for your characterization tests.

I hope you find this helpful.

Cheers,

Lee.

On Sun, May 10, 2020 at 10:05 AM Michael Himes <mh...@knights.ucf.edu>> wrote:
The code is in a good state now.  It can take individual values, lists, or Numpy arrays as input, and it returns back Numpy arrays.  There are some additional features, like being able to specify which sketches the user wants to, e.g., get quantiles for.

But, I have only done minor testing with uniform and normal distributions.  I'd like to put it through more extensive testing (and some documentation) before releasing it, and it sounds like your characterization tests are the way to go -- it's not science if it's not reproducible!  Is there a standard set of tests for this purpose?  If not, are there standard tests that have been used for the existing codebase?

Michael
________________________________
From: leerho <le...@gmail.com>>
Sent: Saturday, May 9, 2020 7:21 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

This is great.  The first step is to get your project working!  Once you think you are ready, it would be really useful if you could do some characterization testing in the NumPy environment. Characterization tests are what we run to fully understand how a sketch performs over a range of parameters and using thousands to millions of trials.  You can see some of the accuracy and speed performance plots of various sketches on our website.  Sometimes these can take hours to run.  We typically use synthetic data to drive our characterization tests to make them reproducible.

Real data can also be used and one comparison test I would recommend is comparing how long it takes to get approximate results using sketches versus how long it would take to get exact results using brute force methods.  The bigger the data set is the better :)

We don't have much experience with NumPy so this will be a new environment for us.  But before you get too deep into this please get us involved.  We have been characterizing these streaming algorithms for a number of years, and would like to help you.

Cheers,

Lee.

On Sat, May 9, 2020 at 2:18 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
I'm not quite sure what being a committer entails, but yeah I'm happy to contribute.  I can't commit a lot of time to working on it, but with how things went for KLL I don't think it will take a lot of time for the other sketches if they are formatted in a similar manner.  Getting this library integrated into numpy/scipy would be awesome, I'm sure I could get some others in my field to begin using it.

Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Saturday, May 9, 2020 5:06 PM
To: Michael Himes <mh...@knights.ucf.edu>>; dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

This is just awesome!   Would you be interested in becoming a committer on our project?  It is not automatic, but we could work with you to bring you up to speed on the other sketches in the library.  If you could help us integrate DataSketches into NumPy and possibly SciPy (not sure if this is necessary) it would be a very significant contribution and we would definitely want you to be part of our community!

Thanks,

Lee.

On Sat, May 9, 2020 at 1:41 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Lee,

Thanks for the notice, I went ahead and subscribed to the list.

As for Jon's email, this is actually what I have currently implemented!  Once I finish ironing out a couple improvements, I'm going to move some code around to follow the existing coding style, put it on Github, and submit a pull request.

Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Saturday, May 9, 2020 4:22 PM
To: Michael Himes <mh...@knights.ucf.edu>>
Subject: Fwd: Permission to use KLL streaming-quantiles code in free open-source academic software

Hi Michael,
I don't think you saw this email as I doubt you are subscribed to our dev@datasketches.apache.org<ma...@datasketches.apache.org> email list.

We would like to have you as part of our larger community, as others might also have suggestions on how to move your project forward.
You can subscribe by sending an empty email to dev-subscribe@datasketches.apache.org<ma...@datasketches.apache.org>.

Lee.

---------- Forwarded message ---------
From: Jon Malkin <jo...@gmail.com>>
Date: Thu, May 7, 2020 at 4:11 PM
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software
To: <de...@datasketches.apache.org>>
Cc: Lee Rhodes <lr...@verizonmedia.com>>, Edo Liberty <ed...@gmail.com>>, edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>


We're using pybind11 to get a C++ interface with python (vs raw C). The wrappers themselves are quite thin, but they do have examples of calling functions defined in the wrapper as opposed to only the sketch object.

I believe the easiest way to do this will be to define a pretty simple C++ object and create a pybind wrapper for it.  That object would contain a std::vector<kll_sketch>.  Then you'd define an update method for your custom object that iterates through a numpy array and calls update() on the appropriate sketch. You'd also want to define something similar for get_quantile() or whatever other methods you need that iterates through that vector of sketches and returns the result in a numpy array.

That's a pretty lightweight object. And then you'd use a similar thin pybind wrapper around it to make it play nicely with python. Since our C++ library is just templates, you'd end up with a free-standing library, with no requirement that the base datasketches library be involved.
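
From the Python side, the intended usage would look roughly like the sketch below.  The class name kll_sketches and the constructor/method shapes are placeholders for illustration; only the plain kll_floats_sketch exists in the released wrapper today.

    import numpy as np
    from datasketches import kll_sketches      # hypothetical wrapper around a std::vector<kll_sketch>

    d = 1000                                   # number of independent streams (dimensions)
    sketches = kll_sketches(k=200, d=d)        # constructor shape is a placeholder

    for _ in range(10000):
        vec = np.random.random(d)              # one incoming vector
        sketches.update(vec)                   # the wrapper iterates the array on the C++ side

    medians = sketches.get_quantiles(0.5)      # expected: one estimate per dimension, as a numpy array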

  jon

On Thu, May 7, 2020 at 1:08 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
I would be happy to share whatever I come up with (if anything).  The lack of a Numpy/Scipy implementation is what led me to the DataSketches library; it would be very useful to me and others if it were a part of Numpy/Scipy.

For what it's worth, passing in a Numpy array and manipulating it from the C++ side is quite easy.  On the other hand, figuring out how to spawn m sketches and pass the values along to that looks like it'll be more challenging; there is a lot of code here, and it'll take some time for me to familiarize myself with it.

Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Thursday, May 7, 2020 12:00 PM
To: Michael Himes <mh...@knights.ucf.edu>>
Cc: Edo Liberty <ed...@gmail.com>>; dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>; edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

If you do figure out how to do this, it would be great if you could share it with us.  We would like to extend it to other sketches and submit it as an added functionality to NumPy.  I have been looking at the NumPy and SciPy libraries and have not found anything close to what we have.

Lee.


On Thu, May 7, 2020 at 7:08 AM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Lee, Jon,

Thanks for the information.  I tried to vectorize things this morning and ran into that exact problem -- since the offsets can differ, it leads to slices of different lengths, which wouldn't be possible to store as a single Numpy array.

Lee, your understanding of my problem is spot on.  n vectors of size m, where all m elements of each vector are a float (no NaNs or missing values).  I am interested in quantiles at rank r for each of the m streams.  Only 1 sketch will operate simultaneously, saving/loading the sketch is not required (though it would be a nice feature), and sketches would not need to be merged (no serialization/deserialization).

Not surprisingly, it looks like your original suggestion of handling this on the C++ side is the way to go.  Once I have time to dive into the code, my plan is to write something that implements what you described in the earlier email.

Thanks,
Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Wednesday, May 6, 2020 10:43 PM
To: Michael Himes <mh...@knights.ucf.edu>>
Cc: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>; Edo Liberty <ed...@gmail.com>>; edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>

Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Michael,

One of my colleagues, Jon Malkin, pointed out that the vector-KLL will not work for another reason: for each dimension, choosing whether to delete the odd or even values in the compactor must be random and independent of the other dimensions.  Otherwise you might get unwanted correlation effects between the dimensions.

This is another argument that you should have independent compactors for each dimension.  So you might as well stick with individual sketches for each dimension.
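
As a toy numpy illustration of Jon's point (this is not the library's compactor, just the independence requirement): even if all dimensions' level buffers were compacted at the same time, the odd/even choice would have to be drawn separately for each column.

    import numpy as np

    rng = np.random.default_rng()
    buf = np.sort(rng.normal(size=(8, 5)), axis=0)    # toy compactor level: 8 items for each of 5 dimensions
    offsets = rng.integers(0, 2, size=buf.shape[1])   # independent odd/even choice per dimension
    kept = np.stack([buf[offsets[j]::2, j] for j in range(buf.shape[1])], axis=1)
    # A single shared offset for every column would correlate the surviving items across dimensions.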

Lee.

On Wed, May 6, 2020 at 4:39 PM Lee Rhodes <lr...@verizonmedia.com>> wrote:
Michael,

Allow me to back up for a moment to make sure I understand your problem.

You have a large number of large vectors of the form V_n = {x_i}:  n vectors of size m, where x is a number and x_i is the ith element, or equivalently, the ith dimension.

Assumptions:

  *   All vectors, V, are of the same size m.
  *   All elements, x_i, are valid numbers of the same type. No missing values, and if you are using floats, this means no NaNs.

In aggregate, the n vectors represent m independent distributions of values.

Your task is to be able to obtain m quantiles at rank r in a single query.
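
For reference, when the data fits in memory the exact answer to this query is a one-liner in NumPy; the sketches exist to approximate it when it doesn't:

    import numpy as np

    n, m, r = 10000, 500, 0.5           # toy sizes; the real problem is n ~ 1e6-1e8, m ~ 1e4-1e5
    X = np.random.random((n, m))        # n vectors of length m
    exact = np.quantile(X, r, axis=0)   # m exact quantiles at rank r, one per dimension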

****
To do this, using your idea, would require vectorization of the entire sketch and not just the compactors.  The inputs are vectors, the result of operations such as getQuantile(r), getQuantileUpperBound(r), getQuantileLowerBound(r), are also vectors.

This sketch will be a large data structure, which leads to more questions ...

  *   Do you anticipate having many of these vectorized sketches operating simultaneously?
  *   Is there any requirement to store and later retrieve this sketch?
  *   Or, the nearly equivalent question: Do you require merging of these sketches (across clusters, for example)?  Which also means serialization and deserialization.

I am concerned that this vector-quantiles sketch would be limited in the sense that it may not be as widely applicable as it could be.

Our experience with real data is that it is ugly with missing values, NaN, nulls, etc.  Which means we would not be able to vectorize the compactor.  Each dimension i would need a separate independent compactor because the compaction times will vary depending on missing values or NaNs in the data.

Spacewise, I don't think having separate independent sketches for each dimension would be much smaller than vectorizing the entire sketch, because the internals of the existing sketch are already quite space efficient leveraging compact arrays, etc.

As a first step I would favor figuring out how to access the NumPy data structure on the C++ side, having individual sketches for each dimension, and doing the iterations updating the sketches in C++.   It also has the advantage of leveraging code that exists and it would automatically be able to leverage any improvements to the sketch code over time.  In addition, it could be a prototype of how to integrate other sketches into the NumPy ecosystem.

A fully vectorized sketch would be a separate implementation and would not be able to take advantage of these points.

Lee.

On Wed, May 6, 2020 at 2:47 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Lee,

I don't think there is a problem with the DataSketches library, just that it doesn't support what I am trying to do -- looking in the documentation, it only supports streams of ints or floats, and those situations work fine for me.  Here's what I did:
- began with the KLL test .py file: https://github.com/apache/incubator-datasketches-cpp/blob/master/python/tests/kll_test.py
- replaced line 30 with kll.update(np.ones(10) * randn())  to have a Numpy array of 10 identical values.
- ran the code

This leads to the following error, as expected:
TypeError: update(): incompatible function arguments. The following argument types are supported:
    1. (self: datasketches.kll_floats_sketch, item: float) -> None

Invoked with: <datasketches.kll_floats_sketch object at 0x7f1e128989d0>, array([-1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424,
       -1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424])

It's not coded to support Numpy arrays, so it complains.  What I would ideally like to have happen in this scenario is for it to treat each element in the array as a separate stream.  Then, later when getting a given quantile, it would give 10 values, one for each stream.  I don't see an easy approach to implementing this on the Python side besides a very slow iterative approach, and admittedly my C++ is quite rusty so I haven't looked into the codebase to see how I might modify things there to support this functionality.
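
For completeness, a self-contained version of that reproduction is just:

    import numpy as np
    from datasketches import kll_floats_sketch

    kll = kll_floats_sketch(160)
    try:
        kll.update(np.ones(10) * np.random.randn())   # a length-10 array, not a single float
    except TypeError as err:
        print(err)                                    # update() only accepts a single float item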

Re: the streaming-quantiles code being easily modified, I believe the only necessary changes would be changing the Compactor class to be a subclass of numpy.ndarray, rather than list, and implementing replacements for the list-specific methods that are used, like .append().  Then, it isn't necessary to loop over the streams since we can make use of Numpy's broadcasting, which will handle the looping in its C++ code, as you mentioned.  I'll work on this and see if it really is as straightforward as it seems.

If you have any advice on how to use DataSketches for my problem, I'm certainly open to that.

Thanks,
Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Wednesday, May 6, 2020 4:37 PM
To: Michael Himes <mh...@knights.ucf.edu>>; dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Cc: Edo Liberty <ed...@gmail.com>>; edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Michael,

Thank you for considering the DataSketches library.   I am adding this thread to our dev@datasketches.apache.org<ma...@datasketches.apache.org> so that our whole team can contribute to finding a solution for you.

WRT the error you experienced, please help us help you by sharing with us what the exact error was.

We are about to release a major upgrade to the DataSketches C++/Python product in the next few weeks.  We have fixed a number of stability issues and bugs, which may solve the problem.  Nonetheless, we want to work with you to get your problem solved.

Updating 1e5 sketches in a system is not a problem in Java or C++.   We have real-time systems today that generate and process over 1e9 sketches every day.  Unfortunately our experience tells us that looping in Python code will be 10 to 100 times slower than Java or C++.  This is because the code would have to switch from Python to C++ for every vector element.

By comparison, the streaming-quantiles code could be easily modified to use Numpy arrays and operate on vectors.

I would like to understand more about what you have in mind that would be "easily modified".

NumPy achieves its speed performance by doing all of the matrix operations in pre-compiled C++ code.  To achieve best performance, we would want to read and loop through the NumPy data structure on the C++ side leveraging the C++ DataSketches library directly.  I am not sure what would be involved to actually accomplish that.

But first we need to get your Python + NumPy code working correctly with our library so we can find out what its actual performance is.

Cheers,

Lee.





On Wed, May 6, 2020 at 12:10 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Edo, Lee,

Thanks for the prompt response.  I looked at the datasketches library, and while it seems to have a lot more features, it looks like it'll be a lot more difficult to get it to work for my desired use case.

My problem is that I need quantiles for each element of a vector (length on the order of 1e4 -- 1e5), for some finite stream of vectors (on the order of 1e6 -- 1e8).  I tried using datasketches's KLL with Numpy arrays, but it throws an error, so it doesn't seem like datasketches handles this situation currently.

To use datasketches, I think I would need to instantiate 1 object per vector element, and I suspect this will slow things down considerably due to iterating over the objects when each vector is processed.  By comparison, the streaming-quantiles code could be easily modified to use Numpy arrays and operate on vectors.  I ran a few unit tests on both codes and found equivalent behavior, as expected.

Do you have any recommendation(s) for this situation?  Are there known limitations of the streaming-quantiles code that would cause issues for my use case?  Are the other methods offered in datasketches 'better' than the KLL implemented in streaming-quantiles?  I'm quite out of my area of expertise, so I appreciate any advice you can offer, and I will of course acknowledge it in the publication.

Best,
Michael

________________________________
From: Edo Liberty <ed...@gmail.com>>
Sent: Tuesday, May 5, 2020 8:09 PM
To: Lee Rhodes <lr...@verizonmedia.com>>; Michael Himes <mh...@knights.ucf.edu>>
Cc: edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

+Lee

Hi Michael, Thanks for reaching out.
While you can certainly do that, I recommend using the Python-bound datasketches library. It will be more robust, faster, and bug-free than my code :)

On Tue, May 5, 2020 at 14:11 Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Edo,

I'm currently working on a Python package for machine-learning-accelerated exoplanet modeling.  It is free and open source (see here if you're curious https://github.com/exosports/HOMER), and it's meant purely for reproducible academic research.

I'm adding some new features to the software, and one of them requires computing quantiles for a data set that cannot fit into memory.  After searching around for different methods to do this, your KLL method seemed to be a good option in terms of speed and space requirements.

Rather than reinvent the wheel and code my own implementation of the method from scratch, I was wondering if you'd be willing to allow me to use your code?  I don't see a license, so I wanted to make sure you're okay with this.  I could implement it as a submodule within my repo, or I could only include the kll.py file and add some additional comments pointing to your repo and such, whichever you prefer.

Best,
Michael
--
From my cell phone.

Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Posted by Jon Malkin <jo...@gmail.com>.
Well, one thought was maybe we could always use the vectorized kll in
python and make it (relatively) easy to have it work with only 1 dimension.
It looks like there's still a non-trivial performance hit from that. But
wow.. I realized I could try something simple like reversing the
declaration order of single-update vs vector-update in the wrapper class.
And that dropped it to 35s!

With that, it may be worth exploring a unified wrapper that handles single
items or vectors.

  jon

On Tue, May 19, 2020 at 1:52 PM leerho <le...@gmail.com> wrote:

> We had a similar issue in Java trying to use JNI to access C code.  Every
> transition across the "boundary" between Java and C took from 10 to 100
> microseconds.  This made the JNI option pretty useless from our
> standpoint.
>
> I don't know python that well, but I could well imagine that there may be
> a similar issue here in moving data between Python and C++.
>
> That being said, compared to brute-force computation of these types of
> queries in Python vs using even these (what we consider slow performing)
> sketches in Python still may be a huge win.
>
> Lee.
>
>
>
> On Tue, May 19, 2020 at 1:28 PM Jon Malkin <jo...@gmail.com> wrote:
>
>> I tried comparing the performance of the existing floats sketch vs the
>> new thing with a single dimension. And then I made a second update method
>> that handles a single item rather than creating an array of length 1 each
>> time. Otherwise, the scripts were as identical as possible. I fed in 2^25
>> gaussian-distributed values and queried for the median to force some
>> computation on the sketch. I think get_quantile(0.5) vs
>> get_quantiles(0.5)[0][0] was the only difference.
>>
>> Existing kll_floats_sketch: 31s
>> kll_floatarray_sketches: 123s
>> with single-item update: 80s
>>
>> Same test in c++: 1.7s  (I can get it to 1.4s but that's using a worse
>> RNG so this seemed more fair)
>>
>> I didn't try anything with trying to batch updates, even though in theory
>> the new object can support that. This was more a test to see the
>> performance impact of using it for all kll sketches.
>>
>> At some level, if you're already ok taking the speed hit for python vs
>> C++ then maybe it doesn't matter. But >2x still seems significant to me.
>>
>>   jon
>>
>> On Thu, May 14, 2020 at 6:54 PM Michael Himes <mh...@knights.ucf.edu>
>> wrote:
>>
>>> Great, I'll be submitting the pull request shortly.  The codebase I'm
>>> working with doesn't have any of the changes made in the past week or so;
>>> hopefully that isn't too much of a hassle to merge.
>>>
>>> As an aside, my employer encourages us to contribute code to libraries
>>> like this, so I'm happy to work on additional features for the Python
>>> interface as needed.
>>>
>>> Michael
>>> ------------------------------
>>> *From:* Jon Malkin <jo...@gmail.com>
>>> *Sent:* Thursday, May 14, 2020 6:56 PM
>>> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
>>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>>> open-source academic software
>>>
>>> We've been polishing things up for a release, so that was one of several
>>> things that we fixed over the last several days. Thank you for finding it!
>>>
>>> Anyway, if you're generally happy with the state of things (and are
>>> allowed to under any employment terms), I'd encourage you to create a pull
>>> request to merge your changes into the main repo. It doesn't need to be
>>> perfect as we can always make changes as part of the PR review or
>>> post-merge.
>>>
>>> Thanks,
>>>   jon
>>>
>>>
>>> On Mon, May 11, 2020 at 2:25 PM Michael Himes <mh...@knights.ucf.edu>
>>> wrote:
>>>
>>> Thanks for taking a look, Jon.
>>>
>>> I pushed an update that addresses 2 & 4.
>>>
>>> #3 is actually something I had a question about. I've tested passing
>>> numpy.nan into the update function, and it doesn't appear to break anything
>>> (min, max, etc all still work correctly).  However, the reported number of
>>> items per sketch counts the nan entries.  Is this the expected behavior, or
>>> should the get_n() method return a number that does not count the nans it
>>> has seen?  I expected the latter, so I'm worried that numpy's nan is being
>>> treated differently.
>>>
>>> Michael
>>> ------------------------------
>>> *From:* Jon Malkin <jo...@gmail.com>
>>> *Sent:* Monday, May 11, 2020 4:32 PM
>>> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
>>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>>> open-source academic software
>>>
>>> I didn't look in super close detail, but the code overall looks pretty
>>> good. Comments are below.
>>>
>>> Note that not all of these necessarily need changes or replies. I'm just
>>> trying to document things we'll want to think about for keeping the library
>>> general-purpose (and we can always make changes after merging, of course).
>>>
>>> 1. I worry the name kll_sketches is confusingly similar to kll_sketch.
>>> Maybe vector_kll_sketches? But if there's a way to extend KLL in the future
>>> to operate on an entire vector at a time (vs treating each dimension
>>> independently) that'd become confusing. I think an inherently vectorized
>>> version would be a very different beast, but I always worry I'm not being
>>> imaginative enough. If merging into the Apache codebase, I'd probably wait
>>> to see what the file looks like with the renaming before a final decision
>>> on moving to its own file.
>>>
>>> 2. What happens if the input to update() has >2 dimensions? If that'd be
>>> invalid, we should explicitly check and complain. If it'll Do The Right
>>> Thing by operating on the first 2 dimensions (meaning correct indices)
>>> that's fine, but otherwise should probably complain.
>>>
>>> 3. Can this handle sparse input vectors? Not sure how important that is
>>> in general, even if your project doesn't require it. kll_sketch will ignore
>>> NaNs, so those appearing would mean the number of items per sketch can
>>> already differ.
>>>
>>> 4. I'd probably eat the very slightly increased space and go with 32
>>> bits for the number of dimensions (aka number of sketches). If trying to
>>> look at a distribution of values for some machine learning application,
>>> it'd be easy to overflow 65k dimensions for some tasks.
>>>
>>> 5. I imagine you've realized that it's easiest to do unit tests from
>>> python in this case. That's another advantage of having this live in the
>>> wrapper.
>>>
>>> 6. Finally, that assert issue is already obsolete :). Asserts were
>>> converted if/throw exceptions late last week. It'll be flagged as a
>>> conflict in merging, so no worries for now.
>>>
>>> Looking good at this point. And as I said, not all of these need changes
>>> or comments from you.
>>>
>>>   jon
>>>
>>> On Mon, May 11, 2020 at 7:09 AM Michael Himes <mh...@knights.ucf.edu>
>>> wrote:
>>>
>>> Understood, I went ahead and moved the new class to the kll_wrapper.cpp
>>> file -- I'll leave it to you to decide if it's better as its own file.
>>>
>>> Also, while gcc 7.4.0 compiles the code without issue, using gcc 7.5.0
>>> throws errors regarding the assert calls in kll_sketch_impl.hpp.  I added
>>> an include of assert.h there and then it compiled without issue.  It's
>>> possible that other compilers will also complain about that, so maybe this
>>> is a good update to the main branch.
>>>
>>> Michael
>>> ------------------------------
>>> *From:* Jon Malkin <jo...@gmail.com>
>>> *Sent:* Sunday, May 10, 2020 10:47 PM
>>> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
>>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>>> open-source academic software
>>>
>>> My only comment without having looked at actual code is that the new
>>> class would be more appropriate in the python wrapper. Maybe even drop it
>>> in as its own file, as that would decrease recompile time a bit when
>>> debugging (that's pybind's suggestion, anyway). Probably not a huge
>>> difference with how light these wrappers are.
>>>
>>> If this is something that becomes widely used, to where we look at
>>> pushing it into the base library, we'd look at whether we could share any
>>> data across sketches. But we're far from that point currently. It'd be nice
>>> to need to consider that.
>>>
>>>   jon
>>>
>>> On Sun, May 10, 2020, 7:33 PM leerho <le...@gmail.com> wrote:
>>>
>>> Michael,  this has been a great interchange and certainly will allow us
>>> to move forward more quickly.
>>>
>>> Thank you for working on this on a Mother's Day Sunday!
>>>
>>> I'm sure Alex and Jon may have more questions, when they get a chance to
>>> look at it starting tomorrow.
>>>
>>> Cheers, and be safe and well!
>>>
>>> Lee.
>>>
>>> On Sun, May 10, 2020 at 6:25 PM Michael Himes <mh...@knights.ucf.edu>
>>> wrote:
>>>
>>> Re: testing, so far I've just done glorified unit tests for uniform and
>>> normal distributions of varying sizes.  I plan to do some timing tests vs
>>> the existing single-sketch Python class to see how it compares for 1, 10,
>>> and 100 streams.
>>>
>>> 1. That makes sense.  One option to allow full Numpy compatibility but
>>> without requiring a Python user to use Numpy would be to return everything
>>> as lists, rather than Numpy arrays.  Numpy users could then convert those
>>> lists into arrays, and non-Numpy users would be unaffected (aside from
>>> needing the pybind11/numpy.h header).  Alternatively, some flag could be
>>> set when instantiating the object that would control whether things are
>>> returned as lists or arrays, though this still requires the numpy.h header
>>> file.
>>>
>>> 2. I didn't change the kll_sketch code; I only defined a new (wrapper)
>>> class called kll_sketches, which spawns a user-specified number of
>>> sketches.  Each of those sketches is a kll_sketch object and uses all of
>>> the existing code for that.  For fast execution in Python, the parallel
>>> sketches must be spawned in C++, but the existing Python object could only
>>> spawn a single sketch since it wraps the kll_sketch class.  Perhaps the
>>> kll_sketches class would be better placed in the python/src/kll_wrapper.cpp
>>> file?  I suppose you wouldn't need this class if you weren't using Python.
>>>
>>> 3. Yes, SerDe is very straight-forward here.  I've marked some stuff as
>>> todo's, and that is one of them -- the plan is to do like you described and
>>> call the relevant kll_sketch method on each of the sketches and return that
>>> to Python in a sensible format.  For deserialization, it would just iterate
>>> through them and load them into the kll_sketches object.  I don't require
>>> it for my project, so I didn't bother to wrap that yet -- I'll take a look
>>> sometime this week after I finish my work for the day, shouldn't take long
>>> to do.
>>>
>>> 4. That makes sense.  Does using Numpy complicate that at all?  My
>>> thought is that since under the hood everything is using the existing
>>> kll_sketch class, it would have full compatibility with the rest of the
>>> library (once SerDe is added in).
>>>
>>> Michael
>>> ------------------------------
>>> *From:* leerho <le...@gmail.com>
>>> *Sent:* Sunday, May 10, 2020 8:42 PM
>>> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
>>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>>> open-source academic software
>>>
>>> Thanks for the link to your code.  My colleagues, Jon and Alex, will
>>> take a closer look this next week.  They wrote this code so they are much
>>> closer to it than I.
>>>
>>> What you have done so far makes sense for you as you want to get this
>>> working in the NumPy environment as quickly as possible.  As soon as we
>>> start thinking about incorporating this into our library other concerns
>>> become important.
>>>
>>> 1. Adding API calls is the recommended way to add functionality (like
>>> NumPy) to a library.  We cannot change API calls in a way that is only
>>> useful with NumPy, because it would seriously impact other users of the
>>> library that don't need NumPy.  If both sets of calls cannot simultaneously
>>> exist in the same sketch API, then we need to consider other alternatives.
>>>
>>> 2.  Based on our previous discussions, I didn't envision that you would
>>> have to change the kll_sketch code itself other than perhaps a "wrapper"
>>> class that enables vectorized input to a vector of sketches and a
>>> vectorized get result that creates a vector result from a vector of
>>> sketches.  This would isolate the changes you need for NumPy from the
>>> sketch itself.  This is also much easier to support, maintain and debug.
>>>
>>> 3. If you don't change the internals of the sketch then SerDe becomes
>>> pretty straightforward. I don't know if you need a single serialization
>>> that represents a full vector of sketches,  but if you do, then I would
>>> just iterate over the individual serdes and figure out how to package it.
>>> I really don't think you want to have to rewrite this low-level stuff.
>>>
>>> 4. Binary compatibility is critically important for us and I think will
>>> be important for you as well.  There are two dimensions of binary
>>> compatibility: history and language.  This means that a kll sketch
>>> serialized from Java can be successfully read by C++ and vice versa.
>>> Similarly, a kll sketch serialized today will be able to be read many years
>>> from now.     Another aspect of this would mean being able to collect, say,
>>> 100 sketches that were not created using the NumPy version, and being able
>>> to put them together in a NumPy vector; and vice versa.
>>>
>>> I hope all of this makes sense to you.
>>>
>>> Cheers,
>>>
>>> Lee.
>>>
>>>
>>>
>>> On Sun, May 10, 2020 at 4:21 PM leerho <le...@gmail.com> wrote:
>>>
>>> Michael,
>>> This is great!  What testing have you been able to do so far?
>>>
>>>
>>> On Sun, May 10, 2020 at 3:31 PM Michael Himes <mh...@knights.ucf.edu>
>>> wrote:
>>>
>>> Lee,
>>>
>>> Thanks for all of that information, it's quite helpful to get a better
>>> understanding of things.
>>>
>>> I've put the code on Github if you'd like to take a look:
>>> https://github.com/mdhimes/incubator-datasketches-cpp
>>>
>>> Changes are
>>> - new class in kll/include/kll_sketch.hpp, w/ associated constructor in
>>> kll/include/kll_sketch_impl.hpp.  This class spawns a specified number of
>>> sketches.
>>> - new Python interface functions in python/src/kll_wrapper.cpp
>>>
>>> The only new dependency introduced is the pybind11/numpy.h header file.
>>> The new Numpy-compatible Python classes retain identical functionality to
>>> the existing classes (with minor changes to method names, e.g.,
>>> get_min_value --> get_min_values), except that I have not yet implemented
>>> merging or (de)serialization.  These would be straight-forward to
>>> implement, if needed.
>>>
>>> Re: characterization tests, I'll take a look at those tests you linked
>>> to and see about running them, time and compute resources permitting.
>>>
>>> Michael
>>> ------------------------------
>>> *From:* leerho <le...@gmail.com>
>>> *Sent:* Sunday, May 10, 2020 5:32 PM
>>> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
>>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>>> open-source academic software
>>>
>>> Michael,
>>>
>>> Is there a place on GitHub somewhere where I could look at your code so
>>> far?  The reason I ask, is before you do a PR, we would like to determine
>>> where a contribution such as this should be placed.
>>>
>>> Our library is split up among different repositories, determined by
>>> language and dependencies.  This keeps the user downloads smaller and more
>>> focused.   We have two library repos for the core sketch algorithms, one
>>> for Java and one for C++/Python, where the dependencies are very lean,
>>> which simplifies integration into other systems.  We have separate repos
>>> for adaptors, which depend on one of the core repos. On the Java side, we
>>> have separate repos for adaptors for Apache Hive and Apache Pig, as the
>>> dependencies for each of these are quite large.  For C++, we have a
>>> dedicated repo for the adaptors for PostgreSQL.
>>>
>>> Some of our adaptors are hosted with the target system.  For example,
>>> our Druid adaptors were contributed directly into Apache Druid.
>>>
>>> I assume your code has dependencies on Python, NumPy and
>>> DataSketches-cpp. It is not clear to me at the moment whether we should
>>> create a separate repo for this or have a separate group of directories in
>>> our cpp repo.
>>>
>>> ****
>>> We have a separate repo for our characterization code, which is not
>>> formally "released" as an Apache release.  It exists because we want others
>>> to be able to reproduce (or challenge) our claims of speed performance or
>>> accuracy.  It is the one repo where we have all languages and many
>>> different dependencies.  The coding style is not as rigorous or as well
>>> documented as our repos that do have formal releases.
>>>
>>> Characterization testing is distinctly different from Unit Tests, which
>>> basically check all the main code paths and make sure that the program
>>> works as it should.  The key metric is code coverage, and Unit Tests should
>>> be fast, as they are run on every check-in of new code.  Characterization is
>>> also different from Integration Testing, which is testing how well the code
>>> works when integrated into larger systems.
>>>
>>> Characterization tests are unique to our kind of library. Because our
>>> algorithms are probabilistic in nature, in order to verify accuracy or
>>> speed performance we need to run many thousands of trials to eliminate
>>> statistical noise in the results.  And when the data is large, this can
>>> take a long time.  You can peruse our website for many examples as all the
>>> plots result from various characterization studies.  What appears on the
>>> website is but a small fraction of all the testing we have done.
>>>
>>> There are no "standard" tests as every sketch is different so we have to
>>> decide what is important to measure for a particular sketch, but the basic
>>> groups are *speed* and *accuracy*.
>>>
>>> For speed there are many possible measurements, but the basic ones are
>>> update speed, merge speed, Serialization / Deserialization speed, get
>>> estimate or get result speeds.
>>>
>>> For accuracy we want to validate that the sketch is performing within
>>> the bounds of the theoretical error distribution.  We want to measure this
>>> accuracy in the context of a stand-alone, purely streaming sketch and also
>>> in the context of merging many sketches together.
>>>
>>> We also try to do these same tests comparing the results against other
>>> alternatives users might have.  We have performed these same
>>> characterizations on other publicly available sketches as well as against
>>> traditional, brute-force approaches to solving the same problem.
>>>
>>> For the solution you have developed, we would depend on you to decide
>>> what properties would be most important to characterize for users of this
>>> solution.  It should be very similar to what you would write in a paper
>>> describing this solution;  you want to convince the reader that this is
>>> very useful and why.
>>>
>>> Since the first sketch you have leveraged is the KLL quantiles sketch, I
>>> would think you would want some characterizations similar to what we did
>>> for our studies
>>> <https://datasketches.apache.org/docs/Quantiles/KLLSketch.html>
>>> comparing our older quantiles sketch and the KLL sketch.
>>>
>>> ****
>>> For the Java characterization tests, we have "standardized" on having
>>> small configuration files which define the key parameters of the test.
>>> These are simple text files
>>> <https://github.com/apache/incubator-datasketches-characterization/tree/master/src/main/resources>
>>> of key-value pairs.  We don't have any centralized definition of these
>>> pairs, just that they are human readable and intelligible.  They are
>>> different for each type of sketch.
>>>
>>> For the C++ tests, we don't have a collection of config files yet (this
>>> is one of our TODOs), but the same kind of parameters are set in the code
>>> itself.
>>>
>>> We will likely want to set up a separate directory for your
>>> characterization tests.
>>>
>>> I hope you find this helpful.
>>>
>>> Cheers,
>>>
>>> Lee.
>>>
>>> On Sun, May 10, 2020 at 10:05 AM Michael Himes <mh...@knights.ucf.edu>
>>> wrote:
>>>
>>> The code is in a good state now.  It can take individual values, lists,
>>> or Numpy arrays as input, and it returns back Numpy arrays.  There are some
>>> additional features, like being able to specify which sketches the user
>>> wants to, e.g., get quantiles for.
>>>
>>> But, I have only done minor testing with uniform and normal
>>> distributions.  I'd like to put it through more extensive testing (and some
>>> documentation) before releasing it, and it sounds like your
>>> characterization tests are the way to go -- it's not science if it's not
>>> reproducible!  Is there a standard set of tests for this purpose?  If not,
>>> are there standard tests that have been used for the existing codebase?
>>>
>>> Michael
>>> ------------------------------
>>> *From:* leerho <le...@gmail.com>
>>> *Sent:* Saturday, May 9, 2020 7:21 PM
>>> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
>>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>>> open-source academic software
>>>
>>> This is great.  The first step is to get your project working!  Once you
>>> think you are ready, it would be really useful if you could do some
>>> characterization testing in the NumPy environment. Characterization tests
>>> are what we run to fully understand how a sketch performs over a range of
>>> parameters and using thousands to millions of trials.  You can see some of
>>> the accuracy and speed performance plots of various sketches on our
>>> website.  Sometimes these can take hours to run.  We typically use
>>> synthetic data to drive our characterization tests to make them
>>> reproducible.
>>>
>>> Real data can also be used and one comparison test I would recommend is
>>> comparing how long it takes to get approximate results using sketches
>>> versus how long it would take to get exact results using brute force
>>> methods.  The bigger the data set is the better :)
>>>
>>> We don't have much experience with NumPy so this will be a new
>>> environment for us.  But before you get too deep into this please get us
>>> involved.  We have been characterizing these streaming algorithms for a
>>> number of years, and would like to help you.
>>>
>>> Cheers,
>>>
>>> Lee.
>>>
>>> On Sat, May 9, 2020 at 2:18 PM Michael Himes <mh...@knights.ucf.edu>
>>> wrote:
>>>
>>> I'm not quite sure what being a committer entails, but yeah I'm happy to
>>> contribute.  I can't commit a lot of time to working on it, but with how
>>> things went for KLL I don't think it will take a lot of time for the other
>>> sketches if they are formatted in a similar manner.  Getting this library
>>> integrated into numpy/scipy would be awesome, I'm sure I could get some
>>> others in my field to begin using it.
>>>
>>> Michael
>>> ------------------------------
>>> *From:* Lee Rhodes <lr...@verizonmedia.com>
>>> *Sent:* Saturday, May 9, 2020 5:06 PM
>>> *To:* Michael Himes <mh...@knights.ucf.edu>;
>>> dev@datasketches.apache.org <de...@datasketches.apache.org>
>>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>>> open-source academic software
>>>
>>> This is just awesome!   Would you be interested in becoming a committer
>>> on our project?  It is not automatic, but we could work with you to bring
>>> you up to speed on the other sketches in the library.  If you could help us
>>> integrate DataSketches into NumPy and possibly SciPy (not sure if this is
>>> necessary) it would be a very significant contribution and we would
>>> definitely want you to be part of our community!
>>>
>>> Thanks,
>>>
>>> Lee.
>>>
>>> On Sat, May 9, 2020 at 1:41 PM Michael Himes <mh...@knights.ucf.edu>
>>> wrote:
>>>
>>> Hi Lee,
>>>
>>> Thanks for the notice, I went ahead and subscribed to the list.
>>>
>>> As for Jon's email, this is actually what I have currently implemented!
>>> Once I finish ironing out a couple improvements, I'm going to move some
>>> code around to follow the existing coding style, put it on Github, and
>>> submit a pull request.
>>>
>>> Michael
>>> ------------------------------
>>> *From:* Lee Rhodes <lr...@verizonmedia.com>
>>> *Sent:* Saturday, May 9, 2020 4:22 PM
>>> *To:* Michael Himes <mh...@knights.ucf.edu>
>>> *Subject:* Fwd: Permission to use KLL streaming-quantiles code in free
>>> open-source academic software
>>>
>>> Hi Michael,
>>> I don't think you saw this email as I doubt you are subscribed to our
>>> dev@datasketches.apache.org email list.
>>>
>>> We would like to have you as part of our larger community, as others
>>> might also have suggestions on how to move your project forward.
>>> You can subscribe by sending an empty email to
>>> dev-subscribe@datasketches.apache.org.
>>>
>>> Lee.
>>>
>>> ---------- Forwarded message ---------
>>> From: *Jon Malkin* <jo...@gmail.com>
>>> Date: Thu, May 7, 2020 at 4:11 PM
>>> Subject: Re: Permission to use KLL streaming-quantiles code in free
>>> open-source academic software
>>> To: <de...@datasketches.apache.org>
>>> Cc: Lee Rhodes <lr...@verizonmedia.com>, Edo Liberty <
>>> edo.liberty@gmail.com>, edo@edoliberty.com <ed...@edoliberty.com>
>>>
>>>
>>> We're using pybind11 to get a C++ interface with python (vs raw C). The
>>> wrappers themselves are quite thin, but they do have examples of calling
>>> functions defined in the wrapper as opposed to only the sketch object.
>>>
>>> I believe the easiest way to do this will be to define a pretty simple
>>> C++ object and create a pybind wrapper for it.  That object would contain a
>>> std::vector<kll_sketch>.  Then you'd define an update method for your
>>> custom object that iterates through a numpy array and calls update() on the
>>> appropriate sketch. You'd also want to define something similar for
>>> get_quantile() or whatever other methods you need that iterates through
>>> that vector of sketches and returns the result in a numpy array.
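To make the intended semantics concrete, here is a pure-Python stand-in for the object described above; the class name is made up, and a real implementation would keep the std::vector<kll_sketch> on the C++ side so these loops run in compiled code rather than in Python.

import numpy as np
from datasketches import kll_floats_sketch

class vector_of_kll_floats_sketches:
    # One independent KLL sketch per vector dimension (illustrative only).
    def __init__(self, k, num_dims):
        self.sketches = [kll_floats_sketch(k) for _ in range(num_dims)]

    def update(self, values):
        # values: array-like of length num_dims, one item per stream
        for sketch, value in zip(self.sketches, np.asarray(values, dtype=np.float64)):
            sketch.update(float(value))

    def get_quantiles(self, rank):
        # Estimated quantile at the given rank for every stream, as a NumPy array.
        return np.array([sketch.get_quantile(rank) for sketch in self.sketches])

Moving exactly this iteration into the pybind11 wrapper is what removes the per-element Python-to-C++ crossing discussed elsewhere in the thread.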
>>>
>>> That's a pretty lightweight object. And then you'd use a similar thin
>>> pybind wrapper around it to make it play nicely with python. Since our C++
>>> library is just templates, you'd end up with a free-standing library, with
>>> no requirement that the base datasketches library be involved.
>>>
>>>   jon
>>>
>>> On Thu, May 7, 2020 at 1:08 PM Michael Himes <mh...@knights.ucf.edu>
>>> wrote:
>>>
>>> I would be happy to share whatever I come up with (if anything).  The
>>> lack of a Numpy/Scipy implementation is what led me to the DataSketches
>>> library, it would be very useful to myself and others if it were a part of
>>> Numpy/Scipy.
>>>
>>> For what it's worth, passing in a Numpy array and manipulating it from
>>> the C++ side is quite easy.  On the other hand, figuring out how to spawn m
>>> sketches and pass the values along to that looks like it'll be more
>>> challenging, there is a lot of code here and it'll take some time for me to
>>> familiarize myself with it.
>>>
>>> Michael
>>> ------------------------------
>>> *From:* Lee Rhodes <lr...@verizonmedia.com>
>>> *Sent:* Thursday, May 7, 2020 12:00 PM
>>> *To:* Michael Himes <mh...@knights.ucf.edu>
>>> *Cc:* Edo Liberty <ed...@gmail.com>; dev@datasketches.apache.org <
>>> dev@datasketches.apache.org>; edo@edoliberty.com <ed...@edoliberty.com>
>>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>>> open-source academic software
>>>
>>> If you do figure out how to do this, it would be great if you could
>>> share it with us.  We would like to extend  it to other sketches and submit
>>> it as an added functionality to NumPy.  I have been looking at the NumPy
>>> and SciPy libraries and have not found anything close to what we have.
>>>
>>> Lee.
>>>
>>>
>>> On Thu, May 7, 2020 at 7:08 AM Michael Himes <mh...@knights.ucf.edu>
>>> wrote:
>>>
>>> Hi Lee, Jon,
>>>
>>> Thanks for the information.  I tried to vectorize things this morning
>>> and ran into that exact problem -- since the offsets can differ, it leads
>>> to slices of different lengths, which wouldn't be possible to store as a
>>> single Numpy array.
>>>
>>> Lee, your understanding of my problem is spot on.  n vectors of size m,
>>> where all m elements of each vector are a float (no NaNs or missing
>>> values).  I am interested in quantiles at rank r for each of the m
>>> streams.  Only 1 sketch will operate simultaneously, saving/loading the
>>> sketch is not required (though it would be a nice feature), and sketches
>>> would not need to be merged (no serialization/deserialization).
>>>
>>> Not surprisingly, it looks like your original suggestion of handling
>>> this on the C++ side is the way to go.  Once I have time to dive into the
>>> code, my plan is to write something that implements what you described in
>>> the earlier email.
>>>
>>> Thanks,
>>> Michael
>>> ------------------------------
>>> *From:* Lee Rhodes <lr...@verizonmedia.com>
>>> *Sent:* Wednesday, May 6, 2020 10:43 PM
>>> *To:* Michael Himes <mh...@knights.ucf.edu>
>>> *Cc:* dev@datasketches.apache.org <de...@datasketches.apache.org>; Edo
>>> Liberty <ed...@gmail.com>; edo@edoliberty.com <ed...@edoliberty.com>
>>>
>>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>>> open-source academic software
>>>
>>> Michael,
>>>
>>> One of my colleagues, Jon Malkin, pointed out that the vector-KLL will
>>> not work for another reason and that is for each dimension, choosing
>>> whether to delete the odd or even values in the compactor must be random
>>> and independent of the other dimensions.  Otherwise you might get unwanted
>>> correlation effects between the dimensions.
>>>
>>> This is another argument that you should have independent compactors for
>>> each dimension.  So you might as well stick with individual sketches for
>>> each dimension.
>>>
>>> Lee.
>>>
>>> On Wed, May 6, 2020 at 4:39 PM Lee Rhodes <lr...@verizonmedia.com>
>>> wrote:
>>>
>>> Michael,
>>>
>>> Allow me to back up for a moment to make sure I understand your problem.
>>>
>>> You have a large number of large vectors of the form *V_n = {x_i}:*  *n*
>>> vectors of size *m, *where *x* is a *number* and *x_i* is the *i*th
>>> element, or equivalently, the *i*th dimension.
>>>
>>> Assumptions:
>>>
>>>    - All vectors, *V*, are of the same size *m.*
>>>    - All elements, *x_i*, are valid numbers of the same type. No
>>>    missing values, and if you are using *floats*, this means no *NaN*s.
>>>
>>> In aggregate, the *n* vectors represent *m* *independent* distributions
>>> of values.
>>>
>>> Your task is to be able to obtain *m* quantiles at rank *r* in a single
>>> query.
>>>
>>> ****
>>> To do this, using your idea, would require vectorization of the entire
>>> sketch and not just the compactors.  The inputs are vectors, the result of
>>> operations such as getQuantile(r), getQuantileUpperBound(r),
>>> getQuantileLowerBound(r), are also vectors.
>>>
>>> This sketch will be a large data structure, which leads to more
>>> questions ...
>>>
>>>    - Do you anticipate having many of these vectorized sketches
>>>    operating simultaneously?
>>>    - Is there any requirement to store and later retrieve this sketch?
>>>    - Or, the nearly equivalent question: Do you require merging
>>>    of these sketches (across clusters, for example)?  Which also means
>>>    serialization and deserialization.
>>>
>>> I am concerned that this vector-quantiles sketch would be limited in the
>>> sense that it may not be as widely applicable as it could be.
>>>
>>> Our experience with real data is that it is ugly with missing values,
>>> NaN, nulls, etc., which means we would not be able to vectorize the
>>> compactor.  Each dimension *i* would need a separate independent
>>> compactor because the compaction times will vary depending on missing
>>> values or NaNs in the data.
>>>
>>> Spacewise, I don't think having separate independent sketches for each
>>> dimension would be much smaller than vectorizing the entire sketch, because
>>> the internals of the existing sketch are already quite space efficient
>>> leveraging compact arrays, etc.
>>>
>>> As a first step I would favor figuring out how to access the NumPy data
>>> structure on the C++ side, having individual sketches for each
>>> dimension, and doing the iterations updating the sketches in C++.   It also
>>> has the advantage of leveraging code that exists and it would automatically
>>> be able to leverage any improvements to the sketch code over time.  In
>>> addition, it could be a prototype of how to integrate other sketches into
>>> the NumPy ecosystem.
>>>
>>> A fully vectorized sketch would be a separate implementation and would
>>> not be able to take advantage of these points.
>>>
>>> Lee.
>>>
>>> On Wed, May 6, 2020 at 2:47 PM Michael Himes <mh...@knights.ucf.edu>
>>> wrote:
>>>
>>> Hi Lee,
>>>
>>> I don't think there is a problem with the DataSketches library, just
>>> that it doesn't support what I am trying to do -- looking in the
>>> documentation, it only supports streams of ints or floats, and those
>>> situations work fine for me.  Here's what I did:
>>> - began with the KLL test .py file:
>>> https://github.com/apache/incubator-datasketches-cpp/blob/master/python/tests/kll_test.py
>>> - replaced line 30 with kll.update(np.ones(10) * randn())  to have a
>>> Numpy array of 10 identical values.
>>> - ran the code
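Put together, a minimal standalone reproduction of the above looks like this (the k value is arbitrary):

import numpy as np
from numpy.random import randn
from datasketches import kll_floats_sketch

kll = kll_floats_sketch(200)         # any k reproduces the issue
kll.update(np.ones(10) * randn())    # raises the TypeError shown below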
>>>
>>> This leads to the following error, as expected:
>>> TypeError: update(): incompatible function arguments. The following
>>> argument types are supported:
>>>     1. (self: datasketches.kll_floats_sketch, item: float) -> None
>>>
>>> Invoked with: <datasketches.kll_floats_sketch object at 0x7f1e128989d0>,
>>> array([-1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424,
>>>        -1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424])
>>>
>>> It's not coded to support Numpy arrays, therefore it complains.  What I
>>> would ideally like to have happen in this scenario is it would treat each
>>> element in the array as a separate stream.  Then, later when getting a
>>> given quantile, it would give 10 values, one for each stream.  I don't see
>>> an easy approach to implementing this on the Python side besides a very
>>> slow iterative approach, and admittedly my C++ is quite rusty so I haven't
>>> looked into the codebase to see how I might modify things there to support
>>> this functionality.
>>>
>>> Re: the streaming-quantiles code being easily modified, I believe the
>>> only necessary changes would be changing the Compactor class to be a
>>> subclass of numpy.ndarray, rather than list, and implementing the
>>> list-specific methods that are used, like .append().  Then, it isn't
>>> necessary to loop over the streams since we can make use of Numpy's
>>> broadcasting, which will handle the looping in its C++ code, as you
>>> mentioned.  I'll work on this and see if it really is as straight-forward
>>> as it seems.
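For illustration, the vectorized compaction step could look roughly like the function below, written against a plain 2-D buffer rather than an actual numpy.ndarray subclass. Note that sharing a single odd/even offset across all streams, as this sketch does, is exactly the shortcut questioned elsewhere in this thread.

import numpy as np

def compact(buffer, rng=None):
    # KLL-style compaction of a (num_items x num_streams) buffer: sort each
    # stream's column, then keep every other row starting from a random offset.
    # Illustrative only -- a correct multi-stream sketch would need the
    # odd/even choice to be made independently for each column.
    rng = rng if rng is not None else np.random.default_rng()
    buffer = np.sort(buffer, axis=0)   # sort each column (stream) independently
    offset = rng.integers(0, 2)        # 0 or 1, shared across all columns here
    return buffer[offset::2]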
>>>
>>> If you have any advice on how to use DataSketches for my problem, I'm
>>> certainly open to that.
>>>
>>> Thanks,
>>> Michael
>>> ------------------------------
>>> *From:* Lee Rhodes <lr...@verizonmedia.com>
>>> *Sent:* Wednesday, May 6, 2020 4:37 PM
>>> *To:* Michael Himes <mh...@knights.ucf.edu>;
>>> dev@datasketches.apache.org <de...@datasketches.apache.org>
>>> *Cc:* Edo Liberty <ed...@gmail.com>; edo@edoliberty.com <
>>> edo@edoliberty.com>
>>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>>> open-source academic software
>>>
>>> Michael,
>>>
>>> Thank you for considering the DataSketches library.   I am adding this
>>> thread to our dev@datasketches.apache.org so that our whole team can
>>> contribute to finding a solution for you.
>>>
>>> WRT the error you experienced, please help us help you by sharing with
>>> us what the exact error was.
>>>
>>> We are about to release a major upgrade to the DataSketches C++/Python
>>> product in the next few weeks.  We have fixed a number of stability issues
>>> and bugs, which may solve the problem.  Nonetheless, we want to work with
>>> you to get your problem solved.
>>>
>>> Updating 1e5 sketches in a system is not a problem in Java or C++.   We
>>> have real-time systems today that generate and process over 1e9 sketches
>>> every day.  Unfortunately our experience tells us that looping in Python
>>> code will be 10 to 100 times slower than Java or C++.  This is because the
>>> code would have to switch from Python to C++ for every vector element.
>>>
>>> By comparison, the streaming-quantiles code could be easily modified to
>>> use Numpy arrays and operate on vectors.
>>>
>>>
>>> I would like to understand more about what you have in mind that would
>>> be "easily modified".
>>>
>>> NumPy achieves its speed performance by doing all of the matrix
>>> operations in pre-compiled C++ code.  To achieve best performance, we would
>>> want to read and loop through the NumPy data structure on the C++ side
>>> leveraging the C++ DataSketches library directly.  I am not sure what would
>>> be involved to actually accomplish that.
>>>
>>> But first we need to get your Python + NumPy code working correctly with
>>> our library so we can find out what its actual performance is.
>>>
>>> Cheers,
>>>
>>> Lee.
>>>
>>>
>>>
>>>
>>>
>>> On Wed, May 6, 2020 at 12:10 PM Michael Himes <mh...@knights.ucf.edu>
>>> wrote:
>>>
>>> Hi Edo, Lee,
>>>
>>> Thanks for the prompt response.  I looked at the datasketches library,
>>> and while it seems to have a lot more features, it looks like it'll be a
>>> lot more difficult to get it to work for my desired use case.
>>>
>>> My problem is that I need quantiles for each element of a vector (length
>>> on the order of 1e4 -- 1e5), for some finite stream of vectors (on the
>>> order of 1e6 -- 1e8).  I tried using datasketches's KLL with Numpy arrays,
>>> but it throws an error, so it doesn't seem like datasketches handles this
>>> situation currently.
>>>
>>> To use datasketches, I think I would need to instantiate 1 object per
>>> vector element, and I suspect this will slow things down considerably due
>>> to iterating over the objects when each vector is processed.  By
>>> comparison, the streaming-quantiles code could be easily modified to use
>>> Numpy arrays and operate on vectors.  I ran a few unit tests on both codes
>>> and found equivalent behavior, as expected.
>>>
>>> Do you have any recommendation(s) for this situation?  Are there known
>>> limitations of the streaming-quantiles code that would cause issues for my
>>> use case?  Are the other methods offered in datasketches 'better' than the
>>> KLL implemented in streaming-quantiles?  I'm quite out of my area of
>>> expertise, so I appreciate any advice you can offer, and I will of course
>>> acknowledge it in the publication.
>>>
>>> Best,
>>> Michael
>>>
>>> ------------------------------
>>> *From:* Edo Liberty <ed...@gmail.com>
>>> *Sent:* Tuesday, May 5, 2020 8:09 PM
>>> *To:* Lee Rhodes <lr...@verizonmedia.com>; Michael Himes <
>>> mhimes@knights.ucf.edu>
>>> *Cc:* edo@edoliberty.com <ed...@edoliberty.com>
>>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>>> open-source academic software
>>>
>>> +Lee
>>>
>>> Hi Michael, Thanks for reaching out.
>>> While you can certainly do that, I recommend using the Python bindings of the
>>> datasketches library. It will be more robust, faster, and bug-free than my
>>> code :)
>>>
>>> On Tue, May 5, 2020 at 14:11 Michael Himes <mh...@knights.ucf.edu>
>>> wrote:
>>>
>>> Hi Edo,
>>>
>>> I'm currently working on a Python package for
>>> machine-learning-accelerated exoplanet modeling.  It is free and open
>>> source (see here if you're curious https://github.com/exosports/HOMER),
>>> and it's meant purely for reproducible academic research.
>>>
>>> I'm adding some new features to the software, and one of them requires
>>> computing quantiles for a data set that cannot fit into memory.  After
>>> searching around for different methods to do this, your KLL method seemed
>>> to be a good option in terms of speed and space requirements.
>>>
>>> Rather than reinvent the wheel and code my own implementation of the
>>> method from scratch, I was wondering if you'd be willing to allow me to use
>>> your code?  I don't see a license, so I wanted to make sure you're okay
>>> with this.  I could implement it as a submodule within my repo, or I could
>>> only include the kll.py file and add some additional comments pointing to
>>> your repo and such, whichever you prefer.
>>>
>>> Best,
>>> Michael
>>>
>>> --
>>> From my cell phone.
>>>
>>>

Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Posted by leerho <le...@gmail.com>.
We had a similar issue in Java trying to use JNI to access C code.  Every
transition across the "boundary" between Java and C took from 10 to 100
microseconds.  This made the JNI option pretty useless from our
standpoint.

I don't know python that well, but I could well imagine that there may be a
similar issue here in moving data between Python and C++.

That being said, even these (what we consider slow-performing) sketches may
still be a huge win in Python compared to brute-force computation of these
types of queries.

Lee.



On Tue, May 19, 2020 at 1:28 PM Jon Malkin <jo...@gmail.com> wrote:

> I tried comparing the performance of the existing floats sketch vs the new
> thing with a single dimension. And then I made a second update method that
> handles a single item rather than creating an array of length 1 each time.
> Otherwise, the scripts were as identical as possible. I fed in 2^25
> gaussian-distributed values and queried for the median to force some
> computation on the sketch. I think get_quantile(0.5) vs
> get_quantiles(0.5)[0][0] was the only difference.
>
> Existing kll_floats_sketch: 31s
> kll_floatarray_sketches: 123s
> with single-item update: 80s
>
> Same test in c++: 1.7s  (I can get it to 1.4s but that's using a worse RNG
> so this seemed more fair)
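For reference, the single-sketch side of that comparison can be reproduced with something like the script below; the k value and the RNG are assumptions, not necessarily what Jon used.

import time
import numpy as np
from datasketches import kll_floats_sketch

n = 1 << 25
values = np.random.randn(n)

sk = kll_floats_sketch(200)
start = time.time()
for v in values:
    sk.update(v)                      # one Python-to-C++ crossing per item
elapsed = time.time() - start

print('updated %d items in %.1f s' % (n, elapsed))
print('median estimate:', sk.get_quantile(0.5))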
>
> I didn't try batching updates, even though in theory
> the new object can support that. This was more a test to see the
> performance impact of using it for all kll sketches.
>
> At some level, if you're already ok taking the speed hit for python vs C++
> then maybe it doesn't matter. But >2x still seems significant to me.
>
>   jon
>
> On Thu, May 14, 2020 at 6:54 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
>> Great, I'll be submitting the pull request shortly.  The codebase I'm
>> working with doesn't have any of the changes made in the past week or so,
>> hopefully that isn't too much of a hassle to merge.
>>
>> As an aside, my employer encourages us to contribute code to libraries
>> like this, so I'm happy to work on additional features for the Python
>> interface as needed.
>>
>> Michael
>> ------------------------------
>> *From:* Jon Malkin <jo...@gmail.com>
>> *Sent:* Thursday, May 14, 2020 6:56 PM
>> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> We've been polishing things up for a release, so that was one of several
>> things that we fixed over the last several days. Thank you for finding it!
>>
>> Anyway, if you're generally happy with the state of things (and are
>> allowed to under any employment terms), I'd encourage you to create a pull
>> request to merge your changes into the main repo. It doesn't need to be
>> perfect as we can always make changes as part of the PR review or
>> post-merge.
>>
>> Thanks,
>>   jon
>>
>>
>> On Mon, May 11, 2020 at 2:25 PM Michael Himes <mh...@knights.ucf.edu>
>> wrote:
>>
>> Thanks for taking a look, Jon.
>>
>> I pushed an update that address 2 & 4.
>>
>> #3 is actually something I had a question about. I've tested passing
>> numpy.nan into the update function, and it doesn't appear to break anything
>> (min, max, etc all still work correctly).  However, the reported number of
>> items per sketch counts the nan entries.  Is this the expected behavior, or
>> should the get_n() method return a number that does not count the nans it
>> has seen?  I expected the latter, so I'm worried that numpy's nan is being
>> treated differently.
>>
>> Michael
>> ------------------------------
>> *From:* Jon Malkin <jo...@gmail.com>
>> *Sent:* Monday, May 11, 2020 4:32 PM
>> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> I didn't look in super close detail, but the code overall looks pretty
>> good. Comments are below.
>>
>> Note that not all of these necessarily need changes or replies. I'm just
>> trying to document things we'll want to think about for keeping the library
>> general-purpose (and we can always make changes after merging, of course).
>>
>> 1. I worry the name kll_sketches is confusingly similar to kll_sketch.
>> Maybe vector_kll_sketches? But if there's a way to extend KLL in the future
>> to operate on an entire vector at a time (vs treating each dimension
>> independently) that'd become confusing. I think an inherently vectorized
>> version would be a very different beast, but I always worry I'm not being
>> imaginative enough. If merging into the Apache codebase, I'd probably wait
>> to see what the file looks like with the renaming before a final decision
>> on moving to its own file.
>>
>> 2. What happens if the input to update() has >2 dimensions? If that'd be
>> invalid, we should explicitly check and complain. If it'll Do The Right
>> Thing by operating on the first 2 dimensions (meaning correct indices)
>> that's fine, but otherwise it should probably complain (a check of this kind is sketched after this list).
>>
>> 3. Can this handle sparse input vectors? Not sure how important that is
>> in general, even if your project doesn't require it. kll_sketch will ignore
>> NaNs, so those appearing would mean the number of items per sketch can
>> already differ.
>>
>> 4. I'd probably eat the very slightly increased space and go with 32 bits
>> for the number of dimensions (aka number of sketches). If trying to look at
>> a distribution of values for some machine learning application, it'd be
>> easy to overflow 65k dimensions for some tasks.
>>
>> 5. I imagine you've realized that it's easiest to do unit tests from
>> python in this case. That's another advantage of having this live in the
>> wrapper.
>>
>> 6. Finally, that assert issue is already obsolete :). Asserts were
>> converted to if/throw exceptions late last week. It'll be flagged as a
>> conflict in merging, so no worries for now.
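As an illustration of the explicit check suggested in point 2, the wrapper's update path could guard its input along these lines (illustrative only, not the code under review):

import numpy as np

def coerce_update_input(items):
    # Accept a single vector or a 2-D batch of vectors; reject anything else.
    arr = np.asarray(items, dtype=np.float64)
    if arr.ndim > 2:
        raise ValueError('update() expects a 1-D or 2-D array, got ndim=%d' % arr.ndim)
    return np.atleast_2d(arr)          # shape: (num_vectors, num_streams)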
>>
>> Looking good at this point. And as I said, not all of these need changes
>> or comments from you.
>>
>>   jon
>>
>> On Mon, May 11, 2020 at 7:09 AM Michael Himes <mh...@knights.ucf.edu>
>> wrote:
>>
>> Understood, I went ahead and moved the new class to the kll_wrapper.cpp
>> file -- I'll leave it to you to decide if it's better as its own file.
>>
>> Also, while gcc 7.4.0 compiles the code without issue, using gcc 7.5.0
>> throws errors regarding the assert calls in kll_sketch_impl.hpp.  I added
>> an include of assert.h there and then it compiled without issue.  It's
>> possible that other compilers will also complain about that, so maybe this
>> is a good update to the main branch.
>>
>> Michael
>> ------------------------------
>> *From:* Jon Malkin <jo...@gmail.com>
>> *Sent:* Sunday, May 10, 2020 10:47 PM
>> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> My only comment without having looked at actual code is that the new
>> class would be more appropriate in the python wrapper. Maybe even drop it
>> in as its own file, as that would decrease recompile time a bit when
>> debugging (that's pybind's suggestion, anyway). Probably not a huge
>> difference with how light these wrappers are.
>>
>> If this is something that becomes widely used, to where we look at
>> pushing it into the base library, we'd look at whether we could share any
>> data across sketches. But we're far from that point currently. It'd be nice
>> to need to consider that.
>>
>>   jon
>>
>> On Sun, May 10, 2020, 7:33 PM leerho <le...@gmail.com> wrote:
>>
>> Michael,  this has been a great interchange and certainly will allow us
>> to move forward more quickly.
>>
>> Thank you for working on this on a Mother's Day Sunday!
>>
>> I'm sure Alex and Jon may have more questions, when they get a chance to
>> look at it starting tomorrow.
>>
>> Cheers, and be safe and well!
>>
>> Lee.
>>
>> On Sun, May 10, 2020 at 6:25 PM Michael Himes <mh...@knights.ucf.edu>
>> wrote:
>>
>> Re: testing, so far I've just done glorified unit tests for uniform and
>> normal distributions of varying sizes.  I plan to do some timing tests vs
>> the existing single-sketch Python class to see how it compares for 1, 10,
>> and 100 streams.
>>
>> 1. That makes sense.  One option to allow full Numpy compatibility but
>> without requiring a Python user to use Numpy would be to return everything
>> as lists, rather than Numpy arrays.  Numpy users could then convert those
>> lists into arrays, and non-Numpy users would be unaffected (aside from
>> needing the pybind11/numpy.h header).  Alternatively, some flag could be
>> set when instantiating the object that would control whether things are
>> returned as lists or arrays, though this still requires the numpy.h header
>> file.
>>
>> 2. I didn't change the kll_sketch code, I only defined a new (wrapper)
>> class called kll_sketches, which spawns a user-specified number of
>> sketches.  Each of those sketches are kll_sketch objects and uses all of
>> the existing code for that.  For fast execution in Python, the parallel
>> sketches must be spawned in C++, but the existing Python object could only
>> spawn a single sketch since it wraps the kll_sketch class.  Perhaps the
>> kll_sketches class would be better placed in the python/src/kll_wrapper.cpp
>> file?  I suppose you wouldn't need this class if you weren't using Python.
>>
>> 3. Yes, SerDe is very straight-forward here.  I've marked some stuff as
>> todo's, and that is one of them -- the plan is to do like you described and
>> call the relevant kll_sketch method on each of the sketches and return that
>> to Python in a sensible format.  For deserialization, it would just iterate
>> through them and load them into the kll_sketches object.  I don't require
>> it for my project, so I didn't bother to wrap that yet -- I'll take a look
>> sometime this week after I finish my work for the day, shouldn't take long
>> to do.
>>
>> 4. That makes sense.  Does using Numpy complicate that at all?  My
>> thought is that since under the hood everything is using the existing
>> kll_sketch class, it would have full compatibility with the rest of the
>> library (once SerDe is added in).
>>
>> Michael
>> ------------------------------
>> *From:* leerho <le...@gmail.com>
>> *Sent:* Sunday, May 10, 2020 8:42 PM
>> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> Thanks for the link to your code.  My colleagues, Jon and Alex, will take
>> a closer look this next week.  They wrote this code so they are much closer
>> to it than I.
>>
>> What you have done so far makes sense for you as you want to get this
>> working in the NumPy environment as quickly as possible.  As soon as we
>> start thinking about incorporating this into our library other concerns
>> become important.
>>
>> 1. Adding API calls is the recommended way to add functionality (like
>> NumPy) to a library.  We cannot change API calls in a way that is only
>> useful with NumPy, because it would seriously impact other users of the
>> library that don't need NumPy.  If both sets of calls cannot simultaneously
>> exist in the same sketch API, then we need to consider other alternatives.
>>
>> 2.  Based on our previous discussions, I didn't envision that you would
>> have to change the kll_sketch code itself other than perhaps a "wrapper"
>> class that enables vectorized input to a vector of sketches and a
>> vectorized get result that creates a vector result from a vector of
>> sketches.  This would isolate the changes you need for NumPy from the
>> sketch itself.  This is also much easier to support, maintain and debug.
>>
>> 3. If you don't change the internals of the sketch then SerDe becomes
>> pretty straightforward. I don't know if you need a single serialization
>> that represents a full vector of sketches,  but if you do, then I would
>> just iterate over the individual serdes and figure out how to package it.
>> I really don't think you want to have to rewrite this low-level stuff.
>>
>> 4. Binary compatibility is critically important for us and I think will
>> be important for you as well.  There are two dimensions of binary
>> compatibility: history and language.  This means that a kll sketch
>> serialized from Java can be successfully read by C++ and vice versa.
>> Similarly, a kll sketch serialized today will be able to be read many years
>> from now.     Another aspect of this would mean being able to collect, say,
>> 100 sketches that were not created using the NumPy version, and being able
>> to put them together in a NumPy vector; and vice versa.
>>
>> I hope all of this makes sense to you.
>>
>> Cheers,
>>
>> Lee.
>>
>>
>>
>> On Sun, May 10, 2020 at 4:21 PM leerho <le...@gmail.com> wrote:
>>
>> Michael,
>> This is great!  What testing have you been able to do so far?
>>
>>
>> On Sun, May 10, 2020 at 3:31 PM Michael Himes <mh...@knights.ucf.edu>
>> wrote:
>>
>> Lee,
>>
>> Thanks for all of that information, it's quite helpful to get a better
>> understanding of things.
>>
>> I've put the code on Github if you'd like to take a look:
>> https://github.com/mdhimes/incubator-datasketches-cpp
>>
>> Changes are
>> - new class in kll/include/kll_sketch.hpp, w/ associated constructor in
>> kll/include/kll_sketch_impl.hpp.  This class spawns a specified number of
>> sketches.
>> - new Python interface functions in python/src/kll_wrapper.cpp
>>
>> The only new dependency introduced is the pybind11/numpy.h header file.
>> The new Numpy-compatible Python classes retain identical functionality to
>> the existing classes (with minor changes to method names, e.g.,
>> get_min_value --> get_min_values), except that I have not yet implemented
>> merging or (de)serialization.  These would be straight-forward to
>> implement, if needed.
>>
>> Re: characterization tests, I'll take a look at those tests you linked to
>> and see about running them, time and compute resources permitting.
>>
>> Michael
>> ------------------------------
>> *From:* leerho <le...@gmail.com>
>> *Sent:* Sunday, May 10, 2020 5:32 PM
>> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> Michael,
>>
>> Is there a place on GitHub somewhere where I could look at your code so
>> far?  The reason I ask, is before you do a PR, we would like to determine
>> where a contribution such as this should be placed.
>>
>> Our library is split up among different repositories, determined by
>> language and dependencies.  This keeps the user downloads smaller and more
>> focused.   We have two library repos for the core sketch algorithms, one
>> for Java and one for C++/Python, where the dependencies are very lean,
>> which simplifies integration into other systems.  We have separate repos
>> for adaptors, which depend on one of the core repos. On the Java side, we
>> have separate repos for adaptors for Apache Hive and Apache Pig, as the
>> dependencies for each of these are quite large.  For C++, we have a
>> dedicated repo for the adaptors for PostgreSQL.
>>
>> Some of our adaptors are hosted with the target system.  For example, our
>> Druid adaptors were contributed directly into Apache Druid.
>>
>> I assume your code has dependencies on Python, NumPy and
>> DataSketches-cpp. It is not clear to me at the moment whether we should
>> create a separate repo for this or have a separate group of directories in
>> our cpp repo.
>>
>> ****
>> We have a separate repo for our characterization code, which is not
>> formally "released" as an Apache release.  It exists because we want others
>> to be able to reproduce (or challenge) our claims of speed performance or
>> accuracy.  It is the one repo where we have all languages and many
>> different dependencies.  The coding style is not as rigorous or as well
>> documented as our repos that do have formal releases.
>>
>> Characterization testing is distinctly different from Unit Tests, which
>> basically check all the main code paths and make sure that the program
>> works as it should.  The key metric is code coverage, and Unit Tests should
>> be fast, as they are run on every check-in of new code.  Characterization is
>> also different from Integration Testing, which is testing how well the code
>> works when integrated into larger systems.
>>
>> Characterization tests are unique to our kind of library. Because our
>> algorithms are probabilistic in nature, in order to verify accuracy or
>> speed performance we need to run many thousands of trials to eliminate
>> statistical noise in the results.  And when the data is large, this can
>> take a long time.  You can peruse our website for many examples as all the
>> plots result from various characterization studies.  What appears on the
>> website is but a small fraction of all the testing we have done.
>>
>> There are no "standard" tests as every sketch is different so we have to
>> decide what is important to measure for a particular sketch, but the basic
>> groups are *speed* and *accuracy*.
>>
>> For speed there are many possible measurements, but the basic ones are
>> update speed, merge speed, Serialization / Deserialization speed, get
>> estimate or get result speeds.
>>
>> For accuracy we want to validate that the sketch is performing within the
>> bounds of the theoretical error distribution.  We want to measure this
>> accuracy in the context of a stand-alone, purely streaming sketch and also
>> in the context of merging many sketches together.
>>
>> We also try to do these same tests comparing the results against other
>> alternatives users might have.  We have performed these same
>> characterizations on other publicly available sketches as well as against
>> traditional, brute-force approaches to solving the same problem.
>>
>> For the solution you have developed, we would depend on you to decide
>> what properties would be most important to characterize for users of this
>> solution.  It should be very similar to what you would write in a paper
>> describing this solution;  you want to convince the reader that this is
>> very useful and why.
>>
>> Since the first sketch you have leveraged is the KLL quantiles sketch, I
>> would think you would want some characterizations similar to what we did
>> for our studies
>> <https://datasketches.apache.org/docs/Quantiles/KLLSketch.html>
>> comparing our older quantiles sketch and the KLL sketch.
>>
>> ****
>> For the Java characterization tests, we have "standardized" on having
>> small configuration files which define the key parameters of the test.
>> These are simple text files
>> <https://github.com/apache/incubator-datasketches-characterization/tree/master/src/main/resources>
>> of key-value pairs.  We don't have any centralized definition of these
>> pairs, just that they are human readable and intelligible.  They are
>> different for each type of sketch.
>>
>> For the C++ tests, we don't have a collection of config files yet (this
>> is one of our TODOs), but the same kind of parameters are set in the code
>> itself.
>>
>> We will likely want to set up a separate directory for your
>> characterization tests.
>>
>> I hope you find this helpful.
>>
>> Cheers,
>>
>> Lee.
>>
>> On Sun, May 10, 2020 at 10:05 AM Michael Himes <mh...@knights.ucf.edu>
>> wrote:
>>
>> The code is in a good state now.  It can take individual values, lists,
>> or Numpy arrays as input, and it returns back Numpy arrays.  There are some
>> additional features, like being able to specify which sketches the user
>> wants to, e.g., get quantiles for.
>>
>> But, I have only done minor testing with uniform and normal
>> distributions.  I'd like to put it through more extensive testing (and some
>> documentation) before releasing it, and it sounds like your
>> characterization tests are the way to go -- it's not science if it's not
>> reproducible!  Is there a standard set of tests for this purpose?  If not,
>> are there standard tests that have been used for the existing codebase?
>>
>> Michael
>> ------------------------------
>> *From:* leerho <le...@gmail.com>
>> *Sent:* Saturday, May 9, 2020 7:21 PM
>> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> This is great.  The first step is to get your project working!  Once you
>> think you are ready, it would be really useful if you could do some
>> characterization testing in the NumPy environment. Characterization tests
>> are what we run to fully understand how a sketch performs over a range of
>> parameters and using thousands to millions of trials.  You can see some of
>> the accuracy and speed performance plots of various sketches on our
>> website.  Sometimes these can take hours to run.  We typically use
>> synthetic data to drive our characterization tests to make them
>> reproducible.
>>
>> Real data can also be used and one comparison test I would recommend is
>> comparing how long it takes to get approximate results using sketches
>> versus how long it would take to get exact results using brute force
>> methods.  The bigger the data set is the better :)
>>
>> We don't have much experience with NumPy so this will be a new
>> environment for us.  But before you get too deep into this please get us
>> involved.  We have been characterizing these streaming algorithms for a
>> number of years, and would like to help you.
>>
>> Cheers,
>>
>> Lee.
>>
>> On Sat, May 9, 2020 at 2:18 PM Michael Himes <mh...@knights.ucf.edu>
>> wrote:
>>
>> I'm not quite sure what being a committer entails, but yeah I'm happy to
>> contribute.  I can't commit a lot of time to working on it, but with how
>> things went for KLL I don't think it will take a lot of time for the other
>> sketches if they are formatted in a similar manner.  Getting this library
>> integrated into numpy/scipy would be awesome, I'm sure I could get some
>> others in my field to begin using it.
>>
>> Michael
>> ------------------------------
>> *From:* Lee Rhodes <lr...@verizonmedia.com>
>> *Sent:* Saturday, May 9, 2020 5:06 PM
>> *To:* Michael Himes <mh...@knights.ucf.edu>; dev@datasketches.apache.org
>> <de...@datasketches.apache.org>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> This is just awesome!   Would you be interested in becoming a committer
>> on our project?  It is not automatic, but we could work with you to bring
>> you up to speed on the other sketches in the library.  If you could help us
>> integrate DataSketches into NumPy and possibly SciPy (not sure if this is
>> necessary) it would be a very significant contribution and we would
>> definitely want you to be part of our community!
>>
>> Thanks,
>>
>> Lee.
>>
>> On Sat, May 9, 2020 at 1:41 PM Michael Himes <mh...@knights.ucf.edu>
>> wrote:
>>
>> Hi Lee,
>>
>> Thanks for the notice, I went ahead and subscribed to the list.
>>
>> As for Jon's email, this is actually what I have currently implemented!
>> Once I finish ironing out a couple improvements, I'm going to move some
>> code around to follow the existing coding style, put it on Github, and
>> submit a pull request.
>>
>> Michael
>> ------------------------------
>> *From:* Lee Rhodes <lr...@verizonmedia.com>
>> *Sent:* Saturday, May 9, 2020 4:22 PM
>> *To:* Michael Himes <mh...@knights.ucf.edu>
>> *Subject:* Fwd: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> Hi Michael,
>> I don't think you saw this email as I doubt you are subscribed to our
>> dev@datasketches.apache.org email list.
>>
>> We would like to have you as part of our larger community, as others
>> might also have suggestions on how to move your project forward.
>> You can subscribe by sending an empty email to
>> dev-subscribe@datasketches.apache.org.
>>
>> Lee.
>>
>> ---------- Forwarded message ---------
>> From: *Jon Malkin* <jo...@gmail.com>
>> Date: Thu, May 7, 2020 at 4:11 PM
>> Subject: Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>> To: <de...@datasketches.apache.org>
>> Cc: Lee Rhodes <lr...@verizonmedia.com>, Edo Liberty <
>> edo.liberty@gmail.com>, edo@edoliberty.com <ed...@edoliberty.com>
>>
>>
>> We're using pybind11 to get a C++ interface with python (vs raw C). The
>> wrappers themselves are quite thin, but they do have examples of calling
>> functions defined in the wrapper as opposed to only the sketch object.
>>
>> I believe the easiest way to do this will be to define a pretty simple
>> C++ object and create a pybind wrapper for it.  That object would contain a
>> std::vector<kll_sketch>.  Then you'd define an update method for your
>> custom object that iterates through a numpy array and calls update() on the
>> appropriate sketch. You'd also want to define something similar for
>> get_quantile() or whatever other methods you need that iterates through
>> that vector of sketches and returns the result in a numpy array.
>>
>> That's a pretty lightweight object. And then you'd use a similar thin
>> pybind wrapper around it to make it play nicely with python. Since our C++
>> library is just templates, you'd end up with a free-standing library, with
>> no requirement that the base datasketches library be involved.
>>
>>   jon
>>
>> On Thu, May 7, 2020 at 1:08 PM Michael Himes <mh...@knights.ucf.edu>
>> wrote:
>>
>> I would be happy to share whatever I come up with (if anything).  The
>> lack of a Numpy/Scipy implementation is what led me to the DataSketches
>> library; it would be very useful to me and others if it were a part of
>> Numpy/Scipy.
>>
>> For what it's worth, passing in a Numpy array and manipulating it from
>> the C++ side is quite easy.  On the other hand, figuring out how to spawn m
>> sketches and pass the values along to that looks like it'll be more
>> challenging, there is a lot of code here and it'll take some time for me to
>> familiarize myself with it.
>>
>> Michael
>> ------------------------------
>> *From:* Lee Rhodes <lr...@verizonmedia.com>
>> *Sent:* Thursday, May 7, 2020 12:00 PM
>> *To:* Michael Himes <mh...@knights.ucf.edu>
>> *Cc:* Edo Liberty <ed...@gmail.com>; dev@datasketches.apache.org <
>> dev@datasketches.apache.org>; edo@edoliberty.com <ed...@edoliberty.com>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> If you do figure out how to do this, it would be great if you could share
>> it with us.  We would like to extend  it to other sketches and submit it as
>> an added functionality to NumPy.  I have been looking at the NumPy and
>> SciPy libraries and have not found anything close to what we have.
>>
>> Lee.
>>
>>
>> On Thu, May 7, 2020 at 7:08 AM Michael Himes <mh...@knights.ucf.edu>
>> wrote:
>>
>> Hi Lee, Jon,
>>
>> Thanks for the information.  I tried to vectorize things this morning and
>> ran into that exact problem -- since the offsets can differ, it leads to
>> slices of different lengths, which wouldn't be possible to store as a
>> single Numpy array.
>>
>> Lee, your understanding of my problem is spot on.  n vectors of size m,
>> where all m elements of each vector are a float (no NaNs or missing
>> values).  I am interested in quantiles at rank r for each of the m
>> streams.  Only 1 sketch will operate simultaneously, saving/loading the
>> sketch is not required (though it would be a nice feature), and sketches
>> would not need to be merged (no serialization/deserialization).
>>
>> Not surprisingly, it looks like your original suggestion of handling this
>> on the C++ side is the way to go.  Once I have time to dive into the code,
>> my plan is to write something that implements what you described in the
>> earlier email.
>>
>> Thanks,
>> Michael
>> ------------------------------
>> *From:* Lee Rhodes <lr...@verizonmedia.com>
>> *Sent:* Wednesday, May 6, 2020 10:43 PM
>> *To:* Michael Himes <mh...@knights.ucf.edu>
>> *Cc:* dev@datasketches.apache.org <de...@datasketches.apache.org>; Edo
>> Liberty <ed...@gmail.com>; edo@edoliberty.com <ed...@edoliberty.com>
>>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> Michael,
>>
>> One of my colleagues, Jon Malkin, pointed out that the vector-KLL will
>> not work for another reason and that is for each dimension, choosing
>> whether to delete the odd or even values in the compactor must be random
>> and independent of the other dimensions.  Otherwise you might get unwanted
>> correlation effects between the dimensions.
>>
>> This is another argument that you should have independent compactors for
>> each dimension.  So you might as well stick with individual sketches for
>> each dimension.
>>
>> Lee.
>>
>> On Wed, May 6, 2020 at 4:39 PM Lee Rhodes <lr...@verizonmedia.com>
>> wrote:
>>
>> Michael,
>>
>> Allow me to back up for a moment to make sure I understand your problem.
>>
>> You have a large number of large vectors of the form *V_n = {x_i}:*  *n*
>> vectors of size *m, *where *x* is a *number* and *x_i* is the *i*th
>> element, or equivalently, the *i*th dimension.
>>
>> Assumptions:
>>
>>    - All vectors, *V*, are of the same size *m.*
>>    - All elements, *x_i*, are valid numbers of the same type. No missing
>>    values, and if you are using *floats*, this means no *NaN*s.
>>
>> In aggregate, the *n* vectors represent *m* *independent* distributions
>> of values.
>>
>> Your task is to be able to obtain *m* quantiles at rank *r* in a single
>> query.
>>
>> ****
>> To do this, using your idea, would require vectorization of the entire
>> sketch and not just the compactors.  The inputs are vectors, the result of
>> operations such as getQuantile(r), getQuantileUpperBound(r),
>> getQuantileLowerBound(r), are also vectors.
>>
>> This sketch will be a large data structure, which leads to more questions
>> ...
>>
>>    - Do you anticipate having many of these vectorized sketches
>>    operating simultaneously?
>>    - Is there any requirement to store and later retrieve this sketch?
>>    - Or, the nearly equivalent question: Do you require merging of these
>>    sketches (across clusters, for example)?  Which also means serialization
>>    and deserialization.
>>
>> I am concerned that this vector-quantiles sketch would be limited in the
>> sense that it may not be as widely applicable as it could be.
>>
>> Our experience with real data is that it is ugly, with missing values,
>> NaNs, nulls, etc., which means we would not be able to vectorize the
>> compactor.  Each dimension *i* would need a separate independent
>> compactor because the compaction times will vary depending on missing
>> values or NaNs in the data.
>>
>> Spacewise, I don't think vectorizing the entire sketch would be much smaller
>> than having separate independent sketches for each dimension, because
>> the internals of the existing sketch are already quite space efficient
>> leveraging compact arrays, etc.
>>
>> As a first step I would favor figuring out how to access the NumPy data
>> structure on the C++ side, having individual sketches for each
>> dimension, and doing the iterations updating the sketches in C++.   It also
>> has the advantage of leveraging code that exists and it would automatically
>> be able to leverage any improvements to the sketch code over time.  In
>> addition, it could be a prototype of how to integrate other sketches into
>> the NumPy ecosystem.
>>
>> A fully vectorized sketch would be a separate implementation and would
>> not be able to take advantage of these points.
>>
>> Lee.
>>
>> On Wed, May 6, 2020 at 2:47 PM Michael Himes <mh...@knights.ucf.edu>
>> wrote:
>>
>> Hi Lee,
>>
>> I don't think there is a problem with the DataSketches library, just that
>> it doesn't support what I am trying to do -- looking in the documentation,
>> it only supports streams of ints or floats, and those situations work fine
>> for me.  Here's what I did:
>> - began with the KLL test .py file:
>> https://github.com/apache/incubator-datasketches-cpp/blob/master/python/tests/kll_test.py
>> - replaced line 30 with kll.update(np.ones(10) * randn())  to have a
>> Numpy array of 10 identical values.
>> - ran the code
>>
>> This leads to the following error, as expected:
>> TypeError: update(): incompatible function arguments. The following
>> argument types are supported:
>>     1. (self: datasketches.kll_floats_sketch, item: float) -> None
>>
>> Invoked with: <datasketches.kll_floats_sketch object at 0x7f1e128989d0>,
>> array([-1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424,
>>        -1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424])
>>
>> It's not coded to support Numpy arrays, so it complains.  What I
>> would ideally like to have happen in this scenario is it would treat each
>> element in the array as a separate stream.  Then, later when getting a
>> given quantile, it would give 10 values, one for each stream.  I don't see
>> an easy approach to implementing this on the Python side besides a very
>> slow iterative approach, and admittedly my C++ is quite rusty so I haven't
>> looked into the codebase to see how I might modify things there to support
>> this functionality.
>>
>> Re: the streaming-quantiles code being easily modified, I believe the
>> only necessary changes would be changing the Compactor class to be a
>> subclass of numpy.ndarray, rather than list, and implementing the
>> list-specific methods that are used, like .append().  Then, it isn't
>> necessary to loop over the streams since we can make use of Numpy's
>> broadcasting, which will handle the looping in its C++ code, as you
>> mentioned.  I'll work on this and see if it really is as straightforward
>> as it seems.
>>
>> If you have any advice on how to use DataSketches for my problem, I'm
>> certainly open to that.
>>
>> Thanks,
>> Michael
>> ------------------------------
>> *From:* Lee Rhodes <lr...@verizonmedia.com>
>> *Sent:* Wednesday, May 6, 2020 4:37 PM
>> *To:* Michael Himes <mh...@knights.ucf.edu>; dev@datasketches.apache.org
>> <de...@datasketches.apache.org>
>> *Cc:* Edo Liberty <ed...@gmail.com>; edo@edoliberty.com <
>> edo@edoliberty.com>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> Michael,
>>
>> Thank you for considering the DataSketches library.   I am adding this
>> thread to our dev@datasketches.apache.org so that our whole team can
>> contribute to finding a solution for you.
>>
>> WRT the error you experienced, please help us help you by sharing with us
>> what the exact error was.
>>
>> We are about to release a major upgrade to the DataSketches C++/Python
>> product in the next few weeks.  We have fixed a number of stability issues
>> and bugs, which may solve the problem.  Nonetheless, we want to work with
>> you to get your problem solved.
>>
>> Updating 1e5 sketches in a system is not a problem in Java or C++.   We
>> have real-time systems today that generate and process over 1e9 sketches
>> every day.  Unfortunately our experience tells us that looping in Python
>> code will be 10 to 100 times slower than Java or C++.  This is because the
>> code would have to switch from Python to C++ for every vector element.
>>
>> By comparison, the streaming-quantiles code could be easily modified to
>> use Numpy arrays and operate on vectors.
>>
>>
>> I would like to understand more about what you have in mind that would be
>> "easily modified".
>>
>> NumPy achieves its speed performance by doing all of the matrix
>> operations in pre-compiled C++ code.  To achieve best performance, we would
>> want to read and loop through the NumPy data structure on the C++ side
>> leveraging the C++ DataSketches library directly.  I am not sure what would
>> be involved to actually accomplish that.
>>
>> But first we need to get your Python + NumPy code working correctly with
>> our library so we can find out what its actual performance is.
>>
>> Cheers,
>>
>> Lee.
>>
>>
>>
>>
>>
>> On Wed, May 6, 2020 at 12:10 PM Michael Himes <mh...@knights.ucf.edu>
>> wrote:
>>
>> Hi Edo, Lee,
>>
>> Thanks for the prompt response.  I looked at the datasketches library,
>> and while it seems to have a lot more features, it looks like it'll be a
>> lot more difficult to get it to work for my desired use case.
>>
>> My problem is that I need quantiles for each element of a vector (length
>> on the order of 1e4 -- 1e5), for some finite stream of vectors (on the
>> order of 1e6 -- 1e8).  I tried using datasketches's KLL with Numpy arrays,
>> but it throws an error, so it doesn't seem like datasketches handles this
>> situation currently.
>>
>> To use datasketches, I think I would need to instantiate 1 object per
>> vector element, and I suspect this will slow things down considerably due
>> to iterating over the objects when each vector is processed.  By
>> comparison, the streaming-quantiles code could be easily modified to use
>> Numpy arrays and operate on vectors.  I ran a few unit tests on both codes
>> and found equivalent behavior, as expected.
>>
>> Do you have any recommendation(s) for this situation?  Are there known
>> limitations of the streaming-quantiles code that would cause issues for my
>> use case?  Are the other methods offered in datasketches 'better' than the
>> KLL implemented in streaming-quantiles?  I'm quite out of my area of
>> expertise, so I appreciate any advice you can offer, and I will of course
>> acknowledge it in the publication.
>>
>> Best,
>> Michael
>>
>> ------------------------------
>> *From:* Edo Liberty <ed...@gmail.com>
>> *Sent:* Tuesday, May 5, 2020 8:09 PM
>> *To:* Lee Rhodes <lr...@verizonmedia.com>; Michael Himes <
>> mhimes@knights.ucf.edu>
>> *Cc:* edo@edoliberty.com <ed...@edoliberty.com>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> +Lee
>>
>> Hi Michael, Thanks for reaching out.
>> While you can certainly do that, I recommend using the python-Binded
>> datasketches library. It will be more robust, faster, and bug free than my
>> code :)
>>
>> On Tue, May 5, 2020 at 14:11 Michael Himes <mh...@knights.ucf.edu>
>> wrote:
>>
>> Hi Edo,
>>
>> I'm currently working on a Python package for
>> machine-learning-accelerated exoplanet modeling.  It is free and open
>> source (see here if you're curious https://github.com/exosports/HOMER),
>> and it's meant purely for reproducible academic research.
>>
>> I'm adding some new features to the software, and one of them requires
>> computing quantiles for a data set that cannot fit into memory.  After
>> searching around for different methods to do this, your KLL method seemed
>> to be a good option in terms of speed and space requirements.
>>
>> Rather than reinvent the wheel and code my own implementation of the
>> method from scratch, I was wondering if you'd be willing to allow me to use
>> your code?  I don't see a license, so I wanted to make sure you're okay
>> with this.  I could implement it as a submodule within my repo, or I could
>> only include the kll.py file and add some additional comments pointing to
>> your repo and such, whichever you prefer.
>>
>> Best,
>> Michael
>>
>> --
>> From my cell phone.
>>
>>

Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Posted by Jon Malkin <jo...@gmail.com>.
I tried comparing the performance of the existing floats sketch vs the new
thing with a single dimension. And then I made a second update method that
handles a single item rather than creating an array of length 1 each time.
Otherwise, the scripts were as identical as possible. I fed in 2^25
gaussian-distributed values and queried for the median to force some
computation on the sketch. I think get_quantile(0.5) vs
get_quantiles(0.5)[0][0] was the only difference.

Existing kll_floats_sketch: 31s
kll_floatarray_sketches: 123s
with single-item update: 80s

Same test in C++: 1.7s  (I can get it to 1.4s, but that's using a worse RNG
so this seemed more fair)

I didn't try batching updates, even though in theory
the new object can support that. This was more a test to see the
performance impact of using it for all kll sketches.

At some level, if you're already ok taking the speed hit for python vs C++
then maybe it doesn't matter. But >2x still seems significant to me.

  jon
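
For reference, a rough C++ harness along the lines of that test might look like
the sketch below (illustrative only, not the actual script; the seed and output
format are arbitrary):

#include <chrono>
#include <cstddef>
#include <iostream>
#include <random>
#include "kll_sketch.hpp"

int main() {
  std::mt19937_64 rng(1);                     // fixed seed for reproducibility
  std::normal_distribution<float> gauss(0.0f, 1.0f);
  datasketches::kll_sketch<float> sk;

  const auto start = std::chrono::steady_clock::now();
  for (size_t i = 0; i < (size_t{1} << 25); ++i) sk.update(gauss(rng));
  const float median = sk.get_quantile(0.5);  // query to force some computation
  const auto stop = std::chrono::steady_clock::now();

  std::cout << "median estimate: " << median << "\n"
            << "elapsed: " << std::chrono::duration<double>(stop - start).count()
            << " s" << std::endl;
  return 0;
}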

On Thu, May 14, 2020 at 6:54 PM Michael Himes <mh...@knights.ucf.edu>
wrote:

> Great, I'll be submitting the pull request shortly.  The codebase I'm
> working with doesn't have any of the changes made in the past week or so;
> hopefully that isn't too much of a hassle to merge.
>
> As an aside, my employer encourages us to contribute code to libraries
> like this, so I'm happy to work on additional features for the Python
> interface as needed.
>
> Michael
> ------------------------------
> *From:* Jon Malkin <jo...@gmail.com>
> *Sent:* Thursday, May 14, 2020 6:56 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> We've been polishing things up for a release, so that was one of several
> things that we fixed over the last several days. Thank you for finding it!
>
> Anyway, if you're generally happy with the state of things (and are
> allowed to under any employment terms), I'd encourage you to create a pull
> request to merge your changes into the main repo. It doesn't need to be
> perfect as we can always make changes as part of the PR review or
> post-merge.
>
> Thanks,
>   jon
>
>
> On Mon, May 11, 2020 at 2:25 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Thanks for taking a look, Jon.
>
> I pushed an update that addresses 2 & 4.
>
> #3 is actually something I had a question about. I've tested passing
> numpy.nan into the update function, and it doesn't appear to break anything
> (min, max, etc all still work correctly).  However, the reported number of
> items per sketch counts the nan entries.  Is this the expected behavior, or
> should the get_n() method return a number that does not count the nans it
> has seen?  I expected the latter, so I'm worried that numpy's nan is being
> treated differently.
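
One quick way to pin that down is to check the C++ sketch directly, outside of
numpy; a tiny illustrative test:

#include <iostream>
#include <limits>
#include "kll_sketch.hpp"

int main() {
  datasketches::kll_sketch<float> sk;
  sk.update(1.0f);
  sk.update(std::numeric_limits<float>::quiet_NaN());
  // If NaNs are ignored by the sketch, get_n() should report 1 here.
  std::cout << "n = " << sk.get_n() << std::endl;
  return 0;
}

If the C++ sketch skips the NaN but the numpy path counts it, that would point
at how numpy's nan reaches the wrapper rather than at the sketch itself.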
>
> Michael
> ------------------------------
> *From:* Jon Malkin <jo...@gmail.com>
> *Sent:* Monday, May 11, 2020 4:32 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> I didn't look in super close detail, but the code overall looks pretty
> good. Comments are below.
>
> Note that not all of these necessarily need changes or replies. I'm just
> trying to document things we'll want to think about for keeping the library
> general-purpose (and we can always make changes after merging, of course).
>
> 1. I worry the name kll_sketches is confusingly similar to kll_sketch.
> Maybe vector_kll_sketches? But if there's a way to extend KLL in the future
> to operate on an entire vector at a time (vs treating each dimension
> independently) that'd become confusing. I think an inherently vectorized
> version would be a very different beast, but I always worry I'm not being
> imaginative enough. If merging into the Apache codebase, I'd probably wait
> to see what the file looks like with the renaming before a final decision
> on moving to its own file.
>
> 2. What happens if the input to update() has >2 dimensions? If that'd be
> invalid, we should explicitly check and complain. If it'll Do The Right
> Thing by operating on the first 2 dimensions (meaning correct indices)
> that's fine, but otherwise should probably complain.
>
> 3. Can this handle sparse input vectors? Not sure how important that is in
> general, even if your project doesn't require it. kll_sketch will ignore
> NaNs, so those appearing would mean the number of items per sketch can
> already differ.
>
> 4. I'd probably eat the very slightly increased space and go with 32 bits
> for the number of dimensions (aka number of sketches). If trying to look at
> a distribution of values for some machine learning application, it'd be
> easy to overflow 65k dimensions for some tasks.
>
> 5. I imagine you've realized that it's easiest to do unit tests from
> python in this case. That's another advantage of having this live in the
> wrapper.
>
> 6. Finally, that assert issue is already obsolete :). Asserts were
> converted to if/throw exceptions late last week. It'll be flagged as a
> conflict in merging, so no worries for now.
>
> Looking good at this point. And as I said, not all of these need changes
> or comments from you.
>
>   jon
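
On item 2 above, the guard in the wrapper's update() can be very small; a
minimal sketch (illustrative, assuming the pybind11-based wrapper discussed in
this thread) might be:

#include <stdexcept>
#include <pybind11/numpy.h>

namespace py = pybind11;

// Accept 1-D or 2-D numpy input only; complain about anything else.
void check_update_dims(const py::array& items) {
  if (items.ndim() < 1 || items.ndim() > 2)
    throw std::invalid_argument("update() expects a 1- or 2-dimensional array");
}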
>
> On Mon, May 11, 2020 at 7:09 AM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Understood, I went ahead and moved the new class to the kll_wrapper.cpp
> file -- I'll leave it to you to decide if it's better as its own file.
>
> Also, while gcc 7.4.0 compiles the code without issue, using gcc 7.5.0
> throws errors regarding the assert calls in kll_sketch_impl.hpp.  I added
> an include of assert.h there and then it compiled without issue.  It's
> possible that other compilers will also complain about that, so maybe this
> is a good update to the main branch.
>
> Michael
> ------------------------------
> *From:* Jon Malkin <jo...@gmail.com>
> *Sent:* Sunday, May 10, 2020 10:47 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> My only comment without having looked at actual code is that the new class
> would be more appropriate in the python wrapper. Maybe even drop it in as
> its own file, as that would decrease recompile time a bit when debugging
> (that's pybind's suggestion, anyway). Probably not a huge difference with
> how light these wrappers are.
>
> If this is something that becomes widely used, to where we look at pushing
> it into the base library, we'd look at whether we could share any data
> across sketches. But we're far from that point currently. It'd be nice to
> get to the point where we need to consider that.
>
>   jon
>
> On Sun, May 10, 2020, 7:33 PM leerho <le...@gmail.com> wrote:
>
> Michael,  this has been a great interchange and certainly will allow us to
> move forward more quickly.
>
> Thank you for working on this on a Mother's Day Sunday!
>
> I'm sure Alex and Jon may have more questions, when they get a chance to
> look at it starting tomorrow.
>
> Cheers, and be safe and well!
>
> Lee.
>
> On Sun, May 10, 2020 at 6:25 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Re: testing, so far I've just done glorified unit tests for uniform and
> normal distributions of varying sizes.  I plan to do some timing tests vs
> the existing single-sketch Python class to see how it compares for 1, 10,
> and 100 streams.
>
> 1. That makes sense.  One option to allow full Numpy compatibility but
> without requiring a Python user to use Numpy would be to return everything
> as lists, rather than Numpy arrays.  Numpy users could then convert those
> lists into arrays, and non-Numpy users would be unaffected (aside from
> needing the pybind11/numpy.h header).  Alternatively, some flag could be
> set when instantiating the object that would control whether things are
> returned as lists or arrays, though this still requires the numpy.h header
> file.
>
> 2. I didn't change the kll_sketch code; I only defined a new (wrapper)
> class called kll_sketches, which spawns a user-specified number of
> sketches.  Each of those sketches is a kll_sketch object and uses all of
> the existing code for that.  For fast execution in Python, the parallel
> sketches must be spawned in C++, but the existing Python object could only
> spawn a single sketch since it wraps the kll_sketch class.  Perhaps the
> kll_sketches class would be better placed in the python/src/kll_wrapper.cpp
> file?  I suppose you wouldn't need this class if you weren't using Python.
>
> 3. Yes, SerDe is very straight-forward here.  I've marked some stuff as
> todo's, and that is one of them -- the plan is to do like you described and
> call the relevant kll_sketch method on each of the sketches and return that
> to Python in a sensible format.  For deserialization, it would just iterate
> through them and load them into the kll_sketches object.  I don't require
> it for my project, so I didn't bother to wrap that yet -- I'll take a look
> sometime this week after I finish my work for the day, shouldn't take long
> to do.
>
> 4. That makes sense.  Does using Numpy complicate that at all?  My thought
> is that since under the hood everything is using the existing kll_sketch
> class, it would have full compatibility with the rest of the library (once
> SerDe is added in).
>
> Michael
> ------------------------------
> *From:* leerho <le...@gmail.com>
> *Sent:* Sunday, May 10, 2020 8:42 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Thanks for the link to your code.  My colleagues, Jon and Alex, will take
> a closer look this next week.  They wrote this code so they are much closer
> to it than I.
>
> What you have done so far makes sense for you as you want to get this
> working in the NumPy environment as quickly as possible.  As soon as we
> start thinking about incorporating this into our library other concerns
> become important.
>
> 1. Adding API calls is the recommended way to add functionality (like
> NumPy) to a library.  We cannot change API calls in a way that is only
> useful with NumPy, because it would seriously impact other users of the
> library that don't need NumPy.  If both sets of calls cannot simultaneously
> exist in the same sketch API, then we need to consider other alternatives.
>
> 2.  Based on our previous discussions, I didn't envision that you would
> have to change the kll_sketch code itself other than perhaps a "wrapper"
> class that enables vectorized input to a vector of sketches and a
> vectorized get result that creates a vector result from a vector of
> sketches.  This would isolate the changes you need for NumPy from the
> sketch itself.  This is also much easier to support, maintain and debug.
>
> 3. If you don't change the internals of the sketch then SerDe becomes
> pretty straightforward. I don't know if you need a single serialization
> that represents a full vector of sketches,  but if you do, then I would
> just iterate over the individual serdes and figure out how to package it.
> I really don't think you want to have to rewrite this low-level stuff.
>
> 4. Binary compatibility is critically important for us and I think will be
> important for you as well.  There are two dimensions of binary
> compatibility: history and language.  This means that a kll sketch
> serialized from Java can be successfully read by C++ and vice versa.
> Similarly, a kll sketch serialized today will be able to be read many years
> from now.  Another aspect of this is being able to collect, say,
> 100 sketches that were not created using the NumPy version, and being able
> to put them together in a NumPy vector, and vice versa.
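
On point 3, the per-sketch SerDe calls could be packaged along these lines; a
rough sketch (the length-prefixed framing is purely illustrative, not a
DataSketches wire format):

#include <cstdint>
#include <vector>
#include "kll_sketch.hpp"

using datasketches::kll_sketch;

// Concatenate the individual sketch serializations with a simple length prefix.
std::vector<uint8_t> serialize_all(const std::vector<kll_sketch<float>>& sketches) {
  std::vector<uint8_t> out;
  for (const auto& sk : sketches) {
    const auto bytes = sk.serialize();           // library-provided per-sketch SerDe
    const uint64_t len = bytes.size();
    const auto* p = reinterpret_cast<const uint8_t*>(&len);
    out.insert(out.end(), p, p + sizeof(len));   // length prefix, host byte order
    out.insert(out.end(), bytes.begin(), bytes.end());
  }
  return out;
}

Deserialization would walk the buffer the same way, handing each slice to
kll_sketch<float>::deserialize().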
>
> I hope all of this makes sense to you.
>
> Cheers,
>
> Lee.
>
>
>
> On Sun, May 10, 2020 at 4:21 PM leerho <le...@gmail.com> wrote:
>
> Michael,
> This is great!  What testing have you been able to do so far?
>
>
> On Sun, May 10, 2020 at 3:31 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Lee,
>
> Thanks for all of that information, it's quite helpful to get a better
> understanding of things.
>
> I've put the code on Github if you'd like to take a look:
> https://github.com/mdhimes/incubator-datasketches-cpp
>
> Changes are
> - new class in kll/include/kll_sketch.hpp, w/ associated constructor in
> kll/include/kll_sketch_impl.hpp.  This class spawns a specified number of
> sketches.
> - new Python interface functions in python/src/kll_wrapper.cpp
>
> The only new dependency introduced is the pybind11/numpy.h header file.
> The new Numpy-compatible Python classes retain identical functionality to
> the existing classes (with minor changes to method names, e.g.,
> get_min_value --> get_min_values), except that I have not yet implemented
> merging or (de)serialization.  These would be straight-forward to
> implement, if needed.
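
As an illustration of the vector-valued getters mentioned above, something like
get_min_values() is just a loop over the per-dimension sketches (sketched here
as a free function; the real method lives inside the wrapper class):

#include <vector>
#include "kll_sketch.hpp"

// One minimum per dimension, gathered from the underlying kll_sketch objects.
std::vector<float> get_min_values(const std::vector<datasketches::kll_sketch<float>>& sketches) {
  std::vector<float> mins;
  mins.reserve(sketches.size());
  for (const auto& sk : sketches) mins.push_back(sk.get_min_value());
  return mins;
}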
>
> Re: characterization tests, I'll take a look at those tests you linked to
> and see about running them, time and compute resources permitting.
>
> Michael
> ------------------------------
> *From:* leerho <le...@gmail.com>
> *Sent:* Sunday, May 10, 2020 5:32 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Michael,
>
> Is there a place on GitHub somewhere where I could look at your code so
> far?  The reason I ask, is before you do a PR, we would like to determine
> where a contribution such as this should be placed.
>
> Our library is split up among different repositories, determined by
> language and dependencies.  This keeps the user downloads smaller and more
> focused.   We have two library repos for the core sketch algorithms, one
> for Java and one for C++/Python, where the dependencies are very lean,
> which simplifies integration into other systems.  We have separate repos
> for adaptors, which depend on one of the core repos. On the Java side, we
> have separate repos for adaptors for Apache Hive and Apache Pig, as the
> dependencies for each of these are quite large.  For C++, we have a
> dedicated repo for the adaptors for PostgreSQL.
>
> Some of our adaptors are hosted with the target system.  For example, our
> Druid adaptors were contributed directly into Apache Druid.
>
> I assume your code has dependencies on Python, NumPy and DataSketches-cpp.
> It is not clear to me at the moment whether we should create a separate
> repo for this or have a separate group of directories in our cpp repo.
>
> ****
> We have a separate repo for our characterization code, which is not
> formally "released" as an Apache release.  It exists because we want others
> to be able to reproduce (or challenge) our claims of speed performance or
> accuracy.  It is the one repo where we have all languages and many
> different dependencies.  The coding style is not as rigorous or as well
> documented as our repos that do have formal releases.
>
> Characterization testing is distinctly different from Unit Tests, which
> basically check all the main code paths and make sure that the program
> works as it should.  The key metric is code coverage, and Unit Tests should
> be fast as they are run on every check-in of new code.  Characterization is
> also different from Integration Testing, which is testing how well the code
> works when integrated into larger systems.
>
> Characterization tests are unique to our kind of library. Because our
> algorithms are probabilistic in nature, in order to verify accuracy or
> speed performance we need to run many thousands of trials to eliminate
> statistical noise in the results.  And when the data is large, this can
> take a long time.  You can peruse our website for many examples as all the
> plots result from various characterization studies.  What appears on the
> website is but a small fraction of all the testing we have done.
>
> There are no "standard" tests as every sketch is different so we have to
> decide what is important to measure for a particular sketch, but the basic
> groups are *speed* and *accuracy*.
>
> For speed there are many possible measurements, but the basic ones are
> update speed, merge speed, Serialization / Deserialization speed, get
> estimate or get result speeds.
>
> For accuracy we want to validate that the sketch is performing within the
> bounds of the theoretical error distribution.  We want to measure this
> accuracy in the context of a stand-alone, purely streaming sketch and also
> in the context of merging many sketches together.
>
> We also try to do these same tests comparing the results against other
> alternatives users might have.  We have performed these same
> characterizations on other publicly available sketches as well as against
> traditional, brute-force approaches to solving the same problem.
>
> For the solution you have developed, we would depend on you to decide what
> properties would be most important to characterize for users of this
> solution.  It should be very similar to what you would write in a paper
> describing this solution;  you want to convince the reader that this is
> very useful and why.
>
> Since the first sketch you have leveraged is the KLL quantiles sketch, I
> would think you would want some characterizations similar to what we did
> for our studies
> <https://datasketches.apache.org/docs/Quantiles/KLLSketch.html>
> comparing our older quantiles sketch and the KLL sketch.
>
> ****
> For the Java characterization tests, we have "standardized" on having
> small configuration files which define the key parameters of the test.
> These are simple text files
> <https://github.com/apache/incubator-datasketches-characterization/tree/master/src/main/resources>
> of key-value pairs.  We don't have any centralized definition of these
> pairs, just that they are human readable and intelligible.  They are
> different for each type of sketch.
>
> For the C++ tests, we don't have a collection of config files yet (this is
> one of our TODOs), but the same kind of parameters are set in the code
> itself.
>
> We will likely want to set up a separate directory for your
> characterization tests.
>
> I hope you find this helpful.
>
> Cheers,
>
> Lee.
>
> On Sun, May 10, 2020 at 10:05 AM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> The code is in a good state now.  It can take individual values, lists, or
> Numpy arrays as input, and it returns back Numpy arrays.  There are some
> additional features, like being able to specify which sketches the user
> wants to, e.g., get quantiles for.
>
> But, I have only done minor testing with uniform and normal
> distributions.  I'd like to put it through more extensive testing (and some
> documentation) before releasing it, and it sounds like your
> characterization tests are the way to go -- it's not science if it's not
> reproducible!  Is there a standard set of tests for this purpose?  If not,
> are there standard tests that have been used for the existing codebase?
>
> Michael
> ------------------------------
> *From:* leerho <le...@gmail.com>
> *Sent:* Saturday, May 9, 2020 7:21 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> This is great.  The first step is to get your project working!  Once you
> think you are ready, it would be really useful if you could do some
> characterization testing in the NumPy environment. Characterization tests
> are what we run to fully understand how a sketch performs over a range of
> parameters and using thousands to millions of trials.  You can see some of
> the accuracy and speed performance plots of various sketches on our
> website.  Sometimes these can take hours to run.  We typically use
> synthetic data to drive our characterization tests to make them
> reproducible.
>
> Real data can also be used and one comparison test I would recommend is
> comparing how long it takes to get approximate results using sketches
> versus how long it would take to get exact results using brute force
> methods.  The bigger the data set, the better :)
>
> We don't have much experience with NumPy so this will be a new environment
> for us.  But before you get too deep into this please get us involved.  We
> have been characterizing these streaming algorithms for a number of years,
> and would like to help you.
>
> Cheers,
>
> Lee.
>
> On Sat, May 9, 2020 at 2:18 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> I'm not quite sure what being a committer entails, but yeah I'm happy to
> contribute.  I can't commit a lot of time to working on it, but with how
> things went for KLL I don't think it will take a lot of time for the other
> sketches if they are formatted in a similar manner.  Getting this library
> integrated into numpy/scipy would be awesome, I'm sure I could get some
> others in my field to begin using it.
>
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Saturday, May 9, 2020 5:06 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>; dev@datasketches.apache.org
> <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> This is just awesome!   Would you be interested in becoming a committer on
> our project?  It is not automatic, but we could work with you to bring you
> up to speed on the other sketches in the library.  If you could help us
> integrate DataSketches into NumPy and possibly SciPy (not sure if this is
> necessary) it would be a very significant contribution and we would
> definitely want you to be part of our community!
>
> Thanks,
>
> Lee.
>
> On Sat, May 9, 2020 at 1:41 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Lee,
>
> Thanks for the notice, I went ahead and subscribed to the list.
>
> As for Jon's email, this is actually what I have currently implemented!
> Once I finish ironing out a couple improvements, I'm going to move some
> code around to follow the existing coding style, put it on Github, and
> submit a pull request.
>
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Saturday, May 9, 2020 4:22 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>
> *Subject:* Fwd: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Hi Michael,
> I don't think you saw this email as I doubt you are subscribed to our
> dev@datasketches.apache.org email list.
>
> We would like to have you as part of our larger community, as others might
> also have suggestions on how to move your project forward.
> You can subscribe by sending an empty email to
> dev-subscribe@datasketches.apache.org.
>
> Lee.
>
> ---------- Forwarded message ---------
> From: *Jon Malkin* <jo...@gmail.com>
> Date: Thu, May 7, 2020 at 4:11 PM
> Subject: Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
> To: <de...@datasketches.apache.org>
> Cc: Lee Rhodes <lr...@verizonmedia.com>, Edo Liberty <
> edo.liberty@gmail.com>, edo@edoliberty.com <ed...@edoliberty.com>
>
>
> We're using pybind11 to get a C++ interface with python (vs raw C). The
> wrappers themselves are quite thin, but they do have examples of calling
> functions defined in the wrapper as opposed to only the sketch object.
>
> I believe the easiest way to do this will be to define a pretty simple C++
> object and create a pybind wrapper for it.  That object would contain a
> std::vector<kll_sketch>.  Then you'd define an update method for your
> custom object that iterates through a numpy array and calls update() on the
> appropriate sketch. You'd also want to define something similar for
> get_quantile() or whatever other methods you need that iterates through
> that vector of sketches and returns the result in a numpy array.
>
> That's a pretty lightweight object. And then you'd use a similar thin
> pybind wrapper around it to make it play nicely with python. Since our C++
> library is just templates, you'd end up with a free-standing library, with
> no requirement that the base datasketches library be involved.
>
>   jon
>
> On Thu, May 7, 2020 at 1:08 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> I would be happy to share whatever I come up with (if anything).  The lack
> of a Numpy/Scipy implementation is what led me to the DataSketches library,
> it would be very useful to myself and others if it were a part of
> Numpy/Scipy.
>
> For what it's worth, passing in a Numpy array and manipulating it from the
> C++ side is quite easy.  On the other hand, figuring out how to spawn m
> sketches and pass the values along to that looks like it'll be more
> challenging, there is a lot of code here and it'll take some time for me to
> familiarize myself with it.
>
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Thursday, May 7, 2020 12:00 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>
> *Cc:* Edo Liberty <ed...@gmail.com>; dev@datasketches.apache.org <
> dev@datasketches.apache.org>; edo@edoliberty.com <ed...@edoliberty.com>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> If you do figure out how to do this, it would be great if you could share
> it with us.  We would like to extend  it to other sketches and submit it as
> an added functionality to NumPy.  I have been looking at the NumPy and
> SciPy libraries and have not found anything close to what we have.
>
> Lee.
>
>
> On Thu, May 7, 2020 at 7:08 AM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Lee, Jon,
>
> Thanks for the information.  I tried to vectorize things this morning and
> ran into that exact problem -- since the offsets can differ, it leads to
> slices of different lengths, which wouldn't be possible to store as a
> single Numpy array.
>
> Lee, your understanding of my problem is spot on.  n vectors of size m,
> where all m elements of each vector are a float (no NaNs or missing
> values).  I am interested in quantiles at rank r for each of the m
> streams.  Only 1 sketch will operate simultaneously, saving/loading the
> sketch is not required (though it would be a nice feature), and sketches
> would not need to be merged (no serialization/deserialization).
>
> Not surprisingly, it looks like your original suggestion of handling this
> on the C++ side is the way to go.  Once I have time to dive into the code,
> my plan is to write something that implements what you described in the
> earlier email.
>
> Thanks,
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Wednesday, May 6, 2020 10:43 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>
> *Cc:* dev@datasketches.apache.org <de...@datasketches.apache.org>; Edo
> Liberty <ed...@gmail.com>; edo@edoliberty.com <ed...@edoliberty.com>
>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Michael,
>
> One of my colleagues, Jon Malkin, pointed out that the vector-KLL will not
> work for another reason and that is for each dimension, choosing whether to
> delete the odd or even values in the compactor must be random and
> independent of the other dimensions.  Otherwise you might get unwanted
> correlation effects between the dimensions.
>
> This is another argument that you should have independent compactors for
> each dimension.  So you might as well stick with individual sketches for
> each dimension.
>
> Lee.
>
> On Wed, May 6, 2020 at 4:39 PM Lee Rhodes <lr...@verizonmedia.com>
> wrote:
>
> Michael,
>
> Allow me to back up for a moment to make sure I understand your problem.
>
> You have a large number of large vectors of the form *V_n = {x_i}:*  *n*
> vectors of size *m, *where *x* is a *number* and *x_i* is the *i*th
> element, or equivalently, the *i*th dimension.
>
> Assumptions:
>
>    - All vectors, *V*, are of the same size *m.*
>    - All elements, *x_i*, are valid numbers of the same type. No missing
>    values, and if you are using *floats*, this means no *NaN*s.
>
> In aggregate, the *n* vectors represent *m* *independent* distributions
> of values.
>
> Your task is to be able to obtain *m* quantiles at rank *r* in a single
> query.
>
> ****
> To do this, using your idea, would require vectorization of the entire
> sketch and not just the compactors.  The inputs are vectors, the result of
> operations such as getQuantile(r), getQuantileUpperBound(r),
> getQuantileLowerBound(r), are also vectors.
>
> This sketch will be a large data structure, which leads to more questions
> ...
>
>    - Do you anticipate having many of these vectorized sketches operating
>    simultaneously?
>    - Is there any requirement to store and later retrieve this sketch?
>    - Or, the nearly equivalent question: Do you require merging of these
>    sketches (across clusters, for example)?  Which also means serialization
>    and deserialization.
>
> I am concerned that this vector-quantiles sketch would be limited in the
> sense that it may not be as widely applicable as it could be.
>
> Our experience with real data is that it is ugly with missing values, NaN,
> nulls, etc.  Which means we would not be able to vectorize the compactor.
> Each dimension *i* would need a separate independent compactor because
> the compaction times will vary depending on missing values or NaNs in the
> data.
>
> Spacewise, I don't think having separate independent sketches for each
> dimension would be much smaller than vectorizing the entire sketch, because
> the internals of the existing sketch are already quite space efficient
> leveraging compact arrays, etc.
>
> As a first step I would favor figuring out how to access the NumPy data
> structure on the C++ side, having individual sketches for each
> dimension, and doing the iterations updating the sketches in C++.   It also
> has the advantage of leveraging code that exists and it would automatically
> be able to leverage any improvements to the sketch code over time.  In
> addition, it could be a prototype of how to integrate other sketches into
> the NumPy ecosystem.
>
> A fully vectorized sketch would be a separate implementation and would not
> be able to take advantage of these points.
>
> Lee.
>
> On Wed, May 6, 2020 at 2:47 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Lee,
>
> I don't think there is a problem with the DataSketches library, just that
> it doesn't support what I am trying to do -- looking in the documentation,
> it only supports streams of ints or floats, and those situations work fine
> for me.  Here's what I did:
> - began with the KLL test .py file:
> https://github.com/apache/incubator-datasketches-cpp/blob/master/python/tests/kll_test.py
> - replaced line 30 with kll.update(np.ones(10) * randn())  to have a Numpy
> array of 10 identical values.
> - ran the code
>
> This leads to the following error, as expected:
> TypeError: update(): incompatible function arguments. The following
> argument types are supported:
>     1. (self: datasketches.kll_floats_sketch, item: float) -> None
>
> Invoked with: <datasketches.kll_floats_sketch object at 0x7f1e128989d0>,
> array([-1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424,
>        -1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424])
>
> It's not coded to support Numpy arrays, therefore it complains.  What I
> would ideally like to have happen in this scenario is it would treat each
> element in the array as a separate stream.  Then, later when getting a
> given quantile, it would give 10 values, one for each stream.  I don't see
> an easy approach to implementing this on the Python side besides a very
> slow iterative approach, and admittedly my C++ is quite rusty so I haven't
> looked into the codebase to see how I might modify things there to support
> this functionality.
>
> Re: the streaming-quantiles code being easily modified, I believe the only
> necessary changes would be changing the Compactor class to be a subclass of
> numpy.ndarray, rather than list, and implementing methods for the
> list-specific methods that are used, like .append().  Then, it isn't
> necessary to loop over the streams since we can make use of Numpy's
> broadcasting, which will handle the looping in its C++ code, as you
> mentioned.  I'll work on this and see if it really is as straight-forward
> as it seems.
>
> If you have any advice on how to use DataSketches for my problem, I'm
> certainly open to that.
>
> Thanks,
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Wednesday, May 6, 2020 4:37 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>; dev@datasketches.apache.org
> <de...@datasketches.apache.org>
> *Cc:* Edo Liberty <ed...@gmail.com>; edo@edoliberty.com <
> edo@edoliberty.com>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Michael,
>
> Thank you for considering the DataSketches library.   I am adding this
> thread to our dev@datasketches.apache.org so that our whole team can
> contribute to finding a solution for you.
>
> WRT the error you experienced, please help us help you by sharing with us
> what the exact error was.
>
> We are about to release a major upgrade to the DataSketches C++/Python
> product in the next few weeks.  We have fixed a number of stability issues
> and bugs, which may solve the problem.  Nonetheless, we want to work with
> you to get your problem solved.
>
> Updating 1e5 sketches in a system is not a problem in Java or C++.   We
> have real-time systems today that generate and process over 1e9 sketches
> every day.  Unfortunately our experience tells us that looping in Python
> code will be 10 to 100 times slower than Java or C++.  This is because the
> code would have to switch from Python to C++ for every vector element.
>
> By comparison, the streaming-quantiles code could be easily modified to
> use Numpy arrays and operate on vectors.
>
>
> I would like to understand more about what you have in mind that would be
> "easily modified".
>
> NumPy achieves its speed performance by doing all of the matrix operations
> in pre-compiled C++ code.  To achieve best performance, we would want to
> read and loop through the NumPy data structure on the C++ side leveraging
> the C++ DataSketches library directly.  I am not sure what would be
> involved to actually accomplish that.
>
> But first we need to get your Python + NumPy code working correctly with
> our library so we can find out what its actual performance is.
>
> Cheers,
>
> Lee.
>
>
>
>
>
> On Wed, May 6, 2020 at 12:10 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Edo, Lee,
>
> Thanks for the prompt response.  I looked at the datasketches library, and
> while it seems to have a lot more features, it looks like it'll be a lot
> more difficult to get it to work for my desired use case.
>
> My problem is that I need quantiles for each element of a vector (length
> on the order of 1e4 -- 1e5), for some finite stream of vectors (on the
> order of 1e6 -- 1e8).  I tried using datasketches's KLL with Numpy arrays,
> but it throws an error, so it doesn't seem like datasketches handles this
> situation currently.
>
> To use datasketches, I think I would need to instantiate 1 object per
> vector element, and I suspect this will slow things down considerably due
> to iterating over the objects when each vector is processed.  By
> comparison, the streaming-quantiles code could be easily modified to use
> Numpy arrays and operate on vectors.  I ran a few unit tests on both codes
> and found equivalent behavior, as expected.
>
> Do you have any recommendation(s) for this situation?  Are there known
> limitations of the streaming-quantiles code that would cause issues for my
> use case?  Are the other methods offered in datasketches 'better' than the
> KLL implemented in streaming-quantiles?  I'm quite out of my area of
> expertise, so I appreciate any advice you can offer, and I will of course
> acknowledge it in the publication.
>
> Best,
> Michael
>
> ------------------------------
> *From:* Edo Liberty <ed...@gmail.com>
> *Sent:* Tuesday, May 5, 2020 8:09 PM
> *To:* Lee Rhodes <lr...@verizonmedia.com>; Michael Himes <
> mhimes@knights.ucf.edu>
> *Cc:* edo@edoliberty.com <ed...@edoliberty.com>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> +Lee
>
> Hi Michael, Thanks for reaching out.
> While you can certainly do that, I recommend using the python-Binded
> datasketches library. It will be more robust, faster, and bug free than my
> code :)
>
> On Tue, May 5, 2020 at 14:11 Michael Himes <mh...@knights.ucf.edu> wrote:
>
> Hi Edo,
>
> I'm currently working on a Python package for machine-learning-accelerated
> exoplanet modeling.  It is free and open source (see here if you're curious
> https://github.com/exosports/HOMER),
> and it's meant purely for reproducible academic research.
>
> I'm adding some new features to the software, and one of them requires
> computing quantiles for a data set that cannot fit into memory.  After
> searching around for different methods to do this, your KLL method seemed
> to be a good option in terms of speed and space requirements.
>
> Rather than reinvent the wheel and code my own implementation of the
> method from scratch, I was wondering if you'd be willing to allow me to use
> your code?  I don't see a license, so I wanted to make sure you're okay
> with this.  I could implement it as a submodule within my repo, or I could
> only include the kll.py file and add some additional comments pointing to
> your repo and such, whichever you prefer.
>
> Best,
> Michael
>
> --
> From my cell phone.
>
>

Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Posted by Michael Himes <mh...@knights.ucf.edu>.
Great, I'll be submitting the pull request shortly.  The codebase I'm working with doesn't have any of the changes made in the past week or so; hopefully that isn't too much of a hassle to merge.

As an aside, my employer encourages us to contribute code to libraries like this, so I'm happy to work on additional features for the Python interface as needed.

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>
Sent: Thursday, May 14, 2020 6:56 PM
To: dev@datasketches.apache.org <de...@datasketches.apache.org>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

We've been polishing things up for a release, so that was one of several things that we fixed over the last several days. Thank you for finding it!

Anyway, if you're generally happy with the state of things (and are allowed to under any employment terms), I'd encourage you to create a pull request to merge your changes into the main repo. It doesn't need to be perfect as we can always make changes as part of the PR review or post-merge.

Thanks,
  jon


On Mon, May 11, 2020 at 2:25 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Thanks for taking a look, Jon.

I pushed an update that addresses 2 & 4.

#3 is actually something I had a question about. I've tested passing numpy.nan into the update function, and it doesn't appear to break anything (min, max, etc all still work correctly).  However, the reported number of items per sketch counts the nan entries.  Is this the expected behavior, or should the get_n() method return a number that does not count the nans it has seen?  I expected the latter, so I'm worried that numpy's nan is being treated differently.

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Monday, May 11, 2020 4:32 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

I didn't look in super close detail, but the code overall looks pretty good. Comments are below.

Note that not all of these necessarily need changes or replies. I'm just trying to document things we'll want to think about for keeping the library general-purpose (and we can always make changes after merging, of course).

1. I worry the name kll_sketches is confusingly similar to kll_sketch. Maybe vector_kll_sketches? But if there's a way to extend KLL in the future to operate on an entire vector at a time (vs treating each dimension independently) that'd become confusing. I think an inherently vectorized version would be a very different beast, but I always worry I'm not being imaginative enough. If merging into the Apache codebase, I'd probably wait to see what the file looks like with the renaming before a final decision on moving to its own file.

2. What happens if the input to update() has >2 dimensions? If that'd be invalid, we should explicitly check and complain. If it'll Do The Right Thing by operating on the first 2 dimensions (meaning correct indices) that's fine, but otherwise should probably complain.

3. Can this handle sparse input vectors? Not sure how important that is in general, even if your project doesn't require it. kll_sketch will ignore NaNs, so those appearing would mean the number of items per sketch can already differ.

4. I'd probably eat the very slightly increased space and go with 32 bits for the number of dimensions (aka number of sketches). If trying to look at a distribution of values for some machine learning application, it'd be easy to overflow 65k dimensions for some tasks.

5. I imagine you've realized that it's easiest to do unit tests from python in this case. That's another advantage of having this live in the wrapper.

6. Finally, that assert issue is already obsolete :). Asserts were converted to if/throw exceptions late last week. It'll be flagged as a conflict in merging, so no worries for now.

Looking good at this point. And as I said, not all of these need changes or comments from you.

  jon

On Mon, May 11, 2020 at 7:09 AM Michael Himes <mh...@knights.ucf.edu>> wrote:
Understood, I went ahead and moved the new class to the kll_wrapper.cpp file -- I'll leave it to you to decide if it's better as its own file.

Also, while gcc 7.4.0 compiles the code without issue, using gcc 7.5.0 throws errors regarding the assert calls in kll_sketch_impl.hpp.  I added an include of assert.h there and then it compiled without issue.  It's possible that other compilers will also complain about that, so maybe this is a good update to the main branch.

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Sunday, May 10, 2020 10:47 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

My only comment without having looked at actual code is that the new class would be more appropriate in the python wrapper. Maybe even drop it in as its own file, as that would decrease recompile time a bit when debugging (that's pybind's suggestion, anyway). Probably not a huge difference with how light these wrappers are.

If this is something that becomes widely used, to where we look at pushing it into the base library, we'd look at whether we could share any data across sketches. But we're far from that point currently. It'd be nice to eventually need to consider that.

  jon

On Sun, May 10, 2020, 7:33 PM leerho <le...@gmail.com>> wrote:
Michael,  this has been a great interchange and certainly will allow us to move forward more quickly.

Thank you for working on this on a Mother's Day Sunday!

I'm sure Alex and Jon may have more questions, when they get a chance to look at it starting tomorrow.

Cheers, and be safe and well!

Lee.

On Sun, May 10, 2020 at 6:25 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Re: testing, so far I've just done glorified unit tests for uniform and normal distributions of varying sizes.  I plan to do some timing tests vs the existing single-sketch Python class to see how it compares for 1, 10, and 100 streams.
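
For concreteness, the timing harness I have in mind is roughly the following (pure Python, looping over one kll_floats_sketch per stream; the sizes and k value are just placeholders):

import time
import numpy as np
from datasketches import kll_floats_sketch

def time_updates(num_streams, num_vectors=10000, k=200):
    # one independent sketch per stream, updated in a Python loop
    sketches = [kll_floats_sketch(k) for _ in range(num_streams)]
    data = np.random.randn(num_vectors, num_streams)
    start = time.perf_counter()
    for row in data:
        for sk, x in zip(sketches, row):
            sk.update(float(x))
    return time.perf_counter() - start

for n in (1, 10, 100):
    print(n, "streams:", round(time_updates(n), 3), "s")

Running the same loads through the new vectorized class should show whether moving the per-stream iteration into C++ pays off.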

1. That makes sense.  One option to allow full Numpy compatibility but without requiring a Python user to use Numpy would be to return everything as lists, rather than Numpy arrays.  Numpy users could then convert those lists into arrays, and non-Numpy users would be unaffected (aside from needing the pybind11/numpy.h header).  Alternatively, some flag could be set when instantiating the object that would control whether things are returned as lists or arrays, though this still requires the numpy.h header file.

2. I didn't change the kll_sketch code; I only defined a new (wrapper) class called kll_sketches, which spawns a user-specified number of sketches.  Each of those sketches is a kll_sketch object and uses all of the existing code for that.  For fast execution in Python, the parallel sketches must be spawned in C++, but the existing Python object could only spawn a single sketch since it wraps the kll_sketch class.  Perhaps the kll_sketches class would be better placed in the python/src/kll_wrapper.cpp file?  I suppose you wouldn't need this class if you weren't using Python.

3. Yes, SerDe is very straight-forward here.  I've marked some stuff as todo's, and that is one of them -- the plan is to do like you described and call the relevant kll_sketch method on each of the sketches and return that to Python in a sensible format.  For deserialization, it would just iterate through them and load them into the kll_sketches object.  I don't require it for my project, so I didn't bother to wrap that yet -- I'll take a look sometime this week after I finish my work for the day, shouldn't take long to do.
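
In pure Python, that per-sketch round trip would look roughly like the following (this assumes the wrapper exposes the sketch's serialize()/deserialize(), as the C++ class does; how the list of byte strings gets packaged is still open):

from datasketches import kll_floats_sketch

sketches = [kll_floats_sketch(200) for _ in range(3)]
for sk in sketches:
    sk.update(1.0)

# serialization: one bytes object per underlying sketch
blobs = [sk.serialize() for sk in sketches]

# deserialization: rebuild the vector of sketches in the same order
restored = [kll_floats_sketch.deserialize(b) for b in blobs]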

4. That makes sense.  Does using Numpy complicate that at all?  My thought is that since under the hood everything is using the existing kll_sketch class, it would have full compatibility with the rest of the library (once SerDe is added in).

Michael
________________________________
From: leerho <le...@gmail.com>
Sent: Sunday, May 10, 2020 8:42 PM
To: dev@datasketches.apache.org
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Thanks for the link to your code.  My colleagues, Jon and Alex, will take a closer look this next week.  They wrote this code so they are much closer to it than I.

What you have done so far makes sense for you as you want to get this working in the NumPy environment as quickly as possible.  As soon as we start thinking about incorporating this into our library other concerns become important.

1. Adding API calls is the recommended way to add functionality (like NumPy) to a library.  We cannot change API calls in a way that is only useful with NumPy, because it would seriously impact other users of the library that don't need NumPy.  If both sets of calls cannot simultaneously exist in the same sketch API, then we need to consider other alternatives.

2.  Based on our previous discussions, I didn't envision that you would have to change the kll_sketch code itself other than perhaps a "wrapper" class that enables vectorized input to a vector of sketches and a vectorized get result that creates a vector result from a vector of sketches.  This would isolate the changes you need for NumPy from the sketch itself.  This is also much easier to support, maintain and debug.

3. If you don't change the internals of the sketch then SerDe becomes pretty straightforward. I don't know if you need a single serialization that represents a full vector of sketches,  but if you do, then I would just iterate over the individual serdes and figure out how to package it.  I really don't think you want to have to rewrite this low-level stuff.

4. Binary compatibility is critically important for us and I think will be important for you as well.  There are two dimensions of binary compatibility: history and language.  This means that a kll sketch serialized from Java can be successfully read by C++ and vice versa.  Similarly, a kll sketch serialized today will be able to be read many years from now.  Another aspect of this would be being able to collect, say, 100 sketches that were not created using the NumPy version and put them together in a NumPy vector; and vice versa.

I hope all of this makes sense to you.

Cheers,

Lee.



On Sun, May 10, 2020 at 4:21 PM leerho <le...@gmail.com>> wrote:
Michael,
This is great!  What testing have you been able to do so far?


On Sun, May 10, 2020 at 3:31 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Lee,

Thanks for all of that information, it's quite helpful to get a better understanding of things.

I've put the code on Github if you'd like to take a look: https://github.com/mdhimes/incubator-datasketches-cpp

Changes are
- new class in kll/include/kll_sketch.hpp, w/ associated constructor in kll/include/kll_sketch_impl.hpp.  This class spawns a specified number of sketches.
- new Python interface functions in python/src/kll_wrapper.cpp

The only new dependency introduced is the pybind11/numpy.h header file.  The new Numpy-compatible Python classes retain identical functionality to the existing classes (with minor changes to method names, e.g., get_min_value --> get_min_values), except that I have not yet implemented merging or (de)serialization.  These would be straight-forward to implement, if needed.

Re: characterization tests, I'll take a look at those tests you linked to and see about running them, time and compute resources permitting.

Michael
________________________________
From: leerho <le...@gmail.com>
Sent: Sunday, May 10, 2020 5:32 PM
To: dev@datasketches.apache.org
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Michael,

Is there a place on GitHub somewhere where I could look at your code so far?  The reason I ask, is before you do a PR, we would like to determine where a contribution such as this should be placed.

Our library is split up among different repositories, determined by language and dependencies.  This keeps the user downloads smaller and more focused.   We have two library repos for the core sketch algorithms, one for Java and one for C++/Python, where the dependencies are very lean, which simplifies integration into other systems.  We have separate repos for adaptors, which depend on one of the core repos. On the Java side, we have separate repos for adaptors for Apache Hive and Apache Pig, as the dependencies for each of these are quite large.  For C++, we have a dedicated repo for the adaptors for PostgreSQL.

Some of our adaptors are hosted with the target system.  For example, our Druid adaptors were contributed directly into Apache Druid.

I assume your code has dependencies on Python, NumPy and DataSketches-cpp. It is not clear to me at the moment whether we should create a separate repo for this or have a separate group of directories in our cpp repo.

****
We have a separate repo for our characterization code, which is not formally "released" as an Apache release.  It exists because we want others to be able to reproduce (or challenge) our claims of speed performance or accuracy.  It is the one repo where we have all languages and many different dependencies.  The coding style is not as rigorous or as well documented as our repos that do have formal releases.

Characterization testing is distinctly different from Unit Tests, which basically check all the main code paths and make sure that the program works as it should.  The key metric is code coverage, and Unit Tests should be fast, as they are run on every check-in of new code.  Characterization is also different from Integration Testing, which is testing how well the code works when integrated into larger systems.

Characterization tests are unique to our kind of library. Because our algorithms are probabilistic in nature, in order to verify accuracy or speed performance we need to run many thousands of trials to eliminate statistical noise in the results.  And when the data is large, this can take a long time.  You can peruse our website for many examples as all the plots result from various characterization studies.  What appears on the website is but a small fraction of all the testing we have done.

There are no "standard" tests as every sketch is different so we have to decide what is important to measure for a particular sketch, but the basic groups are speed and accuracy.

For speed there are many possible measurements, but the basic ones are update speed, merge speed, Serialization / Deserialization speed, get estimate or get result speeds.

For accuracy we want to validate that the sketch is performing within the bounds of the theoretical error distribution.  We want to measure this accuracy in the context of a stand-alone, purely streaming sketch and also in the context of merging many sketches together.
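
As a rough sketch of what one streaming-accuracy trial could look like in Python (the sizes, number of trials, and the 99th-percentile summary below are placeholders, not the sketch's published error bounds):

import numpy as np
from datasketches import kll_floats_sketch

def rank_error_trial(n=50000, k=200, rank=0.5):
    data = np.random.rand(n).astype(np.float32)
    sk = kll_floats_sketch(k)
    for x in data:
        sk.update(float(x))
    est = sk.get_quantile(rank)
    # where the estimate actually falls in the exact, sorted data
    true_rank = np.searchsorted(np.sort(data), est) / n
    return abs(true_rank - rank)

errors = [rank_error_trial() for _ in range(50)]
print("99th percentile rank error over trials:", np.quantile(errors, 0.99))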

We also try to do these same tests comparing the results against other alternatives users might have.  We have performed these same characterizations on other publicly available sketches as well as against traditional, brute-force approaches to solving the same problem.

For the solution you have developed, we would depend on you to decide what properties would be most important to characterize for users of this solution.  It should be very similar to what you would write in a paper describing this solution;  you want to convince the reader that this is very useful and why.

Since the first sketch you have leveraged is the KLL quantiles sketch, I would think you would want some characterizations similar to what we did for our studies <https://datasketches.apache.org/docs/Quantiles/KLLSketch.html> comparing our older quantiles sketch and the KLL sketch.

****
For the Java characterization tests, we have "standardized" on having small configuration files which define the key parameters of the test.  These are simple text files <https://github.com/apache/incubator-datasketches-characterization/tree/master/src/main/resources> of key-value pairs.  We don't have any centralized definition of these pairs, just that they are human readable and intelligible.  They are different for each type of sketch.

For the C++ tests, we don't have a collection of config files yet (this is one of our TODOs), but the same kind of parameters are set in the code itself.

We will likely want to set up a separate directory for your characterization tests.

I hope you find this helpful.

Cheers,

Lee.

On Sun, May 10, 2020 at 10:05 AM Michael Himes <mh...@knights.ucf.edu>> wrote:
The code is in a good state now.  It can take individual values, lists, or Numpy arrays as input, and it returns back Numpy arrays.  There are some additional features, like being able to specify which sketches the user wants to, e.g., get quantiles for.

But, I have only done minor testing with uniform and normal distributions.  I'd like to put it through more extensive testing (and some documentation) before releasing it, and it sounds like your characterization tests are the way to go -- it's not science if it's not reproducible!  Is there a standard set of tests for this purpose?  If not, are there standard tests that have been used for the existing codebase?

Michael
________________________________
From: leerho <le...@gmail.com>
Sent: Saturday, May 9, 2020 7:21 PM
To: dev@datasketches.apache.org
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

This is great.  The first step is to get your project working!  Once you think you are ready, it would be really useful if you could do some characterization testing in the NumPy environment. Characterization tests are what we run to fully understand how a sketch performs over a range of parameters and using thousands to millions of trials.  You can see some of the accuracy and speed performance plots of various sketches on our website.  Sometimes these can take hours to run.  We typically use synthetic data to drive our characterization tests to make them reproducible.

Real data can also be used and one comparison test I would recommend is comparing how long it takes to get approximate results using sketches versus how long it would take to get exact results using brute force methods.  The bigger the data set is the better :)
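
A minimal version of that comparison on synthetic data might look like the following (the sizes and the rank queried are placeholders); note that with the update loop still in Python, the sketch timing will be dominated by interpreter overhead, which is part of why we want the iteration on the C++ side:

import time
import numpy as np
from datasketches import kll_floats_sketch

data = np.random.randn(1_000_000).astype(np.float32)

# brute force: hold everything in memory and compute the exact quantile
t0 = time.perf_counter()
exact = np.quantile(data, 0.99)
brute_time = time.perf_counter() - t0

# sketch: one pass over the stream with bounded memory
t0 = time.perf_counter()
sk = kll_floats_sketch(200)
for x in data:
    sk.update(float(x))
approx = sk.get_quantile(0.99)
sketch_time = time.perf_counter() - t0

print("exact", exact, "approx", approx)
print("brute force", brute_time, "s; sketch", sketch_time, "s")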

We don't have much experience with NumPy so this will be a new environment for us.  But before you get too deep into this please get us involved.  We have been characterizing these streaming algorithms for a number of years, and would like to help you.

Cheers,

Lee.

On Sat, May 9, 2020 at 2:18 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
I'm not quite sure what being a committer entails, but yeah I'm happy to contribute.  I can't commit a lot of time to working on it, but with how things went for KLL I don't think it will take a lot of time for the other sketches if they are formatted in a similar manner.  Getting this library integrated into numpy/scipy would be awesome, I'm sure I could get some others in my field to begin using it.

Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>
Sent: Saturday, May 9, 2020 5:06 PM
To: Michael Himes <mh...@knights.ucf.edu>; dev@datasketches.apache.org
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

This is just awesome!   Would you be interested in becoming a committer on our project?  It is not automatic, but we could work with you to bring you up to speed on the other sketches in the library.  If you could help us integrate DataSketches into NumPy and possibly SciPy (not sure if this is necessary) it would be a very significant contribution and we would definitely want you to be part of our community!

Thanks,

Lee.

On Sat, May 9, 2020 at 1:41 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Lee,

Thanks for the notice, I went ahead and subscribed to the list.

As for Jon's email, this is actually what I have currently implemented!  Once I finish ironing out a couple improvements, I'm going to move some code around to follow the existing coding style, put it on Github, and submit a pull request.

Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>
Sent: Saturday, May 9, 2020 4:22 PM
To: Michael Himes <mh...@knights.ucf.edu>
Subject: Fwd: Permission to use KLL streaming-quantiles code in free open-source academic software

Hi Michael,
I don't think you saw this email as I doubt you are subscribed to our dev@datasketches.apache.org email list.

We would like to have you as part of our larger community, as others might also have suggestions on how to move your project forward.
You can subscribe by sending an empty email to dev-subscribe@datasketches.apache.org.

Lee.

---------- Forwarded message ---------
From: Jon Malkin <jo...@gmail.com>
Date: Thu, May 7, 2020 at 4:11 PM
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software
To: dev@datasketches.apache.org
Cc: Lee Rhodes <lr...@verizonmedia.com>, Edo Liberty <ed...@gmail.com>, edo@edoliberty.com


We're using pybind11 to get a C++ interface with python (vs raw C). The wrappers themselves are quite thin, but they do have examples of calling functions defined in the wrapper as opposed to only the sketch object.

I believe the easiest way to do this will be to define a pretty simple C++ object and create a pybind wrapper for it.  That object would contain a std::vector<kll_sketch>.  Then you'd define an update method for your custom object that iterates through a numpy array and calls update() on the appropriate sketch. You'd also want to define something similar for get_quantile() or whatever other methods you need that iterates through that vector of sketches and returns the result in a numpy array.

That's a pretty lightweight object. And then you'd use a similar thin pybind wrapper around it to make it play nicely with python. Since our C++ library is just templates, you'd end up with a free-standing library, with no requirement that the base datasketches library be involved.
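
As a pure-Python stand-in for the shape of that object (the class and method names here are placeholders, and the real per-element iteration would happen in C++ behind the pybind wrapper rather than in Python):

import numpy as np
from datasketches import kll_floats_sketch

class vector_of_kll_sketches:  # placeholder name, not part of the library
    def __init__(self, k, num_dims):
        # one kll_floats_sketch per dimension, mirroring std::vector<kll_sketch>
        self.sketches = [kll_floats_sketch(k) for _ in range(num_dims)]

    def update(self, values):
        # values: 1-D numpy array with one entry per dimension
        for sk, x in zip(self.sketches, values):
            sk.update(float(x))

    def get_quantiles(self, rank):
        # one estimate per dimension, returned as a numpy array
        return np.array([sk.get_quantile(rank) for sk in self.sketches])

vec = vector_of_kll_sketches(k=200, num_dims=5)
for row in np.random.randn(1000, 5):
    vec.update(row)
print(vec.get_quantiles(0.5))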

  jon

On Thu, May 7, 2020 at 1:08 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
I would be happy to share whatever I come up with (if anything).  The lack of a Numpy/Scipy implementation is what led me to the DataSketches library; it would be very useful to myself and others if it were a part of Numpy/Scipy.

For what it's worth, passing in a Numpy array and manipulating it from the C++ side is quite easy.  On the other hand, figuring out how to spawn m sketches and pass the values along to them looks like it'll be more challenging; there is a lot of code here, and it'll take some time for me to familiarize myself with it.

Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>
Sent: Thursday, May 7, 2020 12:00 PM
To: Michael Himes <mh...@knights.ucf.edu>
Cc: Edo Liberty <ed...@gmail.com>; dev@datasketches.apache.org; edo@edoliberty.com
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

If you do figure out how to do this, it would be great if you could share it with us.  We would like to extend it to other sketches and submit it as added functionality to NumPy.  I have been looking at the NumPy and SciPy libraries and have not found anything close to what we have.

Lee.


On Thu, May 7, 2020 at 7:08 AM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Lee, Jon,

Thanks for the information.  I tried to vectorize things this morning and ran into that exact problem -- since the offsets can differ, it leads to slices of different lengths, which wouldn't be possible to store as a single Numpy array.

Lee, your understanding of my problem is spot on.  n vectors of size m, where all m elements of each vector are a float (no NaNs or missing values).  I am interested in quantiles at rank r for each of the m streams.  Only 1 sketch will operate simultaneously, saving/loading the sketch is not required (though it would be a nice feature), and sketches would not need to be merged (no serialization/deserialization).

Not surprisingly, it looks like your original suggestion of handling this on the C++ side is the way to go.  Once I have time to dive into the code, my plan is to write something that implements what you described in the earlier email.

Thanks,
Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>
Sent: Wednesday, May 6, 2020 10:43 PM
To: Michael Himes <mh...@knights.ucf.edu>
Cc: dev@datasketches.apache.org; Edo Liberty <ed...@gmail.com>; edo@edoliberty.com

Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Michael,

One of my colleagues, Jon Malkin, pointed out that the vector-KLL will not work for another reason: for each dimension, the choice of whether to delete the odd or even values in the compactor must be random and independent of the other dimensions.  Otherwise you might get unwanted correlation effects between the dimensions.

This is another argument that you should have independent compactors for each dimension.  So you might as well stick with individual sketches for each dimension.

Lee.

On Wed, May 6, 2020 at 4:39 PM Lee Rhodes <lr...@verizonmedia.com>> wrote:
Michael,

Allow me to back up for a moment to make sure I understand your problem.

You have a large number of large vectors of the form V_n = {x_i}:  n vectors of size m, where x is a number and x_i is the ith element, or equivalently, the ith dimension.

Assumptions:

  *   All vectors, V, are of the same size m.
  *   All elements, x_i, are valid numbers of the same type. No missing values, and if you are using floats, this means no NaNs.

In aggregate, the n vectors represent m independent distributions of values.

Your task is to be able to obtain m quantiles at rank r in a single query.

****
To do this, using your idea, would require vectorization of the entire sketch and not just the compactors.  The inputs are vectors, the result of operations such as getQuantile(r), getQuantileUpperBound(r), getQuantileLowerBound(r), are also vectors.

This sketch will be a large data structure, which leads to more questions ...

  *   Do you anticipate having many of these vectorized sketches operating simultaneously?
  *   Is there any requirement to store and later retrieve this sketch?
  *   Or, the nearly equivalent question: Do you require merging of these sketches (across clusters, for example)?  Which also means serialization and deserialization.

I am concerned that this vector-quantiles sketch would be limited in the sense that it may not be as widely applicable as it could be.

Our experience with real data is that it is ugly with missing values, NaN, nulls, etc.  Which means we would not be able to vectorize the compactor.  Each dimension i would need a separate independent compactor because the compaction times will vary depending on missing values or NaNs in the data.

Spacewise, I don't think vectorizing the entire sketch would be much smaller than having separate independent sketches for each dimension, because the internals of the existing sketch are already quite space efficient, leveraging compact arrays, etc.

As a first step I would favor figuring out how to access the NumPy data structure on the C++ side, having individual sketches for each dimension, and doing the iterations updating the sketches in C++.   It also has the advantage of leveraging code that exists and it would automatically be able to leverage any improvements to the sketch code over time.  In addition, it could be a prototype of how to integrate other sketches into the NumPy ecosystem.

A fully vectorized sketch would be a separate implementation and would not be able to take advantage of these points.

Lee.

On Wed, May 6, 2020 at 2:47 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Lee,

I don't think there is a problem with the DataSketches library, just that it doesn't support what I am trying to do -- looking in the documentation, it only supports streams of ints or floats, and those situations work fine for me.  Here's what I did:
- began with the KLL test .py file: https://github.com/apache/incubator-datasketches-cpp/blob/master/python/tests/kll_test.py
- replaced line 30 with kll.update(np.ones(10) * randn())  to have a Numpy array of 10 identical values.
- ran the code

This leads to the following error, as expected:
TypeError: update(): incompatible function arguments. The following argument types are supported:
    1. (self: datasketches.kll_floats_sketch, item: float) -> None

Invoked with: <datasketches.kll_floats_sketch object at 0x7f1e128989d0>, array([-1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424,
       -1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424])

It's not coded to support Numpy arrays, so it complains.  What I would ideally like to have happen in this scenario is that it would treat each element in the array as a separate stream.  Then, later when getting a given quantile, it would give 10 values, one for each stream.  I don't see an easy approach to implementing this on the Python side besides a very slow iterative approach, and admittedly my C++ is quite rusty, so I haven't looked into the codebase to see how I might modify things there to support this functionality.

Re: the streaming-quantiles code being easily modified, I believe the only necessary changes would be changing the Compactor class to be a subclass of numpy.ndarray, rather than list, and implementing replacements for the list-specific methods that are used, like .append().  Then, it isn't necessary to loop over the streams since we can make use of Numpy's broadcasting, which will handle the looping in its C++ code, as you mentioned.  I'll work on this and see if it really is as straightforward as it seems.

If you have any advice on how to use DataSketches for my problem, I'm certainly open to that.

Thanks,
Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Wednesday, May 6, 2020 4:37 PM
To: Michael Himes <mh...@knights.ucf.edu>>; dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Cc: Edo Liberty <ed...@gmail.com>>; edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Michael,

Thank you for considering the DataSketches library.   I am adding this thread to our dev@datasketches.apache.org<ma...@datasketches.apache.org> so that our whole team can contribute to finding a solution for you.

WRT the error you experienced, please help us help you by sharing with us what the exact error was.

We are about to release a major upgrade to the DataSketches C++/Python product in the next few weeks.  We have fixed a number of stability issues and bugs, which may solve the problem.  Nonetheless, we want to work with you to get your problem solved.

Updating 1e5 sketches in a system is not a problem in Java or C++.   We have real-time systems today that generate and process over 1e9 sketches every day.  Unfortunately our experience tells us that looping in Python code will be 10 to 100 times slower than Java or C++.  This is because the code would have to switch from Python to C++ for every vector element.

By comparison, the streaming-quantiles code could be easily modified to use Numpy arrays and operate on vectors.

I would like to understand more about what you have in mind that would be "easily modified".

NumPy achieves its speed performance by doing all of the matrix operations in pre-compiled C++ code.  To achieve best performance, we would want to read and loop through the NumPy data structure on the C++ side leveraging the C++ DataSketches library directly.  I am not sure what would be involved to actually accomplish that.

But first we need to get your Python + NumPy code working correctly with our library so we can find out what its actual performance is.

Cheers,

Lee.





On Wed, May 6, 2020 at 12:10 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Edo, Lee,

Thanks for the prompt response.  I looked at the datasketches library, and while it seems to have a lot more features, it looks like it'll be a lot more difficult to get it to work for my desired use case.

My problem is that I need quantiles for each element of a vector (length on the order of 1e4 -- 1e5), for some finite stream of vectors (on the order of 1e6 -- 1e8).  I tried using datasketches's KLL with Numpy arrays, but it throws an error, so it doesn't seem like datasketches handles this situation currently.

To use datasketches, I think I would need to instantiate 1 object per vector element, and I suspect this will slow things down considerably due to iterating over the objects when each vector is processed.  By comparison, the streaming-quantiles code could be easily modified to use Numpy arrays and operate on vectors.  I ran a few unit tests on both codes and found equivalent behavior, as expected.

Do you have any recommendation(s) for this situation?  Are there known limitations of the streaming-quantiles code that would cause issues for my use case?  Are the other methods offered in datasketches 'better' than the KLL implemented in streaming-quantiles?  I'm quite out of my area of expertise, so I appreciate any advice you can offer, and I will of course acknowledge it in the publication.

Best,
Michael

________________________________
From: Edo Liberty <ed...@gmail.com>>
Sent: Tuesday, May 5, 2020 8:09 PM
To: Lee Rhodes <lr...@verizonmedia.com>>; Michael Himes <mh...@knights.ucf.edu>>
Cc: edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

+Lee

Hi Michael, Thanks for reaching out.
While you can certainly do that, I recommend using the Python-bound datasketches library. It will be more robust, faster, and bug-free than my code :)

On Tue, May 5, 2020 at 14:11 Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Edo,

I'm currently working on a Python package for machine-learning-accelerated exoplanet modeling.  It is free and open source (see here if you're curious: https://github.com/exosports/HOMER), and it's meant purely for reproducible academic research.

I'm adding some new features to the software, and one of them requires computing quantiles for a data set that cannot fit into memory.  After searching around for different methods to do this, your KLL method seemed to be a good option in terms of speed and space requirements.

Rather than reinvent the wheel and code my own implementation of the method from scratch, I was wondering if you'd be willing to allow me to use your code?  I don't see a license, so I wanted to make sure you're okay with this.  I could implement it as a submodule within my repo, or I could only include the kll.py file and add some additional comments pointing to your repo and such, whichever you prefer.

Best,
Michael
--
From my cell phone.

Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Posted by Jon Malkin <jo...@gmail.com>.
We've been polishing things up for a release, so that was one of several
things that we fixed over the last several days. Thank you for finding it!

Anyway, if you're generally happy with the state of things (and are allowed
to under any employment terms), I'd encourage you to create a pull request to
merge your changes into the main repo. It doesn't need to be perfect as we
can always make changes as part of the PR review or post-merge.

Thanks,
  jon


On Mon, May 11, 2020 at 2:25 PM Michael Himes <mh...@knights.ucf.edu>
wrote:

> Thanks for taking a look, Jon.
>
> I pushed an update that address 2 & 4.
>
> #3 is actually something I had a question about. I've tested passing
> numpy.nan into the update function, and it doesn't appear to break anything
> (min, max, etc all still work correctly).  However, the reported number of
> items per sketch counts the nan entries.  Is this the expected behavior, or
> should the get_n() method return a number that does not count the nans it
> has seen?  I expected the latter, so I'm worried that numpy's nan is being
> treated differently.
>
> Michael
> ------------------------------
> *From:* Jon Malkin <jo...@gmail.com>
> *Sent:* Monday, May 11, 2020 4:32 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> I didn't look in super close detail, but the code overall looks pretty
> good. Comments are below.
>
> Note that not all of these necessarily need changes or replies. I'm just
> trying to document things we'll want to think about for keeping the library
> general-purpose (and we can always make changes after merging, of course).
>
> 1. I worry the name kll_sketches is confusingly similar to kll_sketch.
> Maybe vector_kll_sketches? But if there's a way to extend KLL in the future
> to operate on an entire vector at a time (vs treating each dimension
> independently) that'd become confusing. I think an inherently vectorized
> version would be a very different beast, but I always worry I'm not being
> imaginative enough. If merging into the Apache codebase, I'd probably wait
> to see what the file looks like with the renaming before a final decision
> on moving to its own file.
>
> 2. What happens if the input to update() has >2 dimensions? If that'd be
> invalid, we should explicitly check and complain. If it'll Do The Right
> Thing by operating on the first 2 dimensions (meaning correct indices)
> that's fine, but otherwise should probably complain.
>
> 3. Can this handle sparse input vectors? Not sure how important that is in
> general, even if your project doesn't require it. kll_sketch will ignore
> NaNs, so those appearing would mean the number of items per sketch can
> already differ.
>
> 4. I'd probably eat the very slightly increased space and go with 32 bits
> for the number of dimensions (aka number of sketches). If trying to look at
> a distribution of values for some machine learning application, it'd be
> easy to overflow 65k dimensions for some tasks.
>
> 5. I imagine you've realized that it's easiest to do unit tests from
> python in this case. That's another advantage of having this live in the
> wrapper.
>
> 6. Finally, that assert issue is already obsolete :). Asserts were
> converted if/throw exceptions late last week. It'll be flagged as a
> conflict in merging, so no worries for now.
>
> Looking good at this point. And as I said, not all of these need changes
> or comments from you.
>
>   jon
>
> On Mon, May 11, 2020 at 7:09 AM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Understood, I went ahead and moved the new class to the kll_wrapper.cpp
> file -- I'll leave it to you to decide if it's better as its own file.
>
> Also, while gcc 7.4.0 compiles the code without issue, using gcc 7.5.0
> throws errors regarding the assert calls in kll_sketch_impl.hpp.  I added
> an include of assert.h there and then it compiled without issue.  It's
> possible that other compilers will also complain about that, so maybe this
> is a good update to the main branch.
>
> Michael
> ------------------------------
> *From:* Jon Malkin <jo...@gmail.com>
> *Sent:* Sunday, May 10, 2020 10:47 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> My only comment without having looked at actual code is that the new class
> would be more appropriate in the python wrapper. Maybe even drop it in as
> it's own file, as that would decrease recompile time a bit when debugging
> (that's pybind's suggestion, anyway). Probably not a huge difference with
> how light these wrappers are.
>
> If this is something that becomes widely used, to where we look at pushing
> it into the base library, we'd look at whether we could share any data
> across sketches. But we're far from that point currently. It'd be nice to
> need to consider that.
>
>   jon
>
> On Sun, May 10, 2020, 7:33 PM leerho <le...@gmail.com> wrote:
>
> Michael,  this has been a great interchange and certainly will allow us to
> move forward more quickly.
>
> Thank you for working on this on a Mother's Day Sunday!
>
> I'm sure Alex and Jon may have more questions, when they get a chance to
> look at it starting tomorrow.
>
> Cheers, and be safe and well!
>
> Lee.
>
> On Sun, May 10, 2020 at 6:25 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Re: testing, so far I've just done glorified unit tests for uniform and
> normal distributions of varying sizes.  I plan to do some timing tests vs
> the existing single-sketch Python class to see how it compares for 1, 10,
> and 100 streams.
>
> 1. That makes sense.  One option to allow full Numpy compatibility but
> without requiring a Python user to use Numpy would be to return everything
> as lists, rather than Numpy arrays.  Numpy users could then convert those
> lists into arrays, and non-Numpy users would be unaffected (aside from
> needing the pybind11/numpy.h header).  Alternatively, some flag could be
> set when instantiating the object that would control whether things are
> returned as lists or arrays, though this still requires the numpy.h header
> file.
>
> 2. I didn't change the kll_sketch code, I only defined a new (wrapper)
> class called kll_sketches, which spawns a user-specified number of
> sketches.  Each of those sketches are kll_sketch objects and uses all of
> the existing code for that.  For fast execution in Python, the parallel
> sketches must be spawned in C++, but the existing Python object could only
> spawn a single sketch since it wraps the kll_sketch class.  Perhaps the
> kll_sketches class would be better placed in the python/src/kll_wrapper.cpp
> file?  I suppose you wouldn't need this class if you weren't using Python.
>
> 3. Yes, SerDe is very straight-forward here.  I've marked some stuff as
> todo's, and that is one of them -- the plan is to do like you described and
> call the relevant kll_sketch method on each of the sketches and return that
> to Python in a sensible format.  For deserialization, it would just iterate
> through them and load them into the kll_sketches object.  I don't require
> it for my project, so I didn't bother to wrap that yet -- I'll take a look
> sometime this week after I finish my work for the day, shouldn't take long
> to do.
>
> 4. That makes sense.  Does using Numpy complicate that at all?  My thought
> is that since under the hood everything is using the existing kll_sketch
> class, it would have full compatibility with the rest of the library (once
> SerDe is added in).
>
> Michael
> ------------------------------
> *From:* leerho <le...@gmail.com>
> *Sent:* Sunday, May 10, 2020 8:42 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Thanks for the link to your code.  My colleagues, Jon and Alex, will take
> a closer look this next week.  They wrote this code so they are much closer
> to it than I.
>
> What you have done so far makes sense for you as you want to get this
> working in the NumPy environment as quickly as possible.  As soon as we
> start thinking about incorporating this into our library other concerns
> become important.
>
> 1. Adding API calls is the recommended way to add functionality (like
> NumPy) to a library.  We cannot change API calls in a way that is only
> useful with NumPy, because it would seriously impact other users of the
> library that don't need NumPy.  If both sets of calls cannot simultaneously
> exist in the same sketch API, then we need to consider other alternatives.
>
> 2.  Based on our previous discussions, I didn't envision that you would
> have to change the kll_sketch code itself other than perhaps a "wrapper"
> class that enables vectorized input to a vector of sketches and a
> vectorized get result that creates a vector result from a vector of
> sketches.  This would isolate the changes you need for NumPy from the
> sketch itself.  This is also much easier to support, maintain and debug.
>
> 3. If you don't change the internals of the sketch then SerDe becomes
> pretty straightforward. I don't know if you need a single serialization
> that represents a full vector of sketches,  but if you do, then I would
> just iterate over the individual serdes and figure out how to package it.
> I really don't think you want to have to rewrite this low-level stuff.
>
> 4. Binary compatibility is critically important for us and I think will be
> important for you as well.  There are two dimensions of binary
> compatibility: history and language.  This means that a kll sketch
> serialized from Java, can be successfully read by C++ and visa versa.
> Similarly, a kll sketch serialized today will be able to be read many years
> from now.     Another aspect of this would mean being able to collect, say,
> 100 sketches that were not created using the NumPy version, and being able
> to put them together in a NumPy vector; and visa versa.
>
> I hope all of this make sense to you.
>
> Cheers,
>
> Lee.
>
>
>
> On Sun, May 10, 2020 at 4:21 PM leerho <le...@gmail.com> wrote:
>
> Michael,
> This is great!  What testing have you been able to do so far?
>
>
> On Sun, May 10, 2020 at 3:31 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Lee,
>
> Thanks for all of that information, it's quite helpful to get a better
> understanding of things.
>
> I've put the code on Github if you'd like to take a look:
> https://github.com/mdhimes/incubator-datasketches-cpp
> <https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fmdhimes%2Fincubator-datasketches-cpp&data=02%7C01%7Cmhimes%40knights.ucf.edu%7C143a0d8b1c7a41c220f608d7f5ea7182%7C5b16e18278b3412c919668342689eeb7%7C0%7C1%7C637248259604586993&sdata=A%2F4%2B1LIzTcBIn5kZG62FPC5zMbX6neTBzzbRRrDg9bU%3D&reserved=0>
>
> Changes are
> - new class in kll/include/kll_sketch.hpp, w/ associated constructor in
> kll/include/kll_sketch_impl.hpp.  This class spawns a specified number of
> sketches.
> - new Python interface functions in python/src/kll_wrapper.cpp
>
> The only new dependency introduced is the pybind11/numpy.h header file.
> The new Numpy-compatible Python classes retain identical functionality to
> the existing classes (with minor changes to method names, e.g.,
> get_min_value --> get_min_values), except that I have not yet implemented
> merging or (de)serialization.  These would be straight-forward to
> implement, if needed.
>
> Re: characterization tests, I'll take a look at those tests you linked to
> and see about running them, time and compute resources permitting.
>
> Michael
> ------------------------------
> *From:* leerho <le...@gmail.com>
> *Sent:* Sunday, May 10, 2020 5:32 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Michael,
>
> Is there a place on GitHub somewhere where I could look at your code so
> far?  The reason I ask, is before you do a PR, we would like to determine
> where a contribution such as this should be placed.
>
> Our library is split up among different repositories, determined by
> language and dependencies.  This keeps the user downloads smaller and more
> focused.   We have two library repos for the core sketch algorithms, one
> for Java and one for C++/Python, where the dependencies are very lean,
> which simplifies integration into other systems.  We have separate repos
> for adaptors, which depend on one of the core repos. On the Java side, we
> have separate repos for adaptors for Apache Hive and Apache Pig, as the
> dependencies for each of these are quite large.  For C++, we have a
> dedicated repo for the adaptors for PostgreSQL.
>
> Some of our adaptors are hosted with the target system.  For example, our
> Druid adaptors were contributed directly into Apache Druid.
>
> I assume your code has dependencies on Python, NumPy and DataSketches-cpp.
> It is not clear to me at the moment whether we should create a separate
> repo for this or have a separate group of directories in our cpp repo.
>
> ****
> We have a separate repo for our characterization code, which is not
> formally "released" as an Apache release.  It exists because we want others
> to be able to reproduce (or challenge) our claims of speed performance or
> accuracy.  It is the one repo where we have all languages and many
> different dependencies.  The coding style is not as rigorous or as well
> documented as our repos that do have formal releases.
>
> Characterization testing is distinctly different from Unit Tests, which
> basically checks all the main code paths and makes sure that the program
> works as it should.  The key metric is code coverage and Unit Tests should
> be fast as it is run on every check-in of new code.  Characterization is
> also different from Integration Testing, which is testing how well the code
> works when integrated into larger systems.
>
> Characterization tests are unique to our kind of library. Because our
> algorithms are probabilistic in nature, in order to verify accuracy or
> speed performance we need to run many thousands of trials to eliminate
> statistical noise in the results.  And when the data is large, this can
> take a long time.  You can peruse our website for many examples as all the
> plots result from various characterization studies.  What appears on the
> website is but a small fraction of all the testing we have done.
>
> There are no "standard" tests as every sketch is different so we have to
> decide what is important to measure for a particular sketch, but the basic
> groups are *speed* and *accuracy*.
>
> For speed there are many possible measurements, but the basic ones are
> update speed, merge speed, Serialization / Deserialization speed, get
> estimate or get result speeds.
>
> For accuracy we want to validate that the sketch is performing within the
> bounds of the theoretical error distribution.  We want to measure this
> accuracy in the context of a stand-alone, purely streaming sketch and also
> in the context of merging many sketches together.
>
> We also try to do these same tests comparing the results against other
> alternatives users might have.  We have performed these same
> characterizations on other publically available sketches as well as against
> traditional, brute-force approaches to solving the same problem.
>
> For the solution you have developed, we would depend on you to decide what
> properties would be most important to characterize for users of this
> solution.  It should be very similar to what you would write in a paper
> describing this solution;  you want to convince the reader that this is
> very useful and why.
>
> Since the first sketch you have leveraged is the KLL quantiles sketch, I
> would think you would want some characterizations similar to what we did
> for our studies
> <https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdatasketches.apache.org%2Fdocs%2FQuantiles%2FKLLSketch.html&data=02%7C01%7Cmhimes%40knights.ucf.edu%7C143a0d8b1c7a41c220f608d7f5ea7182%7C5b16e18278b3412c919668342689eeb7%7C0%7C1%7C637248259604596948&sdata=jIPYbqCi0PFQpqKmxUqDRwLhRZYt9mODB%2Fd86O18Txo%3D&reserved=0>
> comparing our older quantiles sketch and the KLL sketch.
>
> ****
> For the Java characterization tests, we have "standardized" on having
> small configuration files which define the key parameters of the test.
> These are simple text files
> <https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fincubator-datasketches-characterization%2Ftree%2Fmaster%2Fsrc%2Fmain%2Fresources&data=02%7C01%7Cmhimes%40knights.ucf.edu%7C143a0d8b1c7a41c220f608d7f5ea7182%7C5b16e18278b3412c919668342689eeb7%7C0%7C1%7C637248259604596948&sdata=miV7FSNEyAv5iWo%2Be%2BZQHAgJmkZyYhEdrb38qGMcjCQ%3D&reserved=0>
> of key-value pairs.  We don't have any centralized definition of these
> pairs, just that they are human readable and intelligible.  They are
> different for each type of sketch.
>
> For the C++ tests, we don't have a collection of config files yet (this is
> one of our TODOs), but the same kind of parameters are set in the code
> itself.
>
> We will likely want to set up a separate directory for your
> characterization tests.
>
> I hope you find this helpful.
>
> Cheers,
>
> Lee.
>
> On Sun, May 10, 2020 at 10:05 AM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> The code is in a good state now.  It can take individual values, lists, or
> Numpy arrays as input, and it returns back Numpy arrays.  There are some
> additional features, like being able to specify which sketches the user
> wants to, e.g., get quantiles for.
>
> But, I have only done minor testing with uniform and normal
> distributions.  I'd like to put it through more extensive testing (and some
> documentation) before releasing it, and it sounds like your
> characterization tests are the way to go -- it's not science if it's not
> reproducible!  Is there a standard set of tests for this purpose?  If not,
> are there standard tests that have been used for the existing codebase?
>
> Michael
> ------------------------------
> *From:* leerho <le...@gmail.com>
> *Sent:* Saturday, May 9, 2020 7:21 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> This is great.  The first step is to get your project working!  Once you
> think you are ready, it would be really useful if you could do some
> characterization testing in the NumPy environment. Characterization tests
> are what we run to fully understand how a sketch performs over a range of
> parameters and using thousands to millions of trials.  You can see some of
> the accuracy and speed performance plots of various sketches on our
> website.  Sometimes these can take hours to run.  We typically use
> synthetic data to drive our characterization tests to make them
> reproducible.
>
> Real data can also be used and one comparison test I would recommend is
> comparing how long it takes to get approximate results using sketches
> versus how long it would take to get exact results using brute force
> methods.  The bigger the data set is the better :)
>
> We don't have much experience with NumPy so this will be a new environment
> for us.  But before you get too deep into this please get us involved.  We
> have been characterizing these streaming algorithms for a number of years,
> and would like to help you.
>
> Cheers,
>
> Lee.
>
> On Sat, May 9, 2020 at 2:18 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> I'm not quite sure what being a committer entails, but yeah I'm happy to
> contribute.  I can't commit a lot of time to working on it, but with how
> things went for KLL I don't think it will take a lot of time for the other
> sketches if they are formatted in a similar manner.  Getting this library
> integrated into numpy/scipy would be awesome, I'm sure I could get some
> others in my field to begin using it.
>
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Saturday, May 9, 2020 5:06 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>; dev@datasketches.apache.org
> <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> This is just awesome!   Would you be interested in becoming a committer on
> our project?  It is not automatic, but we could work with you to bring you
> up to speed on the other sketches in the library.  If you could help us
> integrate DataSketches into NumPy and possibly SciPy (not sure if this is
> necessary) it would be a very significant contribution and we would
> definitely want you to be part of our community!
>
> Thanks,
>
> Lee.
>
> On Sat, May 9, 2020 at 1:41 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Lee,
>
> Thanks for the notice, I went ahead and subscribed to the list.
>
> As for Jon's email, this is actually what I have currently implemented!
> Once I finish ironing out a couple improvements, I'm going to move some
> code around to follow the existing coding style, put it on Github, and
> submit a pull request.
>
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Saturday, May 9, 2020 4:22 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>
> *Subject:* Fwd: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Hi Michael,
> I don't think you saw this email as I doubt you are subscribed to our
> dev@datasketches.apache.org email list.
>
> We would like to have you as part of our larger community, as others might
> also have suggestions on how to move your project forward.
> You can subscribe by sending an empty email to
> dev-subscribe@datasketches.apache.org.
>
> Lee.
>
> ---------- Forwarded message ---------
> From: *Jon Malkin* <jo...@gmail.com>
> Date: Thu, May 7, 2020 at 4:11 PM
> Subject: Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
> To: <de...@datasketches.apache.org>
> Cc: Lee Rhodes <lr...@verizonmedia.com>, Edo Liberty <
> edo.liberty@gmail.com>, edo@edoliberty.com <ed...@edoliberty.com>
>
>
> We're using pybind11 to get a C++ interface with python (vs raw C). The
> wrappers themselves are quite thin, but they do have examples of calling
> functions defined in the wrapper as opposed to only the sketch object.
>
> I believe the easiest way to do this will be to define a pretty simple C++
> object and create a pybind wrapper for it.  That object would contain a
> std::vector<kll_sketch>.  Then you'd define an update method for your
> custom object that iterates through a numpy array and calls update() on the
> appropriate sketch. You'd also want to define something similar for
> get_quantile() or whatever other methods you need that iterates through
> that vector of sketches and returns the result in a numpy array.
>
> That's a pretty lightweight object. And then you'd use a similar thin
> pybind wrapper around it to make it play nicely with python. Since our C++
> library is just templates, you'd end up with a free-standing library, with
> no requirement that the base datasketches library be involved.
>
>   jon
>
> On Thu, May 7, 2020 at 1:08 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> I would be happy to share whatever I come up with (if anything).  The lack
> of a Numpy/Scipy implementation is what led me to the DataSketches library,
> it would be very useful to myself and others if it were a part of
> Numpy/Scipy.
>
> For what it's worth, passing in a Numpy array and manipulating it from the
> C++ side is quite easy.  On the other hand, figuring out how to spawn m
> sketches and pass the values along to that looks like it'll be more
> challenging, there is a lot of code here and it'll take some time for me to
> familiarize myself with it.
>
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Thursday, May 7, 2020 12:00 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>
> *Cc:* Edo Liberty <ed...@gmail.com>; dev@datasketches.apache.org <
> dev@datasketches.apache.org>; edo@edoliberty.com <ed...@edoliberty.com>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> If you do figure out how to do this, it would be great if you could share
> it with us.  We would like to extend  it to other sketches and submit it as
> an added functionality to NumPy.  I have been looking at the NumPy and
> SciPy libraries and have not found anything close to what we have.
>
> Lee.
>
>
> On Thu, May 7, 2020 at 7:08 AM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Lee, Jon,
>
> Thanks for the information.  I tried to vectorize things this morning and
> ran into that exact problem -- since the offsets can differ, it leads to
> slices of different lengths, which wouldn't be possible to store as a
> single Numpy array.
>
> Lee, your understanding of my problem is spot on.  n vectors of size m,
> where all m elements of each vector are a float (no NaNs or missing
> values).  I am interested in quantiles at rank r for each of the m
> streams.  Only 1 sketch will operate simultaneously, saving/loading the
> sketch is not required (though it would be a nice feature), and sketches
> would not need to be merged (no serialization/deserialization).
>
> Not surprisingly, it looks like your original suggestion of handling this
> on the C++ side is the way to go.  Once I have time to dive into the code,
> my plan is to write something that implements what you described in the
> earlier email.
>
> Thanks,
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Wednesday, May 6, 2020 10:43 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>
> *Cc:* dev@datasketches.apache.org <de...@datasketches.apache.org>; Edo
> Liberty <ed...@gmail.com>; edo@edoliberty.com <ed...@edoliberty.com>
>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Michael,
>
> One of my colleagues, Jon Malkin, pointed out that the vector-KLL will not
> work for another reason and that is for each dimension, choosing whether to
> delete the odd or even values in the compactor must be random and
> independent of the other dimensions.  Otherwise you might get unwanted
> correlation effects between the dimensions.
>
> This is another argument that you should have independent compactors for
> each dimension.  So you might as well stick with individual sketches for
> each dimension.
>
> Lee.
>
> On Wed, May 6, 2020 at 4:39 PM Lee Rhodes <lr...@verizonmedia.com>
> wrote:
>
> Michael,
>
> Allow me to back up for a moment to make sure I understand your problem.
>
> You have a large number of large vectors of the form *V_n = {x_i}:*  *n*
> vectors of size *m, *where *x* is a *number* and *x_i* is the *i*th
> element, or equivalently, the *i*th dimension.
>
> Assumptions:
>
>    - All vectors, *V*, are of the same size *m.*
>    - All elements, *x_i*, are valid numbers of the same type. No missing
>    values, and if you are using *floats*, this means no *NaN*s.
>
> In aggregate, the *n* vectors represent *m* *independent* distributions
> of values.
>
> Your task is to be able to obtain *m* quantiles at rank *r* in a single
> query.
>
> ****
> To do this, using your idea, would require vectorization of the entire
> sketch and not just the compactors.  The inputs are vectors, the result of
> operations such as getQuantile(r), getQuantileUpperBound(r),
> getQuantileLowerBound(r), are also vectors.
>
> This sketch will be a large data structure, which leads to more questions
> ...
>
>    - Do you anticipate having many of these vectorized sketches operating
>    simultaneously?
>    - Is there any requirement to store and later retrieve this sketch?
>    - Or, the nearly equivalent question: Do you require merging of these
>    sketches (across clusters, for example)?  Which also means serialization
>    and deserialization.
>
> I am concerned that this vector-quantiles sketch would be limited in the
> sense that it may not be as widely applicable as it could be.
>
> Our experience with real data is that it is ugly with missing values, NaN,
> nulls, etc.  Which means we would not be able to vectorize the compactor.
> Each dimension *i* would need a separate independent compactor because
> the compaction times will vary depending on missing values or NaNs in the
> data.
>
> Spacewise, I don't think having separate independent sketches for each
> dimension would be much smaller than vectorizing the entire sketch, because
> the internals of the existing sketch are already quite space efficient
> leveraging compact arrays, etc.
>
> As a first step I would favor figuring out how to access the NumPy data
> structure on the C++ side, having individual sketches for each
> dimension, and doing the iterations updating the sketches in C++.   It also
> has the advantage of leveraging code that exists and it would automatically
> be able to leverage any improvements to the sketch code over time.  In
> addition, it could be a prototype of how to integrate other sketches into
> the NumPy ecosystem.
>
> A fully vectorized sketch would be a separate implementation and would not
> be able to take advantage of these points.
>
> Lee.
>
> On Wed, May 6, 2020 at 2:47 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Lee,
>
> I don't think there is a problem with the DataSketches library, just that
> it doesn't support what I am trying to do -- looking in the documentation,
> it only supports streams of ints or floats, and those situations work fine
> for me.  Here's what I did:
> - began with the KLL test .py file:
> https://github.com/apache/incubator-datasketches-cpp/blob/master/python/tests/kll_test.py
> - replaced line 30 with kll.update(np.ones(10) * randn())  to have a Numpy
> array of 10 identical values.
> - ran the code
>
> This leads to the following error, as expected:
> TypeError: update(): incompatible function arguments. The following
> argument types are supported:
>     1. (self: datasketches.kll_floats_sketch, item: float) -> None
>
> Invoked with: <datasketches.kll_floats_sketch object at 0x7f1e128989d0>,
> array([-1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424,
>        -1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424])
>
> It's not coded to support Numpy arrays, therefore it complains.  What I
> would ideally like to have happen in this scenario is it would treat each
> element in the array as a separate stream.  Then, later when getting a
> given quantile, it would give 10 values, one for each stream.  I don't see
> an easy approach to implementing this on the Python side besides a very
> slow iterative approach, and admittedly my C++ is quite rusty so I haven't
> looked into the codebase to see how I might modify things there to support
> this functionality.
>
> Re: the streaming-quantiles code being easily modified, I believe the only
> necessary changes would be changing the Compactor class to be a subclass of
> numpy.ndarray, rather than list, and implementing methods for the
> list-specific methods that are used, like .append().  Then, it isn't
> necessary to loop over the streams since we can make use of Numpy's
> broadcasting, which will handle the looping in its C++ code, as you
> mentioned.  I'll work on this and see if it really is as straight-forward
> as it seems.
>
> If you have any advice on how to use DataSketches for my problem, I'm
> certainly open to that.
>
> Thanks,
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Wednesday, May 6, 2020 4:37 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>; dev@datasketches.apache.org
> <de...@datasketches.apache.org>
> *Cc:* Edo Liberty <ed...@gmail.com>; edo@edoliberty.com <
> edo@edoliberty.com>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Michael,
>
> Thank you for considering the DataSketches library.   I am adding this
> thread to our dev@datasketches.apache.org so that our whole team can
> contribute to finding a solution for you.
>
> WRT the error you experienced, please help us help you by sharing with us
> what the exact error was.
>
> We are about to release a major upgrade to the DataSketches C++/Python
> product in the next few weeks.  We have fixed a number of stability issues
> and bugs, which may solve the problem.  Nonetheless, we want to work with
> you to get your problem solved.
>
> Updating 1e5 sketches in a system is not a problem in Java or C++.   We
> have real-time systems today that generate and process over 1e9 sketches
> every day.  Unfortunately our experience tells us that looping in Python
> code will be 10 to 100 times slower than Java or C++.  This is because the
> code would have to switch from Python to C++ for every vector element.
>
> By comparison, the streaming-quantiles code could be easily modified to
> use Numpy arrays and operate on vectors.
>
>
> I would like to understand more about what you have in mind that would be
> "easily modified".
>
> NumPy achieves its speed performance by doing all of the matrix operations
> in pre-compiled C++ code.  To achieve best performance, we would want to
> read and loop through the NumPy data structure on the C++ side leveraging
> the C++ DataSketches library directly.  I am not sure what would be
> involved to actually accomplish that.
>
> But first we need to get your Python + NumPy code working correctly with
> our library so we can find out what its actual performance is.
>
> Cheers,
>
> Lee.
>
>
>
>
>
> On Wed, May 6, 2020 at 12:10 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Edo, Lee,
>
> Thanks for the prompt response.  I looked at the datasketches library, and
> while it seems to have a lot more features, it looks like it'll be a lot
> more difficult to get it to work for my desired use case.
>
> My problem is that I need quantiles for each element of a vector (length
> on the order of 1e4 -- 1e5), for some finite stream of vectors (on the
> order of 1e6 -- 1e8).  I tried using datasketches's KLL with Numpy arrays,
> but it throws an error, so it doesn't seem like datasketches handles this
> situation currently.
>
> To use datasketches, I think I would need to instantiate 1 object per
> vector element, and I suspect this will slow things down considerably due
> to iterating over the objects when each vector is processed.  By
> comparison, the streaming-quantiles code could be easily modified to use
> Numpy arrays and operate on vectors.  I ran a few unit tests on both codes
> and found equivalent behavior, as expected.
>
> Do you have any recommendation(s) for this situation?  Are there known
> limitations of the streaming-quantiles code that would cause issues for my
> use case?  Are the other methods offered in datasketches 'better' than the
> KLL implemented in streaming-quantiles?  I'm quite out of my area of
> expertise, so I appreciate any advice you can offer, and I will of course
> acknowledge it in the publication.
>
> Best,
> Michael
>
> ------------------------------
> *From:* Edo Liberty <ed...@gmail.com>
> *Sent:* Tuesday, May 5, 2020 8:09 PM
> *To:* Lee Rhodes <lr...@verizonmedia.com>; Michael Himes <
> mhimes@knights.ucf.edu>
> *Cc:* edo@edoliberty.com <ed...@edoliberty.com>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> +Lee
>
> Hi Michael, Thanks for reaching out.
> While you can certainly do that, I recommend using the python-Binded
> datasketches library. It will be more robust, faster, and bug free than my
> code :)
>
> On Tue, May 5, 2020 at 14:11 Michael Himes <mh...@knights.ucf.edu> wrote:
>
> Hi Edo,
>
> I'm currently working on a Python package for machine-learning-accelerated
> exoplanet modeling.  It is free and open source (see here if you're curious
> https://github.com/exosports/HOMER),
> and it's meant purely for reproducible academic research.
>
> I'm adding some new features to the software, and one of them requires
> computing quantiles for a data set that cannot fit into memory.  After
> searching around for different methods to do this, your KLL method seemed
> to be a good option in terms of speed and space requirements.
>
> Rather than reinvent the wheel and code my own implementation of the
> method from scratch, I was wondering if you'd be willing to allow me to use
> your code?  I don't see a license, so I wanted to make sure you're okay
> with this.  I could implement it as a submodule within my repo, or I could
> only include the kll.py file and add some additional comments pointing to
> your repo and such, whichever you prefer.
>
> Best,
> Michael
>
> --
> From my cell phone.
>
>

Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Posted by Michael Himes <mh...@knights.ucf.edu>.
Thanks for taking a look, Jon.

I pushed an update that addresses 2 & 4.

#3 is actually something I had a question about. I've tested passing numpy.nan into the update function, and it doesn't appear to break anything (min, max, etc all still work correctly).  However, the reported number of items per sketch counts the nan entries.  Is this the expected behavior, or should the get_n() method return a number that does not count the nans it has seen?  I expected the latter, so I'm worried that numpy's nan is being treated differently.
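
In case it is useful, the check I ran was essentially the following (a minimal sketch; the k value and the get_max_value() call are illustrative):

import numpy as np
from datasketches import kll_floats_sketch

sk = kll_floats_sketch(200)              # k is illustrative
for x in [1.0, 2.0, np.nan, 3.0]:
    sk.update(float(x))

# If NaNs were ignored, get_n() would report 3; if they are counted, it reports 4.
print(sk.get_n())
print(sk.get_min_value(), sk.get_max_value())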

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>
Sent: Monday, May 11, 2020 4:32 PM
To: dev@datasketches.apache.org <de...@datasketches.apache.org>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

I didn't look in super close detail, but the code overall looks pretty good. Comments are below.

Note that not all of these necessarily need changes or replies. I'm just trying to document things we'll want to think about for keeping the library general-purpose (and we can always make changes after merging, of course).

1. I worry the name kll_sketches is confusingly similar to kll_sketch. Maybe vector_kll_sketches? But if there's a way to extend KLL in the future to operate on an entire vector at a time (vs treating each dimension independently) that'd become confusing. I think an inherently vectorized version would be a very different beast, but I always worry I'm not being imaginative enough. If merging into the Apache codebase, I'd probably wait to see what the file looks like with the renaming before a final decision on moving to its own file.

2. What happens if the input to update() has >2 dimensions? If that'd be invalid, we should explicitly check and complain. If it'll Do The Right Thing by operating on the first 2 dimensions (meaning correct indices) that's fine, but otherwise should probably complain. (A hedged sketch of such a check follows after this list.)

3. Can this handle sparse input vectors? Not sure how important that is in general, even if your project doesn't require it. kll_sketch will ignore NaNs, so those appearing would mean the number of items per sketch can already differ.

4. I'd probably eat the very slightly increased space and go with 32 bits for the number of dimensions (aka number of sketches). If trying to look at a distribution of values for some machine learning application, it'd be easy to overflow 65k dimensions for some tasks.

5. I imagine you've realized that it's easiest to do unit tests from python in this case. That's another advantage of having this live in the wrapper.

6. Finally, that assert issue is already obsolete :). Asserts were converted to if/throw exceptions late last week. It'll be flagged as a conflict in merging, so no worries for now.
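
To make #2 concrete, the kind of guard I have in mind looks roughly like this (pure-Python illustration with placeholder names; in practice the check would live in the C++ update() path):

import numpy as np

def check_update_input(items, num_sketches):
    # Accept a single vector of length d or a batch of shape (n, d);
    # reject anything with more than 2 dimensions.
    arr = np.asarray(items, dtype=np.float32)
    if arr.ndim == 1:
        arr = arr.reshape(1, -1)
    if arr.ndim != 2 or arr.shape[1] != num_sketches:
        raise ValueError("expected shape (d,) or (n, d) with d = %d, got %s"
                         % (num_sketches, str(arr.shape)))
    return arr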

Looking good at this point. And as I said, not all of these need changes or comments from you.

  jon

On Mon, May 11, 2020 at 7:09 AM Michael Himes <mh...@knights.ucf.edu>> wrote:
Understood, I went ahead and moved the new class to the kll_wrapper.cpp file -- I'll leave it to you to decide if it's better as its own file.

Also, while gcc 7.4.0 compiles the code without issue, using gcc 7.5.0 throws errors regarding the assert calls in kll_sketch_impl.hpp.  I added an include of assert.h there and then it compiled without issue.  It's possible that other compilers will also complain about that, so maybe this is a good update to the main branch.

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>>
Sent: Sunday, May 10, 2020 10:47 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

My only comment without having looked at actual code is that the new class would be more appropriate in the python wrapper. Maybe even drop it in as its own file, as that would decrease recompile time a bit when debugging (that's pybind's suggestion, anyway). Probably not a huge difference with how light these wrappers are.

If this is something that becomes widely used, to where we look at pushing it into the base library, we'd look at whether we could share any data across sketches. But we're far from that point currently. It'd be nice to need to consider that.

  jon

On Sun, May 10, 2020, 7:33 PM leerho <le...@gmail.com>> wrote:
Michael,  this has been a great interchange and certainly will allow us to move forward more quickly.

Thank you for working on this on a Mother's Day Sunday!

I'm sure Alex and Jon may have more questions, when they get a chance to look at it starting tomorrow.

Cheers, and be safe and well!

Lee.

On Sun, May 10, 2020 at 6:25 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Re: testing, so far I've just done glorified unit tests for uniform and normal distributions of varying sizes.  I plan to do some timing tests vs the existing single-sketch Python class to see how it compares for 1, 10, and 100 streams.

1. That makes sense.  One option to allow full Numpy compatibility but without requiring a Python user to use Numpy would be to return everything as lists, rather than Numpy arrays.  Numpy users could then convert those lists into arrays, and non-Numpy users would be unaffected (aside from needing the pybind11/numpy.h header).  Alternatively, some flag could be set when instantiating the object that would control whether things are returned as lists or arrays, though this still requires the numpy.h header file.

2. I didn't change the kll_sketch code; I only defined a new (wrapper) class called kll_sketches, which spawns a user-specified number of sketches.  Each of those sketches is a kll_sketch object and uses all of the existing code for that.  For fast execution in Python, the parallel sketches must be spawned in C++, but the existing Python object could only spawn a single sketch since it wraps the kll_sketch class.  Perhaps the kll_sketches class would be better placed in the python/src/kll_wrapper.cpp file?  I suppose you wouldn't need this class if you weren't using Python.

3. Yes, SerDe is very straight-forward here.  I've marked some stuff as todo's, and that is one of them -- the plan is to do like you described and call the relevant kll_sketch method on each of the sketches and return that to Python in a sensible format.  For deserialization, it would just iterate through them and load them into the kll_sketches object.  I don't require it for my project, so I didn't bother to wrap that yet -- I'll take a look sometime this week after I finish my work for the day, shouldn't take long to do.

4. That makes sense.  Does using Numpy complicate that at all?  My thought is that since under the hood everything is using the existing kll_sketch class, it would have full compatibility with the rest of the library (once SerDe is added in).

Michael
________________________________
From: leerho <le...@gmail.com>>
Sent: Sunday, May 10, 2020 8:42 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Thanks for the link to your code.  My colleagues, Jon and Alex, will take a closer look this next week.  They wrote this code so they are much closer to it than I.

What you have done so far makes sense for you as you want to get this working in the NumPy environment as quickly as possible.  As soon as we start thinking about incorporating this into our library other concerns become important.

1. Adding API calls is the recommended way to add functionality (like NumPy) to a library.  We cannot change API calls in a way that is only useful with NumPy, because it would seriously impact other users of the library that don't need NumPy.  If both sets of calls cannot simultaneously exist in the same sketch API, then we need to consider other alternatives.

2.  Based on our previous discussions, I didn't envision that you would have to change the kll_sketch code itself other than perhaps a "wrapper" class that enables vectorized input to a vector of sketches and a vectorized get result that creates a vector result from a vector of sketches.  This would isolate the changes you need for NumPy from the sketch itself.  This is also much easier to support, maintain and debug.

3. If you don't change the internals of the sketch then SerDe becomes pretty straightforward. I don't know if you need a single serialization that represents a full vector of sketches,  but if you do, then I would just iterate over the individual serdes and figure out how to package it.  I really don't think you want to have to rewrite this low-level stuff.

4. Binary compatibility is critically important for us and I think will be important for you as well.  There are two dimensions of binary compatibility: history and language.  This means that a kll sketch serialized from Java can be successfully read by C++ and vice versa.  Similarly, a kll sketch serialized today will be able to be read many years from now.  Another aspect of this would mean being able to collect, say, 100 sketches that were not created using the NumPy version, and being able to put them together in a NumPy vector; and vice versa.

I hope all of this makes sense to you.

Cheers,

Lee.



On Sun, May 10, 2020 at 4:21 PM leerho <le...@gmail.com>> wrote:
Michael,
This is great!  What testing have you been able to do so far?


On Sun, May 10, 2020 at 3:31 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Lee,

Thanks for all of that information, it's quite helpful to get a better understanding of things.

I've put the code on Github if you'd like to take a look: https://github.com/mdhimes/incubator-datasketches-cpp

Changes are
- new class in kll/include/kll_sketch.hpp, w/ associated constructor in kll/include/kll_sketch_impl.hpp.  This class spawns a specified number of sketches.
- new Python interface functions in python/src/kll_wrapper.cpp

The only new dependency introduced is the pybind11/numpy.h header file.  The new Numpy-compatible Python classes retain identical functionality to the existing classes (with minor changes to method names, e.g., get_min_value --> get_min_values), except that I have not yet implemented merging or (de)serialization.  These would be straight-forward to implement, if needed.

Re: characterization tests, I'll take a look at those tests you linked to and see about running them, time and compute resources permitting.

Michael
________________________________
From: leerho <le...@gmail.com>>
Sent: Sunday, May 10, 2020 5:32 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Michael,

Is there a place on GitHub somewhere where I could look at your code so far?  The reason I ask, is before you do a PR, we would like to determine where a contribution such as this should be placed.

Our library is split up among different repositories, determined by language and dependencies.  This keeps the user downloads smaller and more focused.   We have two library repos for the core sketch algorithms, one for Java and one for C++/Python, where the dependencies are very lean, which simplifies integration into other systems.  We have separate repos for adaptors, which depend on one of the core repos. On the Java side, we have separate repos for adaptors for Apache Hive and Apache Pig, as the dependencies for each of these are quite large.  For C++, we have a dedicated repo for the adaptors for PostgreSQL.

Some of our adaptors are hosted with the target system.  For example, our Druid adaptors were contributed directly into Apache Druid.

I assume your code has dependencies on Python, NumPy and DataSketches-cpp. It is not clear to me at the moment whether we should create a separate repo for this or have a separate group of directories in our cpp repo.

****
We have a separate repo for our characterization code, which is not formally "released" as an Apache release.  It exists because we want others to be able to reproduce (or challenge) our claims of speed performance or accuracy.  It is the one repo where we have all languages and many different dependencies.  The coding style is not as rigorous or as well documented as our repos that do have formal releases.

Characterization testing is distinctly different from Unit Tests, which basically checks all the main code paths and makes sure that the program works as it should.  The key metric is code coverage and Unit Tests should be fast as it is run on every check-in of new code.  Characterization is also different from Integration Testing, which is testing how well the code works when integrated into larger systems.

Characterization tests are unique to our kind of library. Because our algorithms are probabilistic in nature, in order to verify accuracy or speed performance we need to run many thousands of trials to eliminate statistical noise in the results.  And when the data is large, this can take a long time.  You can peruse our website for many examples as all the plots result from various characterization studies.  What appears on the website is but a small fraction of all the testing we have done.

There are no "standard" tests as every sketch is different so we have to decide what is important to measure for a particular sketch, but the basic groups are speed and accuracy.

For speed there are many possible measurements, but the basic ones are update speed, merge speed, Serialization / Deserialization speed, get estimate or get result speeds.

For accuracy we want to validate that the sketch is performing within the bounds of the theoretical error distribution.  We want to measure this accuracy in the context of a stand-alone, purely streaming sketch and also in the context of merging many sketches together.
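
As a very rough illustration, a single accuracy trial in Python could look something like the following (the sketch parameter, stream size, and trial count are placeholders; a real characterization run uses thousands of trials):

import numpy as np
from datasketches import kll_floats_sketch

def rank_error_one_trial(n=100_000, k=200, rank=0.5, seed=0):
    rng = np.random.default_rng(seed)
    data = rng.normal(size=n).astype(np.float32)
    sk = kll_floats_sketch(k)
    for x in data:                       # slow pure-Python loop; tolerable in a test harness
        sk.update(float(x))
    estimate = sk.get_quantile(rank)
    true_rank = float(np.mean(data <= estimate))   # where the estimate falls in the exact data
    return abs(true_rank - rank)

errors = [rank_error_one_trial(seed=s) for s in range(10)]   # use thousands of trials in practice
print(max(errors))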

We also try to do these same tests comparing the results against other alternatives users might have.  We have performed these same characterizations on other publically available sketches as well as against traditional, brute-force approaches to solving the same problem.

For the solution you have developed, we would depend on you to decide what properties would be most important to characterize for users of this solution.  It should be very similar to what you would write in a paper describing this solution;  you want to convince the reader that this is very useful and why.

Since the first sketch you have leveraged is the KLL quantiles sketch, I would think you would want some characterizations similar to what we did for our studies (https://datasketches.apache.org/docs/Quantiles/KLLSketch.html) comparing our older quantiles sketch and the KLL sketch.

****
For the Java characterization tests, we have "standardized" on having small configuration files which define the key parameters of the test.  These are simple text files (https://github.com/apache/incubator-datasketches-characterization/tree/master/src/main/resources) of key-value pairs.  We don't have any centralized definition of these pairs, just that they are human readable and intelligible.  They are different for each type of sketch.

For the C++ tests, we don't have a collection of config files yet (this is one of our TODOs), but the same kind of parameters are set in the code itself.

We will likely want to set up a separate directory for your characterization tests.

I hope you find this helpful.

Cheers,

Lee.

On Sun, May 10, 2020 at 10:05 AM Michael Himes <mh...@knights.ucf.edu>> wrote:
The code is in a good state now.  It can take individual values, lists, or Numpy arrays as input, and it returns back Numpy arrays.  There are some additional features, like being able to specify which sketches the user wants to, e.g., get quantiles for.
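
Roughly, the intended usage looks like this (a hedged sketch of the work in progress; the names kll_sketches, update(), and get_min_values() are the ones I've mentioned, while the constructor arguments, get_quantiles(), and the stream-selection argument are placeholders rather than final names):

import numpy as np
from datasketches import kll_sketches     # new class; not merged anywhere yet

sks = kll_sketches(200, 100)              # (k, number of streams) -- argument order illustrative

for _ in range(1000):
    sks.update(np.random.randn(100))      # one length-100 vector; element i feeds stream i

medians = sks.get_quantiles(0.5)              # hypothetical: one value per stream
subset  = sks.get_quantiles(0.5, [0, 5, 10])  # hypothetical: only selected streams
print(medians.shape, subset.shape, sks.get_min_values())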

But, I have only done minor testing with uniform and normal distributions.  I'd like to put it through more extensive testing (and some documentation) before releasing it, and it sounds like your characterization tests are the way to go -- it's not science if it's not reproducible!  Is there a standard set of tests for this purpose?  If not, are there standard tests that have been used for the existing codebase?

Michael
________________________________
From: leerho <le...@gmail.com>>
Sent: Saturday, May 9, 2020 7:21 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

This is great.  The first step is to get your project working!  Once you think you are ready, it would be really useful if you could do some characterization testing in the NumPy environment. Characterization tests are what we run to fully understand how a sketch performs over a range of parameters and using thousands to millions of trials.  You can see some of the accuracy and speed performance plots of various sketches on our website.  Sometimes these can take hours to run.  We typically use synthetic data to drive our characterization tests to make them reproducible.

Real data can also be used and one comparison test I would recommend is comparing how long it takes to get approximate results using sketches versus how long it would take to get exact results using brute force methods.  The bigger the data set is the better :)
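
For example, a bare-bones version of that comparison might look like the following in Python (the sketch parameter and the synthetic data are stand-ins for real data; with a vectorized wrapper the update loop would move into C++):

import time
import numpy as np
from datasketches import kll_floats_sketch

data = np.random.randn(1_000_000).astype(np.float32)   # stand-in for a real data set

t0 = time.time()
sk = kll_floats_sketch(200)
for x in data:                      # pure-Python loop; this is the part we want pushed into C++
    sk.update(float(x))
approx = sk.get_quantile(0.5)
t_sketch = time.time() - t0

t0 = time.time()
exact = np.quantile(data, 0.5)      # brute force: needs the whole data set in memory at once
t_exact = time.time() - t0

print(approx, exact, t_sketch, t_exact)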

We don't have much experience with NumPy so this will be a new environment for us.  But before you get too deep into this please get us involved.  We have been characterizing these streaming algorithms for a number of years, and would like to help you.

Cheers,

Lee.

On Sat, May 9, 2020 at 2:18 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
I'm not quite sure what being a committer entails, but yeah I'm happy to contribute.  I can't commit a lot of time to working on it, but with how things went for KLL I don't think it will take a lot of time for the other sketches if they are formatted in a similar manner.  Getting this library integrated into numpy/scipy would be awesome, I'm sure I could get some others in my field to begin using it.

Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Saturday, May 9, 2020 5:06 PM
To: Michael Himes <mh...@knights.ucf.edu>>; dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

This is just awesome!   Would you be interested in becoming a committer on our project?  It is not automatic, but we could work with you to bring you up to speed on the other sketches in the library.  If you could help us integrate DataSketches into NumPy and possibly SciPy (not sure if this is necessary) it would be a very significant contribution and we would definitely want you to be part of our community!

Thanks,

Lee.

On Sat, May 9, 2020 at 1:41 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Lee,

Thanks for the notice, I went ahead and subscribed to the list.

As for Jon's email, this is actually what I have currently implemented!  Once I finish ironing out a couple improvements, I'm going to move some code around to follow the existing coding style, put it on Github, and submit a pull request.

Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Saturday, May 9, 2020 4:22 PM
To: Michael Himes <mh...@knights.ucf.edu>>
Subject: Fwd: Permission to use KLL streaming-quantiles code in free open-source academic software

Hi Michael,
I don't think you saw this email as I doubt you are subscribed to our dev@datasketches.apache.org<ma...@datasketches.apache.org> email list.

We would like to have you as part of our larger community, as others might also have suggestions on how to move your project forward.
You can subscribe by sending an empty email to dev-subscribe@datasketches.apache.org<ma...@datasketches.apache.org>.

Lee.

---------- Forwarded message ---------
From: Jon Malkin <jo...@gmail.com>>
Date: Thu, May 7, 2020 at 4:11 PM
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software
To: <de...@datasketches.apache.org>>
Cc: Lee Rhodes <lr...@verizonmedia.com>>, Edo Liberty <ed...@gmail.com>>, edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>


We're using pybind11 to get a C++ interface with python (vs raw C). The wrappers themselves are quite thin, but they do have examples of calling functions defined in the wrapper as opposed to only the sketch object.

I believe the easiest way to do this will be to define a pretty simple C++ object and create a pybind wrapper for it.  That object would contain a std::vector<kll_sketch>.  Then you'd define an update method for your custom object that iterates through a numpy array and calls update() on the appropriate sketch. You'd also want to define something similar for get_quantile() or whatever other methods you need that iterates through that vector of sketches and returns the result in a numpy array.

That's a pretty lightweight object. And then you'd use a similar thin pybind wrapper around it to make it play nicely with python. Since our C++ library is just templates, you'd end up with a free-standing library, with no requirement that the base datasketches library be involved.
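
In pure Python the shape of that object is roughly the following (illustration only, with placeholder names; the point of doing it in C++ behind pybind11 is precisely to avoid the per-element Python calls you see here):

import numpy as np
from datasketches import kll_floats_sketch

class VectorOfKllSketches:                    # placeholder name
    def __init__(self, k, num_sketches):
        self.sketches = [kll_floats_sketch(k) for _ in range(num_sketches)]

    def update(self, items):
        items = np.asarray(items, dtype=np.float32)
        for sk, x in zip(self.sketches, items):   # one Python-to-C++ crossing per element
            sk.update(float(x))

    def get_quantiles(self, rank):
        return np.array([sk.get_quantile(rank) for sk in self.sketches])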

  jon

On Thu, May 7, 2020 at 1:08 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
I would be happy to share whatever I come up with (if anything).  The lack of a Numpy/Scipy implementation is what led me to the DataSketches library, it would be very useful to myself and others if it were a part of Numpy/Scipy.

For what it's worth, passing in a Numpy array and manipulating it from the C++ side is quite easy.  On the other hand, figuring out how to spawn m sketches and pass the values along to that looks like it'll be more challenging, there is a lot of code here and it'll take some time for me to familiarize myself with it.

Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Thursday, May 7, 2020 12:00 PM
To: Michael Himes <mh...@knights.ucf.edu>>
Cc: Edo Liberty <ed...@gmail.com>>; dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>; edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

If you do figure out how to do this, it would be great if you could share it with us.  We would like to extend it to other sketches and submit it as an added functionality to NumPy.  I have been looking at the NumPy and SciPy libraries and have not found anything close to what we have.

Lee.


On Thu, May 7, 2020 at 7:08 AM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Lee, Jon,

Thanks for the information.  I tried to vectorize things this morning and ran into that exact problem -- since the offsets can differ, it leads to slices of different lengths, which wouldn't be possible to store as a single Numpy array.

Lee, your understanding of my problem is spot on.  n vectors of size m, where all m elements of each vector are a float (no NaNs or missing values).  I am interested in quantiles at rank r for each of the m streams.  Only 1 sketch will operate simultaneously, saving/loading the sketch is not required (though it would be a nice feature), and sketches would not need to be merged (no serialization/deserialization).

Not surprisingly, it looks like your original suggestion of handling this on the C++ side is the way to go.  Once I have time to dive into the code, my plan is to write something that implements what you described in the earlier email.

Thanks,
Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Wednesday, May 6, 2020 10:43 PM
To: Michael Himes <mh...@knights.ucf.edu>>
Cc: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>; Edo Liberty <ed...@gmail.com>>; edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>

Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Michael,

One of my colleagues, Jon Malkin, pointed out that the vector-KLL will not work for another reason and that is for each dimension, choosing whether to delete the odd or even values in the compactor must be random and independent of the other dimensions.  Otherwise you might get unwanted correlation effects between the dimensions.

This is another argument that you should have independent compactors for each dimension.  So you might as well stick with individual sketches for each dimension.

Lee.

On Wed, May 6, 2020 at 4:39 PM Lee Rhodes <lr...@verizonmedia.com>> wrote:
Michael,

Allow me to back up for a moment to make sure I understand your problem.

You have a large number of large vectors of the form V_n = {x_i}:  n vectors of size m, where x is a number and x_i is the ith element, or equivalently, the ith dimension.

Assumptions:

  *   All vectors, V, are of the same size m.
  *   All elements, x_i, are valid numbers of the same type. No missing values, and if you are using floats, this means no NaNs.

In aggregate, the n vectors represent m independent distributions of values.

Your task is to be able to obtain m quantiles at rank r in a single query.

****
To do this, using your idea, would require vectorization of the entire sketch and not just the compactors.  The inputs are vectors, the result of operations such as getQuantile(r), getQuantileUpperBound(r), getQuantileLowerBound(r), are also vectors.

This sketch will be a large data structure, which leads to more questions ...

  *   Do you anticipate having many of these vectorized sketches operating simultaneously?
  *   Is there any requirement to store and later retrieve this sketch?
  *   Or, the nearly equivalent question: Do you require merging of these sketches (across clusters, for example)?  Which also means serialization and deserialization.

I am concerned that this vector-quantiles sketch would be limited in the sense that it may not be as widely applicable as it could be.

Our experience with real data is that it is ugly with missing values, NaN, nulls, etc.  Which means we would not be able to vectorize the compactor.  Each dimension i would need a separate independent compactor because the compaction times will vary depending on missing values or NaNs in the data.

Spacewise, I don't think having separate independent sketches for each dimension would be much smaller than vectorizing the entire sketch, because the internals of the existing sketch are already quite space efficient leveraging compact arrays, etc.

As a first step I would favor figuring out how to access the NumPy data structure on the C++ side, having individual sketches for each dimension, and doing the iterations updating the sketches in C++.   It also has the advantage of leveraging code that exists and it would automatically be able to leverage any improvements to the sketch code over time.  In addition, it could be a prototype of how to integrate other sketches into the NumPy ecosystem.

A fully vectorized sketch would be a separate implementation and would not be able to take advantage of these points.

Lee.

On Wed, May 6, 2020 at 2:47 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Lee,

I don't think there is a problem with the DataSketches library, just that it doesn't support what I am trying to do -- looking in the documentation, it only supports streams of ints or floats, and those situations work fine for me.  Here's what I did:
- began with the KLL test .py file: https://github.com/apache/incubator-datasketches-cpp/blob/master/python/tests/kll_test.py
- replaced line 30 with kll.update(np.ones(10) * randn())  to have a Numpy array of 10 identical values.
- ran the code

This leads to the following error, as expected:
TypeError: update(): incompatible function arguments. The following argument types are supported:
    1. (self: datasketches.kll_floats_sketch, item: float) -> None

Invoked with: <datasketches.kll_floats_sketch object at 0x7f1e128989d0>, array([-1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424,
       -1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424])

It's not coded to support Numpy arrays, therefore it complains.  What I would ideally like to have happen in this scenario is it would treat each element in the array as a separate stream.  Then, later when getting a given quantile, it would give 10 values, one for each stream.  I don't see an easy approach to implementing this on the Python side besides a very slow iterative approach, and admittedly my C++ is quite rusty so I haven't looked into the codebase to see how I might modify things there to support this functionality.

Re: the streaming-quantiles code being easily modified, I believe the only necessary changes would be changing the Compactor class to be a subclass of numpy.ndarray, rather than list, and implementing the list-specific methods that are used, like .append().  Then, it isn't necessary to loop over the streams since we can make use of Numpy's broadcasting, which will handle the looping in its C++ code, as you mentioned.  I'll work on this and see if it really is as straight-forward as it seems.

If you have any advice on how to use DataSketches for my problem, I'm certainly open to that.

Thanks,
Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Wednesday, May 6, 2020 4:37 PM
To: Michael Himes <mh...@knights.ucf.edu>>; dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Cc: Edo Liberty <ed...@gmail.com>>; edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Michael,

Thank you for considering the DataSketches library.   I am adding this thread to our dev@datasketches.apache.org<ma...@datasketches.apache.org> so that our whole team can contribute to finding a solution for you.

WRT the error you experienced, please help us help you by sharing with us what the exact error was.

We are about to release a major upgrade to the DataSketches C++/Python product in the next few weeks.  We have fixed a number of stability issues and bugs, which may solve the problem.  Nonetheless, we want to work with you to get your problem solved.

Updating 1e5 sketches in a system is not a problem in Java or C++.   We have real-time systems today that generate and process over 1e9 sketches every day.  Unfortunately our experience tells us that looping in Python code will be 10 to 100 times slower than Java or C++.  This is because the code would have to switch from Python to C++ for every vector element.

By comparison, the streaming-quantiles code could be easily modified to use Numpy arrays and operate on vectors.

I would like to understand more about what you have in mind that would be "easily modified".

NumPy achieves its speed performance by doing all of the matrix operations in pre-compiled C++ code.  To achieve best performance, we would want to read and loop through the NumPy data structure on the C++ side leveraging the C++ DataSketches library directly.  I am not sure what would be involved to actually accomplish that.

But first we need to get your Python + NumPy code working correctly with our library so we can find out what its actual performance is.

Cheers,

Lee.





On Wed, May 6, 2020 at 12:10 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Edo, Lee,

Thanks for the prompt response.  I looked at the datasketches library, and while it seems to have a lot more features, it looks like it'll be a lot more difficult to get it to work for my desired use case.

My problem is that I need quantiles for each element of a vector (length on the order of 1e4 -- 1e5), for some finite stream of vectors (on the order of 1e6 -- 1e8).  I tried using datasketches's KLL with Numpy arrays, but it throws an error, so it doesn't seem like datasketches handles this situation currently.

To use datasketches, I think I would need to instantiate 1 object per vector element, and I suspect this will slow things down considerably due to iterating over the objects when each vector is processed.  By comparison, the streaming-quantiles code could be easily modified to use Numpy arrays and operate on vectors.  I ran a few unit tests on both codes and found equivalent behavior, as expected.

Do you have any recommendation(s) for this situation?  Are there known limitations of the streaming-quantiles code that would cause issues for my use case?  Are the other methods offered in datasketches 'better' than the KLL implemented in streaming-quantiles?  I'm quite out of my area of expertise, so I appreciate any advice you can offer, and I will of course acknowledge it in the publication.

Best,
Michael

________________________________
From: Edo Liberty <ed...@gmail.com>>
Sent: Tuesday, May 5, 2020 8:09 PM
To: Lee Rhodes <lr...@verizonmedia.com>>; Michael Himes <mh...@knights.ucf.edu>>
Cc: edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

+Lee

Hi Michael, Thanks for reaching out.
While you can certainly do that, I recommend using the python-Binded datasketches library. It will be more robust, faster, and bug free than my code :)

On Tue, May 5, 2020 at 14:11 Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Edo,

I'm currently working on a Python package for machine-learning-accelerated exoplanet modeling.  It is free and open source (see here if you're curious https://github.com/exosports/HOMER), and it's meant purely for reproducible academic research.

I'm adding some new features to the software, and one of them requires computing quantiles for a data set that cannot fit into memory.  After searching around for different methods to do this, your KLL method seemed to be a good option in terms of speed and space requirements.

Rather than reinvent the wheel and code my own implementation of the method from scratch, I was wondering if you'd be willing to allow me to use your code?  I don't see a license, so I wanted to make sure you're okay with this.  I could implement it as a submodule within my repo, or I could only include the kll.py file and add some additional comments pointing to your repo and such, whichever you prefer.

Best,
Michael
--
From my cell phone.

Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Posted by Jon Malkin <jo...@gmail.com>.
I didn't look in super close detail, but the code overall looks pretty
good. Comments are below.

Note that not all of these necessarily need changes or replies. I'm just
trying to document things we'll want to think about for keeping the library
general-purpose (and we can always make changes after merging, of course).

1. I worry the name kll_sketches is confusingly similar to kll_sketch.
Maybe vector_kll_sketches? But if there's a way to extend KLL in the future
to operate on an entire vector at a time (vs treating each dimension
independently) that'd become confusing. I think an inherently vectorized
version would be a very different beast, but I always worry I'm not being
imaginative enough. If merging into the Apache codebase, I'd probably wait
to see what the file looks like with the renaming before a final decision
on moving to its own file.

2. What happens if the input to update() has >2 dimensions? If that'd be
invalid, we should explicitly check and complain. If it'll Do The Right
Thing by operating on the first 2 dimensions (meaning correct indices)
that's fine, but otherwise should probably complain.

3. Can this handle sparse input vectors? Not sure how important that is in
general, even if your project doesn't require it. kll_sketch will ignore
NaNs, so those appearing would mean the number of items per sketch can
already differ.

4. I'd probably eat the very slightly increased space and go with 32 bits
for the number of dimensions (aka number of sketches). If trying to look at
a distribution of values for some machine learning application, it'd be
easy to overflow 65k dimensions for some tasks.

5. I imagine you've realized that it's easiest to do unit tests from python
in this case. That's another advantage of having this live in the wrapper.

6. Finally, that assert issue is already obsolete :). Asserts were
> converted to if/throw exceptions late last week. It'll be flagged as a
conflict in merging, so no worries for now.

Looking good at this point. And as I said, not all of these need changes or
comments from you.

  jon

On Mon, May 11, 2020 at 7:09 AM Michael Himes <mh...@knights.ucf.edu>
wrote:

> Understood, I went ahead and moved the new class to the kll_wrapper.cpp
> file -- I'll leave it to you to decide if it's better as its own file.
>
> Also, while gcc 7.4.0 compiles the code without issue, using gcc 7.5.0
> throws errors regarding the assert calls in kll_sketch_impl.hpp.  I added
> an include of assert.h there and then it compiled without issue.  It's
> possible that other compilers will also complain about that, so maybe this
> is a good update to the main branch.
>
> Michael
> ------------------------------
> *From:* Jon Malkin <jo...@gmail.com>
> *Sent:* Sunday, May 10, 2020 10:47 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> My only comment without having looked at actual code is that the new class
> would be more appropriate in the python wrapper. Maybe even drop it in as
> its own file, as that would decrease recompile time a bit when debugging
> (that's pybind's suggestion, anyway). Probably not a huge difference with
> how light these wrappers are.
>
> If this is something that becomes widely used, to where we look at pushing
> it into the base library, we'd look at whether we could share any data
> across sketches. But we're far from that point currently. It'd be nice to
> need to consider that.
>
>   jon
>
> On Sun, May 10, 2020, 7:33 PM leerho <le...@gmail.com> wrote:
>
> Michael,  this has been a great interchange and certainly will allow us to
> move forward more quickly.
>
> Thank you for working on this on a Mother's Day Sunday!
>
> I'm sure Alex and Jon may have more questions, when they get a chance to
> look at it starting tomorrow.
>
> Cheers, and be safe and well!
>
> Lee.
>
> On Sun, May 10, 2020 at 6:25 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Re: testing, so far I've just done glorified unit tests for uniform and
> normal distributions of varying sizes.  I plan to do some timing tests vs
> the existing single-sketch Python class to see how it compares for 1, 10,
> and 100 streams.
>
> 1. That makes sense.  One option to allow full Numpy compatibility but
> without requiring a Python user to use Numpy would be to return everything
> as lists, rather than Numpy arrays.  Numpy users could then convert those
> lists into arrays, and non-Numpy users would be unaffected (aside from
> needing the pybind11/numpy.h header).  Alternatively, some flag could be
> set when instantiating the object that would control whether things are
> returned as lists or arrays, though this still requires the numpy.h header
> file.
>
> 2. I didn't change the kll_sketch code, I only defined a new (wrapper)
> class called kll_sketches, which spawns a user-specified number of
> sketches.  Each of those sketches is a kll_sketch object and uses all of
> the existing code for that.  For fast execution in Python, the parallel
> sketches must be spawned in C++, but the existing Python object could only
> spawn a single sketch since it wraps the kll_sketch class.  Perhaps the
> kll_sketches class would be better placed in the python/src/kll_wrapper.cpp
> file?  I suppose you wouldn't need this class if you weren't using Python.
>
> 3. Yes, SerDe is very straight-forward here.  I've marked some stuff as
> todo's, and that is one of them -- the plan is to do like you described and
> call the relevant kll_sketch method on each of the sketches and return that
> to Python in a sensible format.  For deserialization, it would just iterate
> through them and load them into the kll_sketches object.  I don't require
> it for my project, so I didn't bother to wrap that yet -- I'll take a look
> sometime this week after I finish my work for the day, shouldn't take long
> to do.
>
> 4. That makes sense.  Does using Numpy complicate that at all?  My thought
> is that since under the hood everything is using the existing kll_sketch
> class, it would have full compatibility with the rest of the library (once
> SerDe is added in).
>
> Michael
> ------------------------------
> *From:* leerho <le...@gmail.com>
> *Sent:* Sunday, May 10, 2020 8:42 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Thanks for the link to your code.  My colleagues, Jon and Alex, will take
> a closer look this next week.  They wrote this code so they are much closer
> to it than I.
>
> What you have done so far makes sense for you as you want to get this
> working in the NumPy environment as quickly as possible.  As soon as we
> start thinking about incorporating this into our library other concerns
> become important.
>
> 1. Adding API calls is the recommended way to add functionality (like
> NumPy) to a library.  We cannot change API calls in a way that is only
> useful with NumPy, because it would seriously impact other users of the
> library that don't need NumPy.  If both sets of calls cannot simultaneously
> exist in the same sketch API, then we need to consider other alternatives.
>
> 2.  Based on our previous discussions, I didn't envision that you would
> have to change the kll_sketch code itself other than perhaps a "wrapper"
> class that enables vectorized input to a vector of sketches and a
> vectorized get result that creates a vector result from a vector of
> sketches.  This would isolate the changes you need for NumPy from the
> sketch itself.  This is also much easier to support, maintain and debug.
>
> 3. If you don't change the internals of the sketch then SerDe becomes
> pretty straightforward. I don't know if you need a single serialization
> that represents a full vector of sketches,  but if you do, then I would
> just iterate over the individual serdes and figure out how to package it.
> I really don't think you want to have to rewrite this low-level stuff.
>
> 4. Binary compatibility is critically important for us and I think will be
> important for you as well.  There are two dimensions of binary
> compatibility: history and language.  This means that a kll sketch
> serialized from Java can be successfully read by C++ and vice versa.
> Similarly, a kll sketch serialized today will be able to be read many years
> from now.  Another aspect of this would mean being able to collect, say,
> 100 sketches that were not created using the NumPy version, and being able
> to put them together in a NumPy vector; and vice versa.
>
> I hope all of this makes sense to you.
>
> Cheers,
>
> Lee.
>
>
>
> On Sun, May 10, 2020 at 4:21 PM leerho <le...@gmail.com> wrote:
>
> Michael,
> This is great!  What testing have you been able to do so far?
>
>
> On Sun, May 10, 2020 at 3:31 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Lee,
>
> Thanks for all of that information, it's quite helpful to get a better
> understanding of things.
>
> I've put the code on Github if you'd like to take a look:
> https://github.com/mdhimes/incubator-datasketches-cpp
>
> Changes are
> - new class in kll/include/kll_sketch.hpp, w/ associated constructor in
> kll/include/kll_sketch_impl.hpp.  This class spawns a specified number of
> sketches.
> - new Python interface functions in python/src/kll_wrapper.cpp
>
> The only new dependency introduced is the pybind11/numpy.h header file.
> The new Numpy-compatible Python classes retain identical functionality to
> the existing classes (with minor changes to method names, e.g.,
> get_min_value --> get_min_values), except that I have not yet implemented
> merging or (de)serialization.  These would be straightforward to
> implement, if needed.
>
> Re: characterization tests, I'll take a look at those tests you linked to
> and see about running them, time and compute resources permitting.
>
> Michael
> ------------------------------
> *From:* leerho <le...@gmail.com>
> *Sent:* Sunday, May 10, 2020 5:32 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Michael,
>
> Is there a place on GitHub somewhere where I could look at your code so
> far?  The reason I ask, is before you do a PR, we would like to determine
> where a contribution such as this should be placed.
>
> Our library is split up among different repositories, determined by
> language and dependencies.  This keeps the user downloads smaller and more
> focused.   We have two library repos for the core sketch algorithms, one
> for Java and one for C++/Python, where the dependencies are very lean,
> which simplifies integration into other systems.  We have separate repos
> for adaptors, which depend on one of the core repos. On the Java side, we
> have separate repos for adaptors for Apache Hive and Apache Pig, as the
> dependencies for each of these are quite large.  For C++, we have a
> dedicated repo for the adaptors for PostgreSQL.
>
> Some of our adaptors are hosted with the target system.  For example, our
> Druid adaptors were contributed directly into Apache Druid.
>
> I assume your code has dependencies on Python, NumPy and DataSketches-cpp.
> It is not clear to me at the moment whether we should create a separate
> repo for this or have a separate group of directories in our cpp repo.
>
> ****
> We have a separate repo for our characterization code, which is not
> formally "released" as an Apache release.  It exists because we want others
> to be able to reproduce (or challenge) our claims of speed performance or
> accuracy.  It is the one repo where we have all languages and many
> different dependencies.  The coding style is not as rigorous or as well
> documented as our repos that do have formal releases.
>
> Characterization testing is distinctly different from Unit Tests, which
> basically check all the main code paths and make sure that the program
> works as it should.  The key metric is code coverage, and Unit Tests should
> be fast as they are run on every check-in of new code.  Characterization is
> also different from Integration Testing, which tests how well the code
> works when integrated into larger systems.
>
> Characterization tests are unique to our kind of library. Because our
> algorithms are probabilistic in nature, in order to verify accuracy or
> speed performance we need to run many thousands of trials to eliminate
> statistical noise in the results.  And when the data is large, this can
> take a long time.  You can peruse our website for many examples as all the
> plots result from various characterization studies.  What appears on the
> website is but a small fraction of all the testing we have done.
>
> There are no "standard" tests, as every sketch is different, so we have to
> decide what is important to measure for a particular sketch, but the basic
> groups are *speed* and *accuracy*.
>
> For speed there are many possible measurements, but the basic ones are
> update speed, merge speed, Serialization / Deserialization speed, get
> estimate or get result speeds.
>
> For accuracy we want to validate that the sketch is performing within the
> bounds of the theoretical error distribution.  We want to measure this
> accuracy in the context of a stand-alone, purely streaming sketch and also
> in the context of merging many sketches together.
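As a concrete illustration of an accuracy run (not one of the project's
official studies), the trial loop below streams uniform(0,1) data into a KLL
sketch and averages the observed rank error at one rank; for uniform(0,1) the
true quantile at rank r is r itself, so |get_quantile(r) - r| is the rank
error for that trial:

    #include <kll_sketch.hpp>
    #include <cmath>
    #include <iostream>
    #include <random>

    int main() {
      const int trials = 100;    // real studies use thousands of trials
      const int n = 1 << 16;     // and much longer streams
      const double rank = 0.5;
      std::mt19937_64 gen(42);
      std::uniform_real_distribution<float> dist(0.0f, 1.0f);
      double sum_abs_err = 0.0;
      for (int t = 0; t < trials; ++t) {
        datasketches::kll_sketch<float> sk(200);  // k = 200
        for (int i = 0; i < n; ++i) sk.update(dist(gen));
        sum_abs_err += std::abs(sk.get_quantile(rank) - rank);
      }
      std::cout << "mean |rank error| at r=" << rank << ": "
                << (sum_abs_err / trials) << std::endl;
      return 0;
    }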
>
> We also try to do these same tests comparing the results against other
> alternatives users might have.  We have performed these same
> characterizations on other publicly available sketches as well as against
> traditional, brute-force approaches to solving the same problem.
>
> For the solution you have developed, we would depend on you to decide what
> properties would be most important to characterize for users of this
> solution.  It should be very similar to what you would write in a paper
> describing this solution;  you want to convince the reader that this is
> very useful and why.
>
> Since the first sketch you have leveraged is the KLL quantiles sketch, I
> would think you would want some characterizations similar to what we did
> for our studies
> <https://datasketches.apache.org/docs/Quantiles/KLLSketch.html>
> comparing our older quantiles sketch and the KLL sketch.
>
> ****
> For the Java characterization tests, we have "standardized" on having
> small configuration files which define the key parameters of the test.
> These are simple text files
> <https://github.com/apache/incubator-datasketches-characterization/tree/master/src/main/resources>
> of key-value pairs.  We don't have any centralized definition of these
> pairs, just that they are human readable and intelligible.  They are
> different for each type of sketch.
>
> For the C++ tests, we don't have a collection of config files yet (this is
> one of our TODOs), but the same kind of parameters are set in the code
> itself.
>
> We will likely want to set up a separate directory for your
> characterization tests.
>
> I hope you find this helpful.
>
> Cheers,
>
> Lee.
>
> On Sun, May 10, 2020 at 10:05 AM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> The code is in a good state now.  It can take individual values, lists, or
> Numpy arrays as input, and it returns Numpy arrays.  There are some
> additional features, like being able to specify which sketches the user
> wants to query (e.g., get quantiles for).
>
> But, I have only done minor testing with uniform and normal
> distributions.  I'd like to put it through more extensive testing (and some
> documentation) before releasing it, and it sounds like your
> characterization tests are the way to go -- it's not science if it's not
> reproducible!  Is there a standard set of tests for this purpose?  If not,
> are there standard tests that have been used for the existing codebase?
>
> Michael
> ------------------------------
> *From:* leerho <le...@gmail.com>
> *Sent:* Saturday, May 9, 2020 7:21 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> This is great.  The first step is to get your project working!  Once you
> think you are ready, it would be really useful if you could do some
> characterization testing in the NumPy environment. Characterization tests
> are what we run to fully understand how a sketch performs over a range of
> parameters and using thousands to millions of trials.  You can see some of
> the accuracy and speed performance plots of various sketches on our
> website.  Sometimes these can take hours to run.  We typically use
> synthetic data to drive our characterization tests to make them
> reproducible.
>
> Real data can also be used and one comparison test I would recommend is
> comparing how long it takes to get approximate results using sketches
> versus how long it would take to get exact results using brute force
> methods.  The bigger the data set is the better :)
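A minimal sketch of such a comparison, using in-memory data for simplicity
(the real payoff comes when the data cannot fit in memory, where brute force
needs external sorting while the sketch keeps a small, fixed footprint):

    #include <kll_sketch.hpp>
    #include <algorithm>
    #include <chrono>
    #include <iostream>
    #include <random>
    #include <vector>

    int main() {
      const size_t n = 10000000;
      std::mt19937_64 gen(1);
      std::uniform_real_distribution<float> dist(0.0f, 1.0f);
      std::vector<float> data(n);
      for (auto& x : data) x = dist(gen);

      using clock = std::chrono::steady_clock;

      // Approximate: stream into a KLL sketch, then query.
      auto t0 = clock::now();
      datasketches::kll_sketch<float> sk;
      for (float x : data) sk.update(x);
      float approx_median = sk.get_quantile(0.5);
      auto t1 = clock::now();

      // Exact brute force: selection over the full data set.
      std::vector<float> copy = data;
      std::nth_element(copy.begin(), copy.begin() + n / 2, copy.end());
      float exact_median = copy[n / 2];
      auto t2 = clock::now();

      auto ms = [](clock::time_point a, clock::time_point b) {
        return std::chrono::duration_cast<std::chrono::milliseconds>(b - a).count();
      };
      std::cout << "sketch: " << approx_median << " (" << ms(t0, t1) << " ms), "
                << "exact: " << exact_median << " (" << ms(t1, t2) << " ms)\n";
      return 0;
    }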
>
> We don't have much experience with NumPy so this will be a new environment
> for us.  But before you get too deep into this please get us involved.  We
> have been characterizing these streaming algorithms for a number of years,
> and would like to help you.
>
> Cheers,
>
> Lee.
>
> On Sat, May 9, 2020 at 2:18 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> I'm not quite sure what being a committer entails, but yeah I'm happy to
> contribute.  I can't commit a lot of time to working on it, but with how
> things went for KLL I don't think it will take a lot of time for the other
> sketches if they are formatted in a similar manner.  Getting this library
> integrated into numpy/scipy would be awesome, I'm sure I could get some
> others in my field to begin using it.
>
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Saturday, May 9, 2020 5:06 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>; dev@datasketches.apache.org
> <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> This is just awesome!   Would you be interested in becoming a committer on
> our project?  It is not automatic, but we could work with you to bring you
> up to speed on the other sketches in the library.  If you could help us
> integrate DataSketches into NumPy and possibly SciPy (not sure if this is
> necessary) it would be a very significant contribution and we would
> definitely want you to be part of our community!
>
> Thanks,
>
> Lee.
>
> On Sat, May 9, 2020 at 1:41 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Lee,
>
> Thanks for the notice, I went ahead and subscribed to the list.
>
> As for Jon's email, this is actually what I have currently implemented!
> Once I finish ironing out a couple improvements, I'm going to move some
> code around to follow the existing coding style, put it on Github, and
> submit a pull request.
>
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Saturday, May 9, 2020 4:22 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>
> *Subject:* Fwd: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Hi Michael,
> I don't think you saw this email as I doubt you are subscribed to our
> dev@datasketches.apache.org email list.
>
> We would like to have you as part of our larger community, as others might
> also have suggestions on how to move your project forward.
> You can subscribe by sending an empty email to
> dev-subscribe@datasketches.apache.org.
>
> Lee.
>
> ---------- Forwarded message ---------
> From: *Jon Malkin* <jo...@gmail.com>
> Date: Thu, May 7, 2020 at 4:11 PM
> Subject: Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
> To: <de...@datasketches.apache.org>
> Cc: Lee Rhodes <lr...@verizonmedia.com>, Edo Liberty <
> edo.liberty@gmail.com>, edo@edoliberty.com <ed...@edoliberty.com>
>
>
> We're using pybind11 to get a C++ interface with python (vs raw C). The
> wrappers themselves are quite thin, but they do have examples of calling
> functions defined in the wrapper as opposed to only the sketch object.
>
> I believe the easiest way to do this will be to define a pretty simple C++
> object and create a pybind wrapper for it.  That object would contain a
> std::vector<kll_sketch>.  Then you'd define an update method for your
> custom object that iterates through a numpy array and calls update() on the
> appropriate sketch. You'd also want to define something similar for
> get_quantile() or whatever other methods you need that iterates through
> that vector of sketches and returns the result in a numpy array.
>
> That's a pretty lightweight object. And then you'd use a similar thin
> pybind wrapper around it to make it play nicely with python. Since our C++
> library is just templates, you'd end up with a free-standing library, with
> no requirement that the base datasketches library be involved.
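A minimal sketch of the kind of object described here, assuming pybind11 and
the library's kll_sketch<float>; the class name vector_of_kll and the module
name are invented for illustration and are not the code that was actually
contributed:

    #include <pybind11/pybind11.h>
    #include <pybind11/numpy.h>
    #include <kll_sketch.hpp>
    #include <cstdint>
    #include <stdexcept>
    #include <vector>

    namespace py = pybind11;

    class vector_of_kll {
     public:
      vector_of_kll(uint16_t k, uint32_t d) {
        sketches_.reserve(d);
        for (uint32_t i = 0; i < d; ++i) sketches_.emplace_back(k);
      }
      // one update per element of a 1-d numpy array of length d
      void update(py::array_t<float, py::array::c_style | py::array::forcecast> items) {
        auto buf = items.unchecked<1>();
        if (static_cast<size_t>(buf.shape(0)) != sketches_.size())
          throw std::invalid_argument("input length must equal number of sketches");
        for (size_t i = 0; i < sketches_.size(); ++i) sketches_[i].update(buf(i));
      }
      // returns a numpy array with one quantile per sketch
      py::array_t<float> get_quantiles(double rank) const {
        std::vector<float> out;
        out.reserve(sketches_.size());
        for (const auto& s : sketches_) out.push_back(s.get_quantile(rank));
        return py::array_t<float>(out.size(), out.data());
      }
     private:
      std::vector<datasketches::kll_sketch<float>> sketches_;
    };

    PYBIND11_MODULE(kll_vector_example, m) {
      py::class_<vector_of_kll>(m, "vector_of_kll")
          .def(py::init<uint16_t, uint32_t>(), py::arg("k"), py::arg("d"))
          .def("update", &vector_of_kll::update)
          .def("get_quantiles", &vector_of_kll::get_quantiles, py::arg("rank"));
    }

From Python this would then be used roughly as s = vector_of_kll(k=200, d=4);
s.update(np.asarray(row, dtype=np.float32)); s.get_quantiles(0.5).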
>
>   jon
>
> On Thu, May 7, 2020 at 1:08 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> I would be happy to share whatever I come up with (if anything).  The lack
> of a Numpy/Scipy implementation is what led me to the DataSketches library,
> it would be very useful to myself and others if it were a part of
> Numpy/Scipy.
>
> For what it's worth, passing in a Numpy array and manipulating it from the
> C++ side is quite easy.  On the other hand, figuring out how to spawn m
> sketches and pass the values along to that looks like it'll be more
> challenging; there is a lot of code here, and it'll take some time for me to
> familiarize myself with it.
>
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Thursday, May 7, 2020 12:00 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>
> *Cc:* Edo Liberty <ed...@gmail.com>; dev@datasketches.apache.org <
> dev@datasketches.apache.org>; edo@edoliberty.com <ed...@edoliberty.com>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> If you do figure out how to do this, it would be great if you could share
> it with us.  We would like to extend it to other sketches and submit it as
> added functionality to NumPy.  I have been looking at the NumPy and
> SciPy libraries and have not found anything close to what we have.
>
> Lee.
>
>
> On Thu, May 7, 2020 at 7:08 AM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Lee, Jon,
>
> Thanks for the information.  I tried to vectorize things this morning and
> ran into that exact problem -- since the offsets can differ, it leads to
> slices of different lengths, which wouldn't be possible to store as a
> single Numpy array.
>
> Lee, your understanding of my problem is spot on.  n vectors of size m,
> where all m elements of each vector are a float (no NaNs or missing
> values).  I am interested in quantiles at rank r for each of the m
> streams.  Only 1 sketch will operate simultaneously, saving/loading the
> sketch is not required (though it would be a nice feature), and sketches
> would not need to be merged (no serialization/deserialization).
>
> Not surprisingly, it looks like your original suggestion of handling this
> on the C++ side is the way to go.  Once I have time to dive into the code,
> my plan is to write something that implements what you described in the
> earlier email.
>
> Thanks,
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Wednesday, May 6, 2020 10:43 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>
> *Cc:* dev@datasketches.apache.org <de...@datasketches.apache.org>; Edo
> Liberty <ed...@gmail.com>; edo@edoliberty.com <ed...@edoliberty.com>
>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Michael,
>
> One of my colleagues, Jon Malkin, pointed out that the vector-KLL will not
> work for another reason and that is for each dimension, choosing whether to
> delete the odd or even values in the compactor must be random and
> independent of the other dimensions.  Otherwise you might get unwanted
> correlation effects between the dimensions.
>
> This is another argument that you should have independent compactors for
> each dimension.  So you might as well stick with individual sketches for
> each dimension.
>
> Lee.
>
> On Wed, May 6, 2020 at 4:39 PM Lee Rhodes <lr...@verizonmedia.com>
> wrote:
>
> Michael,
>
> Allow me to back up for a moment to make sure I understand your problem.
>
> You have a large number of large vectors of the form *V_n = {x_i}:*  *n*
> vectors of size *m*, where *x* is a *number* and *x_i* is the *i*th
> element, or equivalently, the *i*th dimension.
>
> Assumptions:
>
>    - All vectors, *V*, are of the same size *m.*
>    - All elements, *x_i*, are valid numbers of the same type. No missing
>    values, and if you are using *floats*, this means no *NaN*s.
>
> In aggregate, the *n* vectors represent *m* *independent* distributions
> of values.
>
> Your task is to be able to obtain *m* quantiles at rank *r* in a single
> query.
>
> ****
> To do this, using your idea, would require vectorization of the entire
> sketch and not just the compactors.  The inputs are vectors, the result of
> operations such as getQuantile(r), getQuantileUpperBound(r),
> getQuantileLowerBound(r), are also vectors.
>
> This sketch will be a large data structure, which leads to more questions
> ...
>
>    - Do you anticipate having many of these vectorized sketches operating
>    simultaneously?
>    - Is there any requirement to store and later retrieve this sketch?
>    - Or, the nearly equivalent question: Do you require merging of these
>    sketches (across clusters, for example)?  Which also means serialization
>    and deserialization.
>
> I am concerned that this vector-quantiles sketch would be limited in the
> sense that it may not be as widely applicable as it could be.
>
> Our experience with real data is that it is ugly with missing values, NaN,
> nulls, etc., which means we would not be able to vectorize the compactor.
> Each dimension *i* would need a separate independent compactor because
> the compaction times will vary depending on missing values or NaNs in the
> data.
>
> Spacewise, I don't think having separate independent sketches for each
> dimension would be much larger than vectorizing the entire sketch, because
> the internals of the existing sketch are already quite space efficient
> leveraging compact arrays, etc.
>
> As a first step I would favor figuring out how to access the NumPy data
> structure on the C++ side, having individual sketches for each
> dimension, and doing the iterations updating the sketches in C++.   It also
> has the advantage of leveraging code that exists and it would automatically
> be able to leverage any improvements to the sketch code over time.  In
> addition, it could be a prototype of how to integrate other sketches into
> the NumPy ecosystem.
>
> A fully vectorized sketch would be a separate implementation and would not
> be able to take advantage of these points.
>
> Lee.
>
>
>
> On Wed, May 6, 2020 at 2:47 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Lee,
>
> I don't think there is a problem with the DataSketches library, just that
> it doesn't support what I am trying to do -- looking in the documentation,
> it only supports streams of ints or floats, and those situations work fine
> for me.  Here's what I did:
> - began with the KLL test .py file:
> https://github.com/apache/incubator-datasketches-cpp/blob/master/python/tests/kll_test.py
> - replaced line 30 with kll.update(np.ones(10) * randn())  to have a Numpy
> array of 10 identical values.
> - ran the code
>
> This leads to the following error, as expected:
> TypeError: update(): incompatible function arguments. The following
> argument types are supported:
>     1. (self: datasketches.kll_floats_sketch, item: float) -> None
>
> Invoked with: <datasketches.kll_floats_sketch object at 0x7f1e128989d0>,
> array([-1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424,
>        -1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424])
>
> It's not coded to support Numpy arrays, so it complains.  What I
> would ideally like to have happen in this scenario is it would treat each
> element in the array as a separate stream.  Then, later when getting a
> given quantile, it would give 10 values, one for each stream.  I don't see
> an easy approach to implementing this on the Python side besides a very
> slow iterative approach, and admittedly my C++ is quite rusty so I haven't
> looked into the codebase to see how I might modify things there to support
> this functionality.
>
> Re: the streaming-quantiles code being easily modified, I believe the only
> necessary changes would be changing the Compactor class to be a subclass of
> numpy.ndarray, rather than list, and implementing the
> list-specific methods that are used, like .append().  Then, it isn't
> necessary to loop over the streams since we can make use of Numpy's
> broadcasting, which will handle the looping in its C++ code, as you
> mentioned.  I'll work on this and see if it really is as straight-forward
> as it seems.
>
> If you have any advice on how to use DataSketches for my problem, I'm
> certainly open to that.
>
> Thanks,
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Wednesday, May 6, 2020 4:37 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>; dev@datasketches.apache.org
> <de...@datasketches.apache.org>
> *Cc:* Edo Liberty <ed...@gmail.com>; edo@edoliberty.com <
> edo@edoliberty.com>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Michael,
>
> Thank you for considering the DataSketches library.   I am adding this
> thread to our dev@datasketches.apache.org so that our whole team can
> contribute to finding a solution for you.
>
> WRT the error you experienced, please help us help you by sharing with us
> what the exact error was.
>
> We are about to release a major upgrade to the DataSketches C++/Python
> product in the next few weeks.  We have fixed a number of stability issues
> and bugs, which may solve the problem.  Nonetheless, we want to work with
> you to get your problem solved.
>
> Updating 1e5 sketches in a system is not a problem in Java or C++.   We
> have real-time systems today that generate and process over 1e9 sketches
> every day.  Unfortunately our experience tells us that looping in Python
> code will be 10 to 100 times slower than Java or C++.  This is because the
> code would have to switch from Python to C++ for every vector element.
>
> By comparison, the streaming-quantiles code could be easily modified to
> use Numpy arrays and operate on vectors.
>
>
> I would like to understand more about what you have in mind that would be
> "easily modified".
>
> NumPy achieves its speed performance by doing all of the matrix operations
> in pre-compiled C++ code.  To achieve best performance, we would want to
> read and loop through the NumPy data structure on the C++ side leveraging
> the C++ DataSketches library directly.  I am not sure what would be
> involved to actually accomplish that.
>
> But first we need to get your Python + NumPy code working correctly with
> our library so we can find out what its actual performance is.
>
> Cheers,
>
> Lee.
>
>
>
>
>
> On Wed, May 6, 2020 at 12:10 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Edo, Lee,
>
> Thanks for the prompt response.  I looked at the datasketches library, and
> while it seems to have a lot more features, it looks like it'll be a lot
> more difficult to get it to work for my desired use case.
>
> My problem is that I need quantiles for each element of a vector (length
> on the order of 1e4 -- 1e5), for some finite stream of vectors (on the
> order of 1e6 -- 1e8).  I tried using datasketches's KLL with Numpy arrays,
> but it throws an error, so it doesn't seem like datasketches handles this
> situation currently.
>
> To use datasketches, I think I would need to instantiate 1 object per
> vector element, and I suspect this will slow things down considerably due
> to iterating over the objects when each vector is processed.  By
> comparison, the streaming-quantiles code could be easily modified to use
> Numpy arrays and operate on vectors.  I ran a few unit tests on both codes
> and found equivalent behavior, as expected.
>
> Do you have any recommendation(s) for this situation?  Are there known
> limitations of the streaming-quantiles code that would cause issues for my
> use case?  Are the other methods offered in datasketches 'better' than the
> KLL implemented in streaming-quantiles?  I'm quite out of my area of
> expertise, so I appreciate any advice you can offer, and I will of course
> acknowledge it in the publication.
>
> Best,
> Michael
>
> ------------------------------
> *From:* Edo Liberty <ed...@gmail.com>
> *Sent:* Tuesday, May 5, 2020 8:09 PM
> *To:* Lee Rhodes <lr...@verizonmedia.com>; Michael Himes <
> mhimes@knights.ucf.edu>
> *Cc:* edo@edoliberty.com <ed...@edoliberty.com>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> +Lee
>
> Hi Michael, Thanks for reaching out.
> While you can certainly do that, I recommend using the Python-bound
> datasketches library. It will be more robust, faster, and bug free than my
> code :)
>
> On Tue, May 5, 2020 at 14:11 Michael Himes <mh...@knights.ucf.edu> wrote:
>
> Hi Edo,
>
> I'm currently working on a Python package for machine-learning-accelerated
> exoplanet modeling.  It is free and open source (see here if you're curious
> https://github.com/exosports/HOMER),
> and it's meant purely for reproducible academic research.
>
> I'm adding some new features to the software, and one of them requires
> computing quantiles for a data set that cannot fit into memory.  After
> searching around for different methods to do this, your KLL method seemed
> to be a good option in terms of speed and space requirements.
>
> Rather than reinvent the wheel and code my own implementation of the
> method from scratch, I was wondering if you'd be willing to allow me to use
> your code?  I don't see a license, so I wanted to make sure you're okay
> with this.  I could implement it as a submodule within my repo, or I could
> only include the kll.py file and add some additional comments pointing to
> your repo and such, whichever you prefer.
>
> Best,
> Michael
>
> --
> From my cell phone.
>
>

Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Posted by Michael Himes <mh...@knights.ucf.edu>.
Understood, I went ahead and moved the new class to the kll_wrapper.cpp file -- I'll leave it to you to decide if it's better as its own file.

Also, while gcc 7.4.0 compiles the code without issue, using gcc 7.5.0 throws errors regarding the assert calls in kll_sketch_impl.hpp.  I added an include of assert.h there and then it compiled without issue.  It's possible that other compilers will also complain about that, so maybe this is a good update to the main branch.
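A minimal illustration of why the missing include matters (the assert below is
hypothetical, not the actual check in kll_sketch_impl.hpp): assert() is only
guaranteed to exist after including <assert.h> (or <cassert>), so relying on a
transitive include can break when the compiler or standard library changes.

    #include <cassert>

    int main() {
      const int k = 200;
      assert(k >= 8 && "k must be at least 8");  // needs the include above on stricter toolchains
      return 0;
    }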

Michael
________________________________
From: Jon Malkin <jo...@gmail.com>
Sent: Sunday, May 10, 2020 10:47 PM
To: dev@datasketches.apache.org <de...@datasketches.apache.org>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

My only comment without having looked at the actual code is that the new class would be more appropriate in the python wrapper. Maybe even drop it in as its own file, as that would decrease recompile time a bit when debugging (that's pybind's suggestion, anyway). Probably not a huge difference with how light these wrappers are.

If this is something that becomes widely used, to where we look at pushing it into the base library, we'd look at whether we could share any data across sketches. But we're far from that point currently. It'd be nice to need to consider that.

  jon

On Sun, May 10, 2020, 7:33 PM leerho <le...@gmail.com>> wrote:
Michael,  this has been a great interchange and certainly will allow us to move forward more quickly.

Thank you for working on this on a Mother's Day Sunday!

I'm sure Alex and Jon may have more questions, when they get a chance to look at it starting tomorrow.

Cheers, and be safe and well!

Lee.

On Sun, May 10, 2020 at 6:25 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Re: testing, so far I've just done glorified unit tests for uniform and normal distributions of varying sizes.  I plan to do some timing tests vs the existing single-sketch Python class to see how it compares for 1, 10, and 100 streams.

1. That makes sense.  One option to allow full Numpy compatibility but without requiring a Python user to use Numpy would be to return everything as lists, rather than Numpy arrays.  Numpy users could then convert those lists into arrays, and non-Numpy users would be unaffected (aside from needing the pybind11/numpy.h header).  Alternatively, some flag could be set when instantiating the object that would control whether things are returned as lists or arrays, though this still requires the numpy.h header file.

2. I didn't change the kll_sketch code, I only defined a new (wrapper) class called kll_sketches, which spawns a user-specified number of sketches.  Each of those sketches are kll_sketch objects and uses all of the existing code for that.  For fast execution in Python, the parallel sketches must be spawned in C++, but the existing Python object could only spawn a single sketch since it wraps the kll_sketch class.  Perhaps the kll_sketches class would be better placed in the python/src/kll_wrapper.cpp file?  I suppose you wouldn't need this class if you weren't using Python.

3. Yes, SerDe is very straight-forward here.  I've marked some stuff as todo's, and that is one of them -- the plan is to do like you described and call the relevant kll_sketch method on each of the sketches and return that to Python in a sensible format.  For deserialization, it would just iterate through them and load them into the kll_sketches object.  I don't require it for my project, so I didn't bother to wrap that yet -- I'll take a look sometime this week after I finish my work for the day, shouldn't take long to do.

4. That makes sense.  Does using Numpy complicate that at all?  My thought is that since under the hood everything is using the existing kll_sketch class, it would have full compatibility with the rest of the library (once SerDe is added in).

Michael
________________________________
From: leerho <le...@gmail.com>>
Sent: Sunday, May 10, 2020 8:42 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Thanks for the link to your code.  My colleagues, Jon and Alex, will take a closer look this next week.  They wrote this code so they are much closer to it than I.

What you have done so far makes sense for you as you want to get this working in the NumPy environment as quickly as possible.  As soon as we start thinking about incorporating this into our library other concerns become important.

1. Adding API calls is the recommended way to add functionality (like NumPy) to a library.  We cannot change API calls in a way that is only useful with NumPy, because it would seriously impact other users of the library that don't need NumPy.  If both sets of calls cannot simultaneously exist in the same sketch API, then we need to consider other alternatives.

2.  Based on our previous discussions, I didn't envision that you would have to change the kll_sketch code itself other than perhaps a "wrapper" class that enables vectorized input to a vector of sketches and a vectorized get result that creates a vector result from a vector of sketches.  This would isolate the changes you need for NumPy from the sketch itself.  This is also much easier to support, maintain and debug.

3. If you don't change the internals of the sketch then SerDe becomes pretty straightforward. I don't know if you need a single serialization that represents a full vector of sketches,  but if you do, then I would just iterate over the individual serdes and figure out how to package it.  I really don't think you want to have to rewrite this low-level stuff.

4. Binary compatibility is critically important for us and I think will be important for you as well.  There are two dimensions of binary compatibility: history and language.  This means that a kll sketch serialized from Java, can be successfully read by C++ and visa versa.  Similarly, a kll sketch serialized today will be able to be read many years from now.     Another aspect of this would mean being able to collect, say, 100 sketches that were not created using the NumPy version, and being able to put them together in a NumPy vector; and visa versa.

I hope all of this make sense to you.

Cheers,

Lee.



On Sun, May 10, 2020 at 4:21 PM leerho <le...@gmail.com>> wrote:
Michael,
This is great!  What testing have you been able to do so far?


On Sun, May 10, 2020 at 3:31 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Lee,

Thanks for all of that information, it's quite helpful to get a better understanding of things.

I've put the code on Github if you'd like to take a look: https://github.com/mdhimes/incubator-datasketches-cpp<https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fmdhimes%2Fincubator-datasketches-cpp&data=02%7C01%7Cmhimes%40knights.ucf.edu%7C234859f32c844cc7868808d7f555b8ee%7C5b16e18278b3412c919668342689eeb7%7C0%7C1%7C637247620859998147&sdata=ze5%2F%2BPsShBDuS%2FmAGTzqWsPgd4EvjTMMWSMGJnNQREs%3D&reserved=0>

Changes are
- new class in kll/include/kll_sketch.hpp, w/ associated constructor in kll/include/kll_sketch_impl.hpp.  This class spawns a specified number of sketches.
- new Python interface functions in python/src/kll_wrapper.cpp

The only new dependency introduced is the pybind11/numpy.h header file.  The new Numpy-compatible Python classes retain identical functionality to the existing classes (with minor changes to method names, e.g., get_min_value --> get_min_values), except that I have not yet implemented merging or (de)serialization.  These would be straight-forward to implement, if needed.

Re: characterization tests, I'll take a look at those tests you linked to and see about running them, time and compute resources permitting.

Michael
________________________________
From: leerho <le...@gmail.com>>
Sent: Sunday, May 10, 2020 5:32 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Michael,

Is there a place on GitHub somewhere where I could look at your code so far?  The reason I ask, is before you do a PR, we would like to determine where a contribution such as this should be placed.

Our library is split up among different repositories, determined by language and dependencies.  This keeps the user downloads smaller and more focused.   We have two library repos for the core sketch algorithms, one for Java and one for C++/Python, where the dependencies are very lean, which simplifies integration into other systems.  We have separate repos for adaptors, which depend on one of the core repos. On the Java side, we have separate repos for adaptors for Apache Hive and Apache Pig, as the dependencies for each of these are quite large.  For C++, we have a dedicated repo for the adaptors for PostgreSQL.

Some of our adaptors are hosted with the target system.  For example, our Druid adaptors were contributed directly into Apache Druid.

I assume your code has dependencies on Python, NumPy and DataSketches-cpp. It is not clear to me at the moment whether we should create a separate repo for this or have a separate group of directories in our cpp repo.

****
We have a separate repo for our characterization code, which is not formally "released" as an Apache release.  It exists because we want others to be able to reproduce (or challenge) our claims of speed performance or accuracy.  It is the one repo where we have all languages and many different dependencies.  The coding style is not as rigorous or as well documented as our repos that do have formal releases.

Characterization testing is distinctly different from Unit Tests, which basically checks all the main code paths and makes sure that the program works as it should.  The key metric is code coverage and Unit Tests should be fast as it is run on every check-in of new code.  Characterization is also different from Integration Testing, which is testing how well the code works when integrated into larger systems.

Characterization tests are unique to our kind of library. Because our algorithms are probabilistic in nature, in order to verify accuracy or speed performance we need to run many thousands of trials to eliminate statistical noise in the results.  And when the data is large, this can take a long time.  You can peruse our website for many examples as all the plots result from various characterization studies.  What appears on the website is but a small fraction of all the testing we have done.

There are no "standard" tests as every sketch is different so we have to decide what is important to measure for a particular sketch, but the basic groups are speed and accuracy.

For speed there are many possible measurements, but the basic ones are update speed, merge speed, Serialization / Deserialization speed, get estimate or get result speeds.

For accuracy we want to validate that the sketch is performing within the bounds of the theoretical error distribution.  We want to measure this accuracy in the context of a stand-alone, purely streaming sketch and also in the context of merging many sketches together.

We also try to do these same tests comparing the results against other alternatives users might have.  We have performed these same characterizations on other publically available sketches as well as against traditional, brute-force approaches to solving the same problem.

For the solution you have developed, we would depend on you to decide what properties would be most important to characterize for users of this solution.  It should be very similar to what you would write in a paper describing this solution;  you want to convince the reader that this is very useful and why.

Since the first sketch you have leveraged is the KLL quantiles sketch, I would think you would want some characterizations similar to what we did for our studies<https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdatasketches.apache.org%2Fdocs%2FQuantiles%2FKLLSketch.html&data=02%7C01%7Cmhimes%40knights.ucf.edu%7C234859f32c844cc7868808d7f555b8ee%7C5b16e18278b3412c919668342689eeb7%7C0%7C1%7C637247620859998147&sdata=ElEH7Qf75nJfVJFyAV3Fmkg%2B63Dh57XNKCFaF3SkR%2Fc%3D&reserved=0> comparing our older quantiles sketch and the KLL sketch.

****
For the Java characterization tests, we have "standardized" on having small configuration files which define the key parameters of the test.  These are simple text files<https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fincubator-datasketches-characterization%2Ftree%2Fmaster%2Fsrc%2Fmain%2Fresources&data=02%7C01%7Cmhimes%40knights.ucf.edu%7C234859f32c844cc7868808d7f555b8ee%7C5b16e18278b3412c919668342689eeb7%7C0%7C1%7C637247620860008143&sdata=Rl99tUJRBFmSX5Ij%2FqaIfMYASJ0vpqYQHuKLBE10GGQ%3D&reserved=0> of key-value pairs.  We don't have any centralized definition of these pairs, just that they are human readable and intelligible.  They are different for each type of sketch.

For the C++ tests, we don't have a collection of config files yet (this is one of our TODOs), but the same kind of parameters are set in the code itself.

We will likely want to set up a separate directory for your characterization tests.

I hope you find this helpful.

Cheers,

Lee.

On Sun, May 10, 2020 at 10:05 AM Michael Himes <mh...@knights.ucf.edu>> wrote:
The code is in a good state now.  It can take individual values, lists, or Numpy arrays as input, and it returns back Numpy arrays.  There are some additional features, like being able to specify which sketches the user wants to, e.g., get quantiles for.

But, I have only done minor testing with uniform and normal distributions.  I'd like to put it through more extensive testing (and some documentation) before releasing it, and it sounds like your characterization tests are the way to go -- it's not science if it's not reproducible!  Is there a standard set of tests for this purpose?  If not, are there standard tests that have been used for the existing codebase?

Michael
________________________________
From: leerho <le...@gmail.com>>
Sent: Saturday, May 9, 2020 7:21 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

This is great.  The first step is to get your project working!  Once you think you are ready, it would be really useful if you could do some characterization testing in the NumPy environment. Characterization tests are what we run to fully understand how a sketch performs over a range of parameters and using thousands to millions of trials.  You can see some of the accuracy and speed performance plots of various sketches on our website.  Sometimes these can take hours to run.  We typically use synthetic data to drive our characterization tests to make them reproducible.

Real data can also be used and one comparison test I would recommend is comparing how long it takes to get approximate results using sketches versus how long it would take to get exact results using brute force methods.  The bigger the data set is the better :)

We don't have much experience with NumPy so this will be a new environment for us.  But before you get too deep into this please get us involved.  We have been characterizing these streaming algorithms for a number of years, and would like to help you.

Cheers,

Lee.

On Sat, May 9, 2020 at 2:18 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
I'm not quite sure what being a committer entails, but yeah I'm happy to contribute.  I can't commit a lot of time to working on it, but with how things went for KLL I don't think it will take a lot of time for the other sketches if they are formatted in a similar manner.  Getting this library integrated into numpy/scipy would be awesome, I'm sure I could get some others in my field to begin using it.

Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Saturday, May 9, 2020 5:06 PM
To: Michael Himes <mh...@knights.ucf.edu>>; dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

This is just awesome!   Would you be interested in becoming a committer on our project?  It is not automatic, but we could work with you to bring you up to speed on the other sketches in the library.  If you could help us integrate DataSketches into NumPy and possibly SciPy (not sure if this is necessary) it would be a very significant contribution and we would definitely want you to be part of our community!

Thanks,

Lee.

On Sat, May 9, 2020 at 1:41 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Lee,

Thanks for the notice, I went ahead and subscribed to the list.

As for Jon's email, this is actually what I have currently implemented!  Once I finish ironing out a couple improvements, I'm going to move some code around to follow the existing coding style, put it on Github, and submit a pull request.

Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Saturday, May 9, 2020 4:22 PM
To: Michael Himes <mh...@knights.ucf.edu>>
Subject: Fwd: Permission to use KLL streaming-quantiles code in free open-source academic software

Hi Michael,
I don't think you saw this email as I doubt you are subscribed to our dev@datasketches.apache.org<ma...@datasketches.apache.org> email list.

We would like to have you as part of our larger community, as others might also have suggestions on how to move your project forward.
You can subscribe by sending an empty email to dev-subscribe@datasketches.apache.org<ma...@datasketches.apache.org>.

Lee.

---------- Forwarded message ---------
From: Jon Malkin <jo...@gmail.com>>
Date: Thu, May 7, 2020 at 4:11 PM
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software
To: <de...@datasketches.apache.org>>
Cc: Lee Rhodes <lr...@verizonmedia.com>>, Edo Liberty <ed...@gmail.com>>, edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>


We're using pybind11 to get a C++ interface with python (vs raw C). The wrappers themselves are quite thin, but they do have examples of calling functions defined in the wrapper as opposed to only the sketch object.

I believe the easiest way to do this will be to define a pretty simple C++ object and create a pybind wrapper for it.  That object would contain a std::vector<kll_sketch>.  Then you'd define an update method for your custom object that iterates through a numpy array and calls update() on the appropriate sketch. You'd also want to define something similar for get_quantile() or whatever other methods you need that iterates through that vector of sketches and returns the result in a numpy array.

That's a pretty lightweight object. And then you'd use a similar thin pybind wrapper around it to make it play nicely with python. Since our C++ library is just templates, you'd end up with a free-standing library, with no requirement that the base datasketches library be involved.

  jon

On Thu, May 7, 2020 at 1:08 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
I would be happy to share whatever I come up with (if anything).  The lack of a Numpy/Scipy implementation is what led me to the DataSketches library, it would be very useful to myself and others if it were a part of Numpy/Scipy.

For what it's worth, passing in a Numpy array and manipulating it from the C++ side is quite easy.  On the other hand, figuring out how to spawn m sketches and pass the values along to that looks like it'll be more challenging, there is a lot of code here and it'll take some time for me to familiarize myself with it.

Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Thursday, May 7, 2020 12:00 PM
To: Michael Himes <mh...@knights.ucf.edu>>
Cc: Edo Liberty <ed...@gmail.com>>; dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>; edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

If you do figure out how to do this, it would be great if you could share it with us.  We would like to extend  it to other sketches and submit it as an added functionality to NumPy.  I have been looking at the NomPy and SciPy libraries and have not found anything close to what we have.

Lee.


On Thu, May 7, 2020 at 7:08 AM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Lee, Jon,

Thanks for the information.  I tried to vectorize things this morning and ran into that exact problem -- since the offsets can differ, it leads to slices of different lengths, which wouldn't be possible to store as a single Numpy array.

Lee, your understanding of my problem is spot on.  n vectors of size m, where all m elements of each vector are a float (no NaNs or missing values).  I am interested in quantiles at rank r for each of the m streams.  Only 1 sketch will operate simultaneously, saving/loading the sketch is not required (though it would be a nice feature), and sketches would not need to be merged (no serialization/deserialization).

Not surprisingly, it looks like your original suggestion of handling this on the C++ side is the way to go.  Once I have time to dive into the code, my plan is to write something that implements what you described in the earlier email.

Thanks,
Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Wednesday, May 6, 2020 10:43 PM
To: Michael Himes <mh...@knights.ucf.edu>>
Cc: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>; Edo Liberty <ed...@gmail.com>>; edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>

Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Michael,

One of my colleagues, Jon Malkin, pointed out that the vector-KLL will not work for another reason and that is for each dimension, choosing whether to delete the odd or even values in the compactor must be random and independent of the other dimensions.  Otherwise you might get unwanted correlation effects between the dimensions.

This is another argument that you should have independent compactors for each dimension.  So you might as well stick with individual sketches for each dimension.

Lee.

On Wed, May 6, 2020 at 4:39 PM Lee Rhodes <lr...@verizonmedia.com>> wrote:
Michael,

Allow me to back up for a moment to make sure I understand your problem.

You have a large number of large vectors of the form V_n = {x_i}:  n vectors of size m, where x is a number and x_i is the ith element, or equivalently, the ith dimension.

Assumptions:

  *   All vectors, V, are of the same size m.
  *   All elements, x_i, are valid numbers of the same type. No missing values, and if you are using floats, this means no NaNs.

In aggregate, the n vectors represent m independent distributions of values.

Your task is to be able to obtain m quantiles at rank r in a single query.

****
To do this, using your idea, would require vectorization of the entire sketch and not just the compactors.  The inputs are vectors, the result of operations such as getQuantile(r), getQuantileUpperBound(r), getQuantileLowerBound(r), are also vectors.

This sketch will be a large data structure, which leads to more questions ...

  *   Do you anticipate having many of these vectorized sketches operating simultaneously?
  *   Is there any requirement to store and later retrieve this sketch?
  *   Or, the nearly equivalent question: Do you require merging of these sketches (across clusters, for example)?  Which also means serialization and deserialization.

I am concerned that this vector-quantiles sketch would be limited in the sense that it may not be as widely applicable as it could be.

Our experience with real data is that it is ugly with missing values, NaN, nulls, etc.  Which means we would not be able to vectorize the compactor.  Each dimension i would need a separate independent compactor because the compaction times will vary depending on missing values or NaNs in the data.

Spacewise, I don't think having separate independent sketches for each dimension would be much smaller than vectorizing the entire sketch, because the internals of the existing sketch are already quite space efficient leveraging compact arrays, etc.

As a first step I would favor figuring out how to access the NumPy data structure on the C++ side, having individual sketches for each dimension, and doing the iterations updating the sketches in C++.   It also has the advantage of leveraging code that exists and it would automatically be able to leverage any improvements to the sketch code over time.  In addition, it could be a prototype of how to integrate other sketches into the NumPy ecosystem.

A fully vectorized sketch would be a separate implementation and would not be able to take advantage of these points.

Lee.

























On Wed, May 6, 2020 at 2:47 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Lee,

I don't think there is a problem with the DataSketches library, just that it doesn't support what I am trying to do -- looking in the documentation, it only supports streams of ints or floats, and those situations work fine for me.  Here's what I did:
- began with the KLL test .py file: https://github.com/apache/incubator-datasketches-cpp/blob/master/python/tests/kll_test.py<https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fincubator-datasketches-cpp%2Fblob%2Fmaster%2Fpython%2Ftests%2Fkll_test.py&data=02%7C01%7Cmhimes%40knights.ucf.edu%7C234859f32c844cc7868808d7f555b8ee%7C5b16e18278b3412c919668342689eeb7%7C0%7C1%7C637247620860018138&sdata=Bmg5LWKR0Vnp198x9Sa7Y2dG9JvsM%2BHXCtoA9A3MIRo%3D&reserved=0>
- replaced line 30 with kll.update(np.ones(10) * randn())  to have a Numpy array of 10 identical values.
- ran the code

This leads to the following error, as expected:
TypeError: update(): incompatible function arguments. The following argument types are supported:
    1. (self: datasketches.kll_floats_sketch, item: float) -> None

Invoked with: <datasketches.kll_floats_sketch object at 0x7f1e128989d0>, array([-1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424,
       -1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424])

It's not coded to support Numpy arrays, therefore it complains.  What I would ideally like to have happen in this scenario is it would treat each element in the array as a separate stream.  Then, later when getting a given quantile, it would give 10 values, one for each stream.  I don't see an easy approach to implementing this on the Python side besides a very slow iterative approach, and admittedly my C++ is quite rusty so I haven't looked into the codebase to see how I might modify things there to support this functionality.

Re: the streaming-quantiles code being easily modified, I believe the only necessary changes would be changing the Compactor class to be a subclass of numpy.ndarray, rather than list, and implementing methods for the list-specific methods that are used, like .append().  Then, it isn't necessary to loop over the streams since we can make use of Numpy's broadcasting, which will handle the looping in its C++ code, as you mentioned.  I'll work on this and see if it really is as straight-forward as it seems.

If you have any advice on how to use DataSketches for my problem, I'm certainly open to that.

Thanks,
Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Wednesday, May 6, 2020 4:37 PM
To: Michael Himes <mh...@knights.ucf.edu>>; dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Cc: Edo Liberty <ed...@gmail.com>>; edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Michael,

Thank you for considering the DataSketches library.   I am adding this thread to our dev@datasketches.apache.org so that our whole team can contribute to finding a solution for you.

WRT the error you experienced, please help us help you by sharing with us what the exact error was.

We are about to release a major upgrade to the DataSketches C++/Python product in the next few weeks.  We have fixed a number of stability issues and bugs, which may solve the problem.  Nonetheless, we want to work with you to get your problem solved.

Updating 1e5 sketches in a system is not a problem in Java or C++.   We have real-time systems today that generate and process over 1e9 sketches every day.  Unfortunately our experience tells us that looping in Python code will be 10 to 100 times slower than Java or C++.  This is because the code would have to switch from Python to C++ for every vector element.

By comparison, the streaming-quantiles code could be easily modified to use Numpy arrays and operate on vectors.

I would like to understand more about what you have in mind that would be "easily modified".

NumPy achieves its speed performance by doing all of the matrix operations in pre-compiled C++ code.  To achieve best performance, we would want to read and loop through the NumPy data structure on the C++ side leveraging the C++ DataSketches library directly.  I am not sure what would be involved to actually accomplish that.

But first we need to get your Python + NumPy code working correctly with our library so we can find out what its actual performance is.

Cheers,

Lee.





On Wed, May 6, 2020 at 12:10 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Edo, Lee,

Thanks for the prompt response.  I looked at the datasketches library, and while it seems to have a lot more features, it looks like it'll be a lot more difficult to get it to work for my desired use case.

My problem is that I need quantiles for each element of a vector (length on the order of 1e4 -- 1e5), for some finite stream of vectors (on the order of 1e6 -- 1e8).  I tried using datasketches's KLL with Numpy arrays, but it throws an error, so it doesn't seem like datasketches handles this situation currently.

To use datasketches, I think I would need to instantiate 1 object per vector element, and I suspect this will slow things down considerably due to iterating over the objects when each vector is processed.  By comparison, the streaming-quantiles code could be easily modified to use Numpy arrays and operate on vectors.  I ran a few unit tests on both codes and found equivalent behavior, as expected.

Do you have any recommendation(s) for this situation?  Are there known limitations of the streaming-quantiles code that would cause issues for my use case?  Are the other methods offered in datasketches 'better' than the KLL implemented in streaming-quantiles?  I'm quite out of my area of expertise, so I appreciate any advice you can offer, and I will of course acknowledge it in the publication.

Best,
Michael

________________________________
From: Edo Liberty <ed...@gmail.com>>
Sent: Tuesday, May 5, 2020 8:09 PM
To: Lee Rhodes <lr...@verizonmedia.com>>; Michael Himes <mh...@knights.ucf.edu>>
Cc: edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

+Lee

Hi Michael, Thanks for reaching out.
While you can certainly do that, I recommend using the Python-bound datasketches library. It will be more robust, faster, and bug-free than my code :)

On Tue, May 5, 2020 at 14:11 Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Edo,

I'm currently working on a Python package for machine-learning-accelerated exoplanet modeling.  It is free and open source (see here if you're curious https://github.com/exosports/HOMER), and it's meant purely for reproducible academic research.

I'm adding some new features to the software, and one of them requires computing quantiles for a data set that cannot fit into memory.  After searching around for different methods to do this, your KLL method seemed to be a good option in terms of speed and space requirements.

Rather than reinvent the wheel and code my own implementation of the method from scratch, I was wondering if you'd be willing to allow me to use your code?  I don't see a license, so I wanted to make sure you're okay with this.  I could implement it as a submodule within my repo, or I could only include the kll.py file and add some additional comments pointing to your repo and such, whichever you prefer.

Best,
Michael
--
From my cell phone.

Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Posted by Jon Malkin <jo...@gmail.com>.
My only comment without having looked at actual code is that the new class
would be more appropriate in the python wrapper. Maybe even drop it in as
its own file, as that would decrease recompile time a bit when debugging
(that's pybind's suggestion, anyway). Probably not a huge difference with
how light these wrappers are.

If this is something that becomes widely used, to where we look at pushing
it into the base library, we'd look at whether we could share any data
across sketches. But we're far from that point currently. It'd be nice to
get to the point where we need to consider that.

  jon

On Sun, May 10, 2020, 7:33 PM leerho <le...@gmail.com> wrote:

> Michael,  this has been a great interchange and certainly will allow us to
> move forward more quickly.
>
> Thank you for working on this on a Mother's Day Sunday!
>
> I'm sure Alex and Jon may have more questions, when they get a chance to
> look at it starting tomorrow.
>
> Cheers, and be safe and well!
>
> Lee.
>
> On Sun, May 10, 2020 at 6:25 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
>> Re: testing, so far I've just done glorified unit tests for uniform and
>> normal distributions of varying sizes.  I plan to do some timing tests vs
>> the existing single-sketch Python class to see how it compares for 1, 10,
>> and 100 streams.
>>
>> 1. That makes sense.  One option to allow full Numpy compatibility but
>> without requiring a Python user to use Numpy would be to return everything
>> as lists, rather than Numpy arrays.  Numpy users could then convert those
>> lists into arrays, and non-Numpy users would be unaffected (aside from
>> needing the pybind11/numpy.h header).  Alternatively, some flag could be
>> set when instantiating the object that would control whether things are
>> returned as lists or arrays, though this still requires the numpy.h header
>> file.
>>
>> 2. I didn't change the kll_sketch code, I only defined a new (wrapper)
>> class called kll_sketches, which spawns a user-specified number of
>> sketches.  Each of those sketches is a kll_sketch object and uses all of
>> the existing code for that.  For fast execution in Python, the parallel
>> sketches must be spawned in C++, but the existing Python object could only
>> spawn a single sketch since it wraps the kll_sketch class.  Perhaps the
>> kll_sketches class would be better placed in the python/src/kll_wrapper.cpp
>> file?  I suppose you wouldn't need this class if you weren't using Python.
>>
>> 3. Yes, SerDe is very straight-forward here.  I've marked some stuff as
>> todo's, and that is one of them -- the plan is to do like you described and
>> call the relevant kll_sketch method on each of the sketches and return that
>> to Python in a sensible format.  For deserialization, it would just iterate
>> through them and load them into the kll_sketches object.  I don't require
>> it for my project, so I didn't bother to wrap that yet -- I'll take a look
>> sometime this week after I finish my work for the day, shouldn't take long
>> to do.
>>
>> 4. That makes sense.  Does using Numpy complicate that at all?  My
>> thought is that since under the hood everything is using the existing
>> kll_sketch class, it would have full compatibility with the rest of the
>> library (once SerDe is added in).
>>
>> Michael
>> ------------------------------
>> *From:* leerho <le...@gmail.com>
>> *Sent:* Sunday, May 10, 2020 8:42 PM
>> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> Thanks for the link to your code.  My colleagues, Jon and Alex, will take
>> a closer look this next week.  They wrote this code so they are much closer
>> to it than I.
>>
>> What you have done so far makes sense for you as you want to get this
>> working in the NumPy environment as quickly as possible.  As soon as we
>> start thinking about incorporating this into our library other concerns
>> become important.
>>
>> 1. Adding API calls is the recommended way to add functionality (like
>> NumPy) to a library.  We cannot change API calls in a way that is only
>> useful with NumPy, because it would seriously impact other users of the
>> library that don't need NumPy.  If both sets of calls cannot simultaneously
>> exist in the same sketch API, then we need to consider other alternatives.
>>
>> 2.  Based on our previous discussions, I didn't envision that you would
>> have to change the kll_sketch code itself other than perhaps a "wrapper"
>> class that enables vectorized input to a vector of sketches and a
>> vectorized get result that creates a vector result from a vector of
>> sketches.  This would isolate the changes you need for NumPy from the
>> sketch itself.  This is also much easier to support, maintain and debug.
>>
>> 3. If you don't change the internals of the sketch then SerDe becomes
>> pretty straightforward. I don't know if you need a single serialization
>> that represents a full vector of sketches,  but if you do, then I would
>> just iterate over the individual serdes and figure out how to package it.
>> I really don't think you want to have to rewrite this low-level stuff.
>>
>> 4. Binary compatibility is critically important for us and I think will
>> be important for you as well.  There are two dimensions of binary
>> compatibility: history and language.  This means that a kll sketch
>> serialized from Java can be successfully read by C++ and vice versa.
>> Similarly, a kll sketch serialized today will be able to be read many years
>> from now.     Another aspect of this would mean being able to collect, say,
>> 100 sketches that were not created using the NumPy version, and being able
>> to put them together in a NumPy vector; and vice versa.
>>
>> I hope all of this makes sense to you.
>>
>> Cheers,
>>
>> Lee.
>>
>>
>>
>> On Sun, May 10, 2020 at 4:21 PM leerho <le...@gmail.com> wrote:
>>
>> Michael,
>> This is great!  What testing have you been able to do so far?
>>
>>
>> On Sun, May 10, 2020 at 3:31 PM Michael Himes <mh...@knights.ucf.edu>
>> wrote:
>>
>> Lee,
>>
>> Thanks for all of that information, it's quite helpful to get a better
>> understanding of things.
>>
>> I've put the code on Github if you'd like to take a look:
>> https://github.com/mdhimes/incubator-datasketches-cpp
>>
>> Changes are
>> - new class in kll/include/kll_sketch.hpp, w/ associated constructor in
>> kll/include/kll_sketch_impl.hpp.  This class spawns a specified number of
>> sketches.
>> - new Python interface functions in python/src/kll_wrapper.cpp
>>
>> The only new dependency introduced is the pybind11/numpy.h header file.
>> The new Numpy-compatible Python classes retain identical functionality to
>> the existing classes (with minor changes to method names, e.g.,
>> get_min_value --> get_min_values), except that I have not yet implemented
>> merging or (de)serialization.  These would be straight-forward to
>> implement, if needed.
>>
>> Re: characterization tests, I'll take a look at those tests you linked to
>> and see about running them, time and compute resources permitting.
>>
>> Michael
>> ------------------------------
>> *From:* leerho <le...@gmail.com>
>> *Sent:* Sunday, May 10, 2020 5:32 PM
>> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> Michael,
>>
>> Is there a place on GitHub somewhere where I could look at your code so
>> far?  The reason I ask, is before you do a PR, we would like to determine
>> where a contribution such as this should be placed.
>>
>> Our library is split up among different repositories, determined by
>> language and dependencies.  This keeps the user downloads smaller and more
>> focused.   We have two library repos for the core sketch algorithms, one
>> for Java and one for C++/Python, where the dependencies are very lean,
>> which simplifies integration into other systems.  We have separate repos
>> for adaptors, which depend on one of the core repos. On the Java side, we
>> have separate repos for adaptors for Apache Hive and Apache Pig, as the
>> dependencies for each of these are quite large.  For C++, we have a
>> dedicated repo for the adaptors for PostgreSQL.
>>
>> Some of our adaptors are hosted with the target system.  For example, our
>> Druid adaptors were contributed directly into Apache Druid.
>>
>> I assume your code has dependencies on Python, NumPy and
>> DataSketches-cpp. It is not clear to me at the moment whether we should
>> create a separate repo for this or have a separate group of directories in
>> our cpp repo.
>>
>> ****
>> We have a separate repo for our characterization code, which is not
>> formally "released" as an Apache release.  It exists because we want others
>> to be able to reproduce (or challenge) our claims of speed performance or
>> accuracy.  It is the one repo where we have all languages and many
>> different dependencies.  The coding style is not as rigorous or as well
>> documented as our repos that do have formal releases.
>>
>> Characterization testing is distinctly different from Unit Tests, which
>> basically checks all the main code paths and makes sure that the program
>> works as it should.  The key metric is code coverage and Unit Tests should
>> be fast as it is run on every check-in of new code.  Characterization is
>> also different from Integration Testing, which is testing how well the code
>> works when integrated into larger systems.
>>
>> Characterization tests are unique to our kind of library. Because our
>> algorithms are probabilistic in nature, in order to verify accuracy or
>> speed performance we need to run many thousands of trials to eliminate
>> statistical noise in the results.  And when the data is large, this can
>> take a long time.  You can peruse our website for many examples as all the
>> plots result from various characterization studies.  What appears on the
>> website is but a small fraction of all the testing we have done.
>>
>> There are no "standard" tests as every sketch is different so we have to
>> decide what is important to measure for a particular sketch, but the basic
>> groups are *speed* and *accuracy*.
>>
>> For speed there are many possible measurements, but the basic ones are
>> update speed, merge speed, Serialization / Deserialization speed, get
>> estimate or get result speeds.
>>
>> For accuracy we want to validate that the sketch is performing within the
>> bounds of the theoretical error distribution.  We want to measure this
>> accuracy in the context of a stand-alone, purely streaming sketch and also
>> in the context of merging many sketches together.
>>
>> We also try to do these same tests comparing the results against other
>> alternatives users might have.  We have performed these same
>> characterizations on other publicly available sketches as well as against
>> traditional, brute-force approaches to solving the same problem.
>>
>> For the solution you have developed, we would depend on you to decide
>> what properties would be most important to characterize for users of this
>> solution.  It should be very similar to what you would write in a paper
>> describing this solution;  you want to convince the reader that this is
>> very useful and why.
>>
>> Since the first sketch you have leveraged is the KLL quantiles sketch, I
>> would think you would want some characterizations similar to what we did
>> for our studies
>> <https://datasketches.apache.org/docs/Quantiles/KLLSketch.html>
>> comparing our older quantiles sketch and the KLL sketch.
>>
>> ****
>> For the Java characterization tests, we have "standardized" on having
>> small configuration files which define the key parameters of the test.
>> These are simple text files
>> <https://github.com/apache/incubator-datasketches-characterization/tree/master/src/main/resources>
>> of key-value pairs.  We don't have any centralized definition of these
>> pairs, just that they are human readable and intelligible.  They are
>> different for each type of sketch.
>>
>> For the C++ tests, we don't have a collection of config files yet (this
>> is one of our TODOs), but the same kind of parameters are set in the code
>> itself.
>>
>> We will likely want to set up a separate directory for your
>> characterization tests.
>>
>> I hope you find this helpful.
>>
>> Cheers,
>>
>> Lee.
>>
>> On Sun, May 10, 2020 at 10:05 AM Michael Himes <mh...@knights.ucf.edu>
>> wrote:
>>
>> The code is in a good state now.  It can take individual values, lists,
>> or Numpy arrays as input, and it returns back Numpy arrays.  There are some
>> additional features, like being able to specify which sketches the user
>> wants to, e.g., get quantiles for.
>>
>> But, I have only done minor testing with uniform and normal
>> distributions.  I'd like to put it through more extensive testing (and some
>> documentation) before releasing it, and it sounds like your
>> characterization tests are the way to go -- it's not science if it's not
>> reproducible!  Is there a standard set of tests for this purpose?  If not,
>> are there standard tests that have been used for the existing codebase?
>>
>> Michael
>> ------------------------------
>> *From:* leerho <le...@gmail.com>
>> *Sent:* Saturday, May 9, 2020 7:21 PM
>> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> This is great.  The first step is to get your project working!  Once you
>> think you are ready, it would be really useful if you could do some
>> characterization testing in the NumPy environment. Characterization tests
>> are what we run to fully understand how a sketch performs over a range of
>> parameters and using thousands to millions of trials.  You can see some of
>> the accuracy and speed performance plots of various sketches on our
>> website.  Sometimes these can take hours to run.  We typically use
>> synthetic data to drive our characterization tests to make them
>> reproducible.
>>
>> Real data can also be used and one comparison test I would recommend is
>> comparing how long it takes to get approximate results using sketches
>> versus how long it would take to get exact results using brute force
>> methods.  The bigger the data set is the better :)
>>
>> We don't have much experience with NumPy so this will be a new
>> environment for us.  But before you get too deep into this please get us
>> involved.  We have been characterizing these streaming algorithms for a
>> number of years, and would like to help you.
>>
>> Cheers,
>>
>> Lee.
>>
>> On Sat, May 9, 2020 at 2:18 PM Michael Himes <mh...@knights.ucf.edu>
>> wrote:
>>
>> I'm not quite sure what being a committer entails, but yeah I'm happy to
>> contribute.  I can't commit a lot of time to working on it, but with how
>> things went for KLL I don't think it will take a lot of time for the other
>> sketches if they are formatted in a similar manner.  Getting this library
>> integrated into numpy/scipy would be awesome, I'm sure I could get some
>> others in my field to begin using it.
>>
>> Michael
>> ------------------------------
>> *From:* Lee Rhodes <lr...@verizonmedia.com>
>> *Sent:* Saturday, May 9, 2020 5:06 PM
>> *To:* Michael Himes <mh...@knights.ucf.edu>; dev@datasketches.apache.org
>> <de...@datasketches.apache.org>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> This is just awesome!   Would you be interested in becoming a committer
>> on our project?  It is not automatic, but we could work with you to bring
>> you up to speed on the other sketches in the library.  If you could help us
>> integrate DataSketches into NumPy and possibly SciPy (not sure if this is
>> necessary) it would be a very significant contribution and we would
>> definitely want you to be part of our community!
>>
>> Thanks,
>>
>> Lee.
>>
>> On Sat, May 9, 2020 at 1:41 PM Michael Himes <mh...@knights.ucf.edu>
>> wrote:
>>
>> Hi Lee,
>>
>> Thanks for the notice, I went ahead and subscribed to the list.
>>
>> As for Jon's email, this is actually what I have currently implemented!
>> Once I finish ironing out a couple improvements, I'm going to move some
>> code around to follow the existing coding style, put it on Github, and
>> submit a pull request.
>>
>> Michael
>> ------------------------------
>> *From:* Lee Rhodes <lr...@verizonmedia.com>
>> *Sent:* Saturday, May 9, 2020 4:22 PM
>> *To:* Michael Himes <mh...@knights.ucf.edu>
>> *Subject:* Fwd: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> Hi Michael,
>> I don't think you saw this email as I doubt you are subscribed to our
>> dev@datasketches.apache.org email list.
>>
>> We would like to have you as part of our larger community, as others
>> might also have suggestions on how to move your project forward.
>> You can subscribe by sending an empty email to
>> dev-subscribe@datasketches.apache.org.
>>
>> Lee.
>>
>> ---------- Forwarded message ---------
>> From: *Jon Malkin* <jo...@gmail.com>
>> Date: Thu, May 7, 2020 at 4:11 PM
>> Subject: Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>> To: <de...@datasketches.apache.org>
>> Cc: Lee Rhodes <lr...@verizonmedia.com>, Edo Liberty <
>> edo.liberty@gmail.com>, edo@edoliberty.com <ed...@edoliberty.com>
>>
>>
>> We're using pybind11 to get a C++ interface with python (vs raw C). The
>> wrappers themselves are quite thin, but they do have examples of calling
>> functions defined in the wrapper as opposed to only the sketch object.
>>
>> I believe the easiest way to do this will be to define a pretty simple
>> C++ object and create a pybind wrapper for it.  That object would contain a
>> std::vector<kll_sketch>.  Then you'd define an update method for your
>> custom object that iterates through a numpy array and calls update() on the
>> appropriate sketch. You'd also want to define something similar for
>> get_quantile() or whatever other methods you need that iterates through
>> that vector of sketches and returns the result in a numpy array.
>>
>> That's a pretty lightweight object. And then you'd use a similar thin
>> pybind wrapper around it to make it play nicely with python. Since our C++
>> library is just templates, you'd end up with a free-standing library, with
>> no requirement that the base datasketches library be involved.
>>
>>   jon
>>
>> On Thu, May 7, 2020 at 1:08 PM Michael Himes <mh...@knights.ucf.edu>
>> wrote:
>>
>> I would be happy to share whatever I come up with (if anything).  The
>> lack of a Numpy/Scipy implementation is what led me to the DataSketches
>> library, it would be very useful to myself and others if it were a part of
>> Numpy/Scipy.
>>
>> For what it's worth, passing in a Numpy array and manipulating it from
>> the C++ side is quite easy.  On the other hand, figuring out how to spawn m
>> sketches and pass the values along to that looks like it'll be more
>> challenging, there is a lot of code here and it'll take some time for me to
>> familiarize myself with it.
>>
>> Michael
>> ------------------------------
>> *From:* Lee Rhodes <lr...@verizonmedia.com>
>> *Sent:* Thursday, May 7, 2020 12:00 PM
>> *To:* Michael Himes <mh...@knights.ucf.edu>
>> *Cc:* Edo Liberty <ed...@gmail.com>; dev@datasketches.apache.org <
>> dev@datasketches.apache.org>; edo@edoliberty.com <ed...@edoliberty.com>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> If you do figure out how to do this, it would be great if you could share
>> it with us.  We would like to extend  it to other sketches and submit it as
>> an added functionality to NumPy.  I have been looking at the NumPy and
>> SciPy libraries and have not found anything close to what we have.
>>
>> Lee.
>>
>>
>> On Thu, May 7, 2020 at 7:08 AM Michael Himes <mh...@knights.ucf.edu>
>> wrote:
>>
>> Hi Lee, Jon,
>>
>> Thanks for the information.  I tried to vectorize things this morning and
>> ran into that exact problem -- since the offsets can differ, it leads to
>> slices of different lengths, which wouldn't be possible to store as a
>> single Numpy array.
>>
>> Lee, your understanding of my problem is spot on.  n vectors of size m,
>> where all m elements of each vector are a float (no NaNs or missing
>> values).  I am interested in quantiles at rank r for each of the m
>> streams.  Only 1 sketch will operate simultaneously, saving/loading the
>> sketch is not required (though it would be a nice feature), and sketches
>> would not need to be merged (no serialization/deserialization).
>>
>> Not surprisingly, it looks like your original suggestion of handling this
>> on the C++ side is the way to go.  Once I have time to dive into the code,
>> my plan is to write something that implements what you described in the
>> earlier email.
>>
>> Thanks,
>> Michael
>> ------------------------------
>> *From:* Lee Rhodes <lr...@verizonmedia.com>
>> *Sent:* Wednesday, May 6, 2020 10:43 PM
>> *To:* Michael Himes <mh...@knights.ucf.edu>
>> *Cc:* dev@datasketches.apache.org <de...@datasketches.apache.org>; Edo
>> Liberty <ed...@gmail.com>; edo@edoliberty.com <ed...@edoliberty.com>
>>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> Michael,
>>
>> One of my colleagues, Jon Malkin, pointed out that the vector-KLL will
>> not work for another reason and that is for each dimension, choosing
>> whether to delete the odd or even values in the compactor must be random
>> and independent of the other dimensions.  Otherwise you might get unwanted
>> correlation effects between the dimensions.
>>
>> This is another argument that you should have independent compactors for
>> each dimension.  So you might as well stick with individual sketches for
>> each dimension.
>>
>> Lee.
>>
>> On Wed, May 6, 2020 at 4:39 PM Lee Rhodes <lr...@verizonmedia.com>
>> wrote:
>>
>> Michael,
>>
>> Allow me to back up for a moment to make sure I understand your problem.
>>
>> You have a large number of large vectors of the form *V_n = {x_i}:*  *n*
>> vectors of size *m, *where *x* is a *number* and *x_i* is the *i*th
>> element, or equivalently, the *i*th dimension.
>>
>> Assumptions:
>>
>>    - All vectors, *V*, are of the same size *m.*
>>    - All elements, *x_i*, are valid numbers of the same type. No missing
>>    values, and if you are using *floats*, this means no *NaN*s.
>>
>> In aggregate, the *n* vectors represent *m* *independent* distributions
>> of values.
>>
>> Your task is to be able to obtain *m* quantiles at rank *r* in a single
>> query.
>>
>> ****
>> To do this, using your idea, would require vectorization of the entire
>> sketch and not just the compactors.  The inputs are vectors, the result of
>> operations such as getQuantile(r), getQuantileUpperBound(r),
>> getQuantileLowerBound(r), are also vectors.
>>
>> This sketch will be a large data structure, which leads to more questions
>> ...
>>
>>    - Do you anticipate having many of these vectorized sketches
>>    operating simultaneously?
>>    - Is there any requirement to store and later retrieve this sketch?
>>    - Or, the nearly equivalent question: Do you require merging of these
>>    sketches (across clusters, for example)?  Which also means serialization
>>    and deserialization.
>>
>> I am concerned that this vector-quantiles sketch would be limited in the
>> sense that it may not be as widely applicable as it could be.
>>
>> Our experience with real data is that it is ugly with missing values,
>> NaN, nulls, etc.  Which means we would not be able to vectorize the
>> compactor.  Each dimension *i* would need a separate independent
>> compactor because the compaction times will vary depending on missing
>> values or NaNs in the data.
>>
>> Spacewise, I don't think having separate independent sketches for each
>> dimension would be much smaller than vectorizing the entire sketch, because
>> the internals of the existing sketch are already quite space efficient
>> leveraging compact arrays, etc.
>>
>> As a first step I would favor figuring out how to access the NumPy data
>> structure on the C++ side, having individual sketches for each
>> dimension, and doing the iterations updating the sketches in C++.   It also
>> has the advantage of leveraging code that exists and it would automatically
>> be able to leverage any improvements to the sketch code over time.  In
>> addition, it could be a prototype of how to integrate other sketches into
>> the NumPy ecosystem.
>>
>> A fully vectorized sketch would be a separate implementation and would
>> not be able to take advantage of these points.
>>
>> Lee.

Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Posted by leerho <le...@gmail.com>.
Michael,  this has been a great interchange and certainly will allow us to
move forward more quickly.

Thank you for working on this on a Mother's Day Sunday!

I'm sure Alex and Jon may have more questions, when they get a chance to
look at it starting tomorrow.

Cheers, and be safe and well!

Lee.

On Sun, May 10, 2020 at 6:25 PM Michael Himes <mh...@knights.ucf.edu>
wrote:

> Re: testing, so far I've just done glorified unit tests for uniform and
> normal distributions of varying sizes.  I plan to do some timing tests vs
> the existing single-sketch Python class to see how it compares for 1, 10,
> and 100 streams.
>
> 1. That makes sense.  One option to allow full Numpy compatibility but
> without requiring a Python user to use Numpy would be to return everything
> as lists, rather than Numpy arrays.  Numpy users could then convert those
> lists into arrays, and non-Numpy users would be unaffected (aside from
> needing the pybind11/numpy.h header).  Alternatively, some flag could be
> set when instantiating the object that would control whether things are
> returned as lists or arrays, though this still requires the numpy.h header
> file.
>
> 2. I didn't change the kll_sketch code, I only defined a new (wrapper)
> class called kll_sketches, which spawns a user-specified number of
> sketches.  Each of those sketches is a kll_sketch object and uses all of
> the existing code for that.  For fast execution in Python, the parallel
> sketches must be spawned in C++, but the existing Python object could only
> spawn a single sketch since it wraps the kll_sketch class.  Perhaps the
> kll_sketches class would be better placed in the python/src/kll_wrapper.cpp
> file?  I suppose you wouldn't need this class if you weren't using Python.
>
> 3. Yes, SerDe is very straight-forward here.  I've marked some stuff as
> todo's, and that is one of them -- the plan is to do like you described and
> call the relevant kll_sketch method on each of the sketches and return that
> to Python in a sensible format.  For deserialization, it would just iterate
> through them and load them into the kll_sketches object.  I don't require
> it for my project, so I didn't bother to wrap that yet -- I'll take a look
> sometime this week after I finish my work for the day, shouldn't take long
> to do.
>
> 4. That makes sense.  Does using Numpy complicate that at all?  My thought
> is that since under the hood everything is using the existing kll_sketch
> class, it would have full compatibility with the rest of the library (once
> SerDe is added in).
>
> Michael
> ------------------------------
> *From:* leerho <le...@gmail.com>
> *Sent:* Sunday, May 10, 2020 8:42 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Thanks for the link to your code.  My colleagues, Jon and Alex, will take
> a closer look this next week.  They wrote this code so they are much closer
> to it than I.
>
> What you have done so far makes sense for you as you want to get this
> working in the NumPy environment as quickly as possible.  As soon as we
> start thinking about incorporating this into our library other concerns
> become important.
>
> 1. Adding API calls is the recommended way to add functionality (like
> NumPy) to a library.  We cannot change API calls in a way that is only
> useful with NumPy, because it would seriously impact other users of the
> library that don't need NumPy.  If both sets of calls cannot simultaneously
> exist in the same sketch API, then we need to consider other alternatives.
>
> 2.  Based on our previous discussions, I didn't envision that you would
> have to change the kll_sketch code itself other than perhaps a "wrapper"
> class that enables vectorized input to a vector of sketches and a
> vectorized get result that creates a vector result from a vector of
> sketches.  This would isolate the changes you need for NumPy from the
> sketch itself.  This is also much easier to support, maintain and debug.
>
> 3. If you don't change the internals of the sketch then SerDe becomes
> pretty straightforward. I don't know if you need a single serialization
> that represents a full vector of sketches,  but if you do, then I would
> just iterate over the individual serdes and figure out how to package it.
> I really don't think you want to have to rewrite this low-level stuff.
>
> 4. Binary compatibility is critically important for us and I think will be
> important for you as well.  There are two dimensions of binary
> compatibility: history and language.  This means that a kll sketch
> serialized from Java can be successfully read by C++ and vice versa.
> Similarly, a kll sketch serialized today will be able to be read many years
> from now.     Another aspect of this would mean being able to collect, say,
> 100 sketches that were not created using the NumPy version, and being able
> to put them together in a NumPy vector; and vice versa.
>
> I hope all of this makes sense to you.
>
> Cheers,
>
> Lee.
>
>
>
> On Sun, May 10, 2020 at 4:21 PM leerho <le...@gmail.com> wrote:
>
> Michael,
> This is great!  What testing have you been able to do so far?
>
>
> On Sun, May 10, 2020 at 3:31 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Lee,
>
> Thanks for all of that information, it's quite helpful to get a better
> understanding of things.
>
> I've put the code on Github if you'd like to take a look:
> https://github.com/mdhimes/incubator-datasketches-cpp
>
> Changes are
> - new class in kll/include/kll_sketch.hpp, w/ associated constructor in
> kll/include/kll_sketch_impl.hpp.  This class spawns a specified number of
> sketches.
> - new Python interface functions in python/src/kll_wrapper.cpp
>
> The only new dependency introduced is the pybind11/numpy.h header file.
> The new Numpy-compatible Python classes retain identical functionality to
> the existing classes (with minor changes to method names, e.g.,
> get_min_value --> get_min_values), except that I have not yet implemented
> merging or (de)serialization.  These would be straight-forward to
> implement, if needed.
>
> Re: characterization tests, I'll take a look at those tests you linked to
> and see about running them, time and compute resources permitting.
>
> Michael
> ------------------------------
> *From:* leerho <le...@gmail.com>
> *Sent:* Sunday, May 10, 2020 5:32 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Michael,
>
> Is there a place on GitHub somewhere where I could look at your code so
> far?  The reason I ask, is before you do a PR, we would like to determine
> where a contribution such as this should be placed.
>
> Our library is split up among different repositories, determined by
> language and dependencies.  This keeps the user downloads smaller and more
> focused.   We have two library repos for the core sketch algorithms, one
> for Java and one for C++/Python, where the dependencies are very lean,
> which simplifies integration into other systems.  We have separate repos
> for adaptors, which depend on one of the core repos. On the Java side, we
> have separate repos for adaptors for Apache Hive and Apache Pig, as the
> dependencies for each of these are quite large.  For C++, we have a
> dedicated repo for the adaptors for PostgreSQL.
>
> Some of our adaptors are hosted with the target system.  For example, our
> Druid adaptors were contributed directly into Apache Druid.
>
> I assume your code has dependencies on Python, NumPy and DataSketches-cpp.
> It is not clear to me at the moment whether we should create a separate
> repo for this or have a separate group of directories in our cpp repo.
>
> ****
> We have a separate repo for our characterization code, which is not
> formally "released" as an Apache release.  It exists because we want others
> to be able to reproduce (or challenge) our claims of speed performance or
> accuracy.  It is the one repo where we have all languages and many
> different dependencies.  The coding style is not as rigorous or as well
> documented as our repos that do have formal releases.
>
> Characterization testing is distinctly different from Unit Tests, which
> basically checks all the main code paths and makes sure that the program
> works as it should.  The key metric is code coverage and Unit Tests should
> be fast as it is run on every check-in of new code.  Characterization is
> also different from Integration Testing, which is testing how well the code
> works when integrated into larger systems.
>
> Characterization tests are unique to our kind of library. Because our
> algorithms are probabilistic in nature, in order to verify accuracy or
> speed performance we need to run many thousands of trials to eliminate
> statistical noise in the results.  And when the data is large, this can
> take a long time.  You can peruse our website for many examples as all the
> plots result from various characterization studies.  What appears on the
> website is but a small fraction of all the testing we have done.
>
> There are no "standard" tests as every sketch is different so we have to
> decide what is important to measure for a particular sketch, but the basic
> groups are *speed* and *accuracy*.
>
> For speed there are many possible measurements, but the basic ones are
> update speed, merge speed, Serialization / Deserialization speed, get
> estimate or get result speeds.
>
> For accuracy we want to validate that the sketch is performing within the
> bounds of the theoretical error distribution.  We want to measure this
> accuracy in the context of a stand-alone, purely streaming sketch and also
> in the context of merging many sketches together.
>
> We also try to do these same tests comparing the results against other
> alternatives users might have.  We have performed these same
> characterizations on other publicly available sketches as well as against
> traditional, brute-force approaches to solving the same problem.
>
> For the solution you have developed, we would depend on you to decide what
> properties would be most important to characterize for users of this
> solution.  It should be very similar to what you would write in a paper
> describing this solution;  you want to convince the reader that this is
> very useful and why.
>
> Since the first sketch you have leveraged is the KLL quantiles sketch, I
> would think you would want some characterizations similar to what we did
> for our studies
> <https://datasketches.apache.org/docs/Quantiles/KLLSketch.html>
> comparing our older quantiles sketch and the KLL sketch.
>
> ****
> For the Java characterization tests, we have "standardized" on having
> small configuration files which define the key parameters of the test.
> These are simple text files
> <https://github.com/apache/incubator-datasketches-characterization/tree/master/src/main/resources>
> of key-value pairs.  We don't have any centralized definition of these
> pairs, just that they are human readable and intelligible.  They are
> different for each type of sketch.
>
> For the C++ tests, we don't have a collection of config files yet (this is
> one of our TODOs), but the same kind of parameters are set in the code
> itself.
>
> We will likely want to set up a separate directory for your
> characterization tests.
>
> I hope you find this helpful.
>
> Cheers,
>
> Lee.
>
> On Sun, May 10, 2020 at 10:05 AM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> The code is in a good state now.  It can take individual values, lists, or
> Numpy arrays as input, and it returns back Numpy arrays.  There are some
> additional features, like being able to specify which sketches the user
> wants to, e.g., get quantiles for.
>
> But, I have only done minor testing with uniform and normal
> distributions.  I'd like to put it through more extensive testing (and some
> documentation) before releasing it, and it sounds like your
> characterization tests are the way to go -- it's not science if it's not
> reproducible!  Is there a standard set of tests for this purpose?  If not,
> are there standard tests that have been used for the existing codebase?
>
> Michael
> ------------------------------
> *From:* leerho <le...@gmail.com>
> *Sent:* Saturday, May 9, 2020 7:21 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> This is great.  The first step is to get your project working!  Once you
> think you are ready, it would be really useful if you could do some
> characterization testing in the NumPy environment. Characterization tests
> are what we run to fully understand how a sketch performs over a range of
> parameters and using thousands to millions of trials.  You can see some of
> the accuracy and speed performance plots of various sketches on our
> website.  Sometimes these can take hours to run.  We typically use
> synthetic data to drive our characterization tests to make them
> reproducible.
>
> Real data can also be used and one comparison test I would recommend is
> comparing how long it takes to get approximate results using sketches
> versus how long it would take to get exact results using brute force
> methods.  The bigger the data set is the better :)
>
> We don't have much experience with NumPy so this will be a new environment
> for us.  But before you get too deep into this please get us involved.  We
> have been characterizing these streaming algorithms for a number of years,
> and would like to help you.
>
> Cheers,
>
> Lee.
>
> On Sat, May 9, 2020 at 2:18 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> I'm not quite sure what being a committer entails, but yeah I'm happy to
> contribute.  I can't commit a lot of time to working on it, but with how
> things went for KLL I don't think it will take a lot of time for the other
> sketches if they are formatted in a similar manner.  Getting this library
> integrated into numpy/scipy would be awesome, I'm sure I could get some
> others in my field to begin using it.
>
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Saturday, May 9, 2020 5:06 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>; dev@datasketches.apache.org
> <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> This is just awesome!   Would you be interested in becoming a committer on
> our project?  It is not automatic, but we could work with you to bring you
> up to speed on the other sketches in the library.  If you could help us
> integrate DataSketches into NumPy and possibly SciPy (not sure if this is
> necessary) it would be a very significant contribution and we would
> definitely want you to be part of our community!
>
> Thanks,
>
> Lee.
>
> On Sat, May 9, 2020 at 1:41 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Lee,
>
> Thanks for the notice, I went ahead and subscribed to the list.
>
> As for Jon's email, this is actually what I have currently implemented!
> Once I finish ironing out a couple improvements, I'm going to move some
> code around to follow the existing coding style, put it on Github, and
> submit a pull request.
>
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Saturday, May 9, 2020 4:22 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>
> *Subject:* Fwd: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Hi Michael,
> I don't think you saw this email as I doubt you are subscribed to our
> dev@datasketches.apache.org email list.
>
> We would like to have you as part of our larger community, as others might
> also have suggestions on how to move your project forward.
> You can subscribe by sending an empty email to
> dev-subscribe@datasketches.apache.org.
>
> Lee.
>
> ---------- Forwarded message ---------
> From: *Jon Malkin* <jo...@gmail.com>
> Date: Thu, May 7, 2020 at 4:11 PM
> Subject: Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
> To: <de...@datasketches.apache.org>
> Cc: Lee Rhodes <lr...@verizonmedia.com>, Edo Liberty <
> edo.liberty@gmail.com>, edo@edoliberty.com <ed...@edoliberty.com>
>
>
> We're using pybind11 to get a C++ interface with python (vs raw C). The
> wrappers themselves are quite thin, but they do have examples of calling
> functions defined in the wrapper as opposed to only the sketch object.
>
> I believe the easiest way to do this will be to define a pretty simple C++
> object and create a pybind wrapper for it.  That object would contain a
> std::vector<kll_sketch>.  Then you'd define an update method for your
> custom object that iterates through a numpy array and calls update() on the
> appropriate sketch. You'd also want to define something similar for
> get_quantile() or whatever other methods you need that iterates through
> that vector of sketches and returns the result in a numpy array.
>
> That's a pretty lightweight object. And then you'd use a similar thin
> pybind wrapper around it to make it play nicely with python. Since our C++
> library is just templates, you'd end up with a free-standing library, with
> no requirement that the base datasketches library be involved.
>
>   jon
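As a rough illustration of the design Jon describes above (a C++ object holding a std::vector<kll_sketch>, with an update method that iterates over a numpy array), a minimal sketch might look like the following.  This is an assumption-laden example, not the library's actual API: the class name kll_sketches follows the naming Michael adopts later in the thread, while the method names, default k, and include path are illustrative.

// Minimal sketch of a vector-of-sketches wrapper (illustrative only).
#include <cstdint>
#include <vector>
#include <pybind11/pybind11.h>
#include <pybind11/numpy.h>
#include "kll_sketch.hpp"

namespace py = pybind11;
using datasketches::kll_sketch;

class kll_sketches {
 public:
  // One kll_sketch per vector element (dimension).
  explicit kll_sketches(uint32_t num_dims, uint16_t k = 200)
    : sketches_(num_dims, kll_sketch<float>(k)) {}

  // Feed one input vector: element i updates sketch i, looping entirely in C++.
  void update(py::array_t<float> values) {
    auto v = values.unchecked<1>();
    for (py::ssize_t i = 0; i < v.shape(0); ++i) {
      sketches_[i].update(v(i));
    }
  }

  // Return one quantile per dimension as a numpy array.
  py::array_t<float> get_quantiles(double rank) const {
    py::array_t<float> result(static_cast<py::ssize_t>(sketches_.size()));
    auto r = result.mutable_unchecked<1>();
    for (size_t i = 0; i < sketches_.size(); ++i) {
      r(static_cast<py::ssize_t>(i)) = sketches_[i].get_quantile(rank);
    }
    return result;
  }

 private:
  std::vector<kll_sketch<float>> sketches_;
};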
>
> On Thu, May 7, 2020 at 1:08 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> I would be happy to share whatever I come up with (if anything).  The lack
> of a Numpy/Scipy implementation is what led me to the DataSketches library,
> it would be very useful to myself and others if it were a part of
> Numpy/Scipy.
>
> For what it's worth, passing in a Numpy array and manipulating it from the
> C++ side is quite easy.  On the other hand, figuring out how to spawn m
> sketches and pass the values along to that looks like it'll be more
> challenging, there is a lot of code here and it'll take some time for me to
> familiarize myself with it.
>
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Thursday, May 7, 2020 12:00 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>
> *Cc:* Edo Liberty <ed...@gmail.com>; dev@datasketches.apache.org <
> dev@datasketches.apache.org>; edo@edoliberty.com <ed...@edoliberty.com>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> If you do figure out how to do this, it would be great if you could share
> it with us.  We would like to extend it to other sketches and submit it as
> an added functionality to NumPy.  I have been looking at the NumPy and
> SciPy libraries and have not found anything close to what we have.
>
> Lee.
>
>
> On Thu, May 7, 2020 at 7:08 AM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Lee, Jon,
>
> Thanks for the information.  I tried to vectorize things this morning and
> ran into that exact problem -- since the offsets can differ, it leads to
> slices of different lengths, which wouldn't be possible to store as a
> single Numpy array.
>
> Lee, your understanding of my problem is spot on.  n vectors of size m,
> where all m elements of each vector are a float (no NaNs or missing
> values).  I am interested in quantiles at rank r for each of the m
> streams.  Only 1 sketch will operate simultaneously, saving/loading the
> sketch is not required (though it would be a nice feature), and sketches
> would not need to be merged (no serialization/deserialization).
>
> Not surprisingly, it looks like your original suggestion of handling this
> on the C++ side is the way to go.  Once I have time to dive into the code,
> my plan is to write something that implements what you described in the
> earlier email.
>
> Thanks,
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Wednesday, May 6, 2020 10:43 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>
> *Cc:* dev@datasketches.apache.org <de...@datasketches.apache.org>; Edo
> Liberty <ed...@gmail.com>; edo@edoliberty.com <ed...@edoliberty.com>
>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Michael,
>
> One of my colleagues, Jon Malkin, pointed out that the vector-KLL will not
> work for another reason and that is for each dimension, choosing whether to
> delete the odd or even values in the compactor must be random and
> independent of the other dimensions.  Otherwise you might get unwanted
> correlation effects between the dimensions.
>
> This is another argument that you should have independent compactors for
> each dimension.  So you might as well stick with individual sketches for
> each dimension.
>
> Lee.
>
> On Wed, May 6, 2020 at 4:39 PM Lee Rhodes <lr...@verizonmedia.com>
> wrote:
>
> Michael,
>
> Allow me to back up for a moment to make sure I understand your problem.
>
> You have a large number of large vectors of the form *V_n = {x_i}:*  *n*
> vectors of size *m, *where *x* is a *number* and *x_i* is the *i*th
> element, or equivalently, the *i*th dimension.
>
> Assumptions:
>
>    - All vectors, *V*, are of the same size *m.*
>    - All elements, *x_i*, are valid numbers of the same type. No missing
>    values, and if you are using *floats*, this means no *NaN*s.
>
> In aggregate, the *n* vectors represent *m* *independent* distributions
> of values.
>
> Your task is to be able to obtain *m* quantiles at rank *r* in a single
> query.
>
> ****
> To do this, using your idea, would require vectorization of the entire
> sketch and not just the compactors.  The inputs are vectors, the result of
> operations such as getQuantile(r), getQuantileUpperBound(r),
> getQuantileLowerBound(r), are also vectors.
>
> This sketch will be a large data structure, which leads to more questions
> ...
>
>    - Do you anticipate having many of these vectorized sketches operating
>    simultaneously?
>    - Is there any requirement to store and later retrieve this sketch?
>    - Or, the nearly equivalent question: Do you require merging of these
>    sketches (across clusters, for example)?  Which also means serialization
>    and deserialization.
>
> I am concerned that this vector-quantiles sketch would be limited in the
> sense that it may not be as widely applicable as it could be.
>
> Our experience with real data is that it is ugly with missing values, NaN,
> nulls, etc.  Which means we would not be able to vectorize the compactor.
> Each dimension *i* would need a separate independent compactor because
> the compaction times will vary depending on missing values or NaNs in the
> data.
>
> Spacewise, I don't think having separate independent sketches for each
> dimension would be much smaller than vectorizing the entire sketch, because
> the internals of the existing sketch are already quite space efficient
> leveraging compact arrays, etc.
>
> As a first step I would favor figuring out how to access the NumPy data
> structure on the C++ side, having individual sketches for each
> dimension, and doing the iterations updating the sketches in C++.   It also
> has the advantage of leveraging code that exists and it would automatically
> be able to leverage any improvements to the sketch code over time.  In
> addition, it could be a prototype of how to integrate other sketches into
> the NumPy ecosystem.
>
> A fully vectorized sketch would be a separate implementation and would not
> be able to take advantage of these points.
>
> Lee.
>
> On Wed, May 6, 2020 at 2:47 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Lee,
>
> I don't think there is a problem with the DataSketches library, just that
> it doesn't support what I am trying to do -- looking in the documentation,
> it only supports streams of ints or floats, and those situations work fine
> for me.  Here's what I did:
> - began with the KLL test .py file:
> https://github.com/apache/incubator-datasketches-cpp/blob/master/python/tests/kll_test.py
> - replaced line 30 with kll.update(np.ones(10) * randn())  to have a Numpy
> array of 10 identical values.
> - ran the code
>
> This leads to the following error, as expected:
> TypeError: update(): incompatible function arguments. The following
> argument types are supported:
>     1. (self: datasketches.kll_floats_sketch, item: float) -> None
>
> Invoked with: <datasketches.kll_floats_sketch object at 0x7f1e128989d0>,
> array([-1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424,
>        -1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424])
>
> It's not coded to support Numpy arrays, therefore it complains.  What I
> would ideally like to have happen in this scenario is it would treat each
> element in the array as a separate stream.  Then, later when getting a
> given quantile, it would give 10 values, one for each stream.  I don't see
> an easy approach to implementing this on the Python side besides a very
> slow iterative approach, and admittedly my C++ is quite rusty so I haven't
> looked into the codebase to see how I might modify things there to support
> this functionality.
>
> Re: the streaming-quantiles code being easily modified, I believe the only
> necessary changes would be changing the Compactor class to be a subclass of
> numpy.ndarray, rather than list, and implementing methods for the
> list-specific methods that are used, like .append().  Then, it isn't
> necessary to loop over the streams since we can make use of Numpy's
> broadcasting, which will handle the looping in its C++ code, as you
> mentioned.  I'll work on this and see if it really is as straight-forward
> as it seems.
>
> If you have any advice on how to use DataSketches for my problem, I'm
> certainly open to that.
>
> Thanks,
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Wednesday, May 6, 2020 4:37 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>; dev@datasketches.apache.org
> <de...@datasketches.apache.org>
> *Cc:* Edo Liberty <ed...@gmail.com>; edo@edoliberty.com <
> edo@edoliberty.com>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Michael,
>
> Thank you for considering the DataSketches library.   I am adding this
> thread to our dev@datasketches.apache.org so that our whole team can
> contribute to finding a solution for you.
>
> WRT the error you experienced, please help us help you by sharing with us
> what the exact error was.
>
> We are about to release a major upgrade to the DataSketches C++/Python
> product in the next few weeks.  We have fixed a number of stability issues
> and bugs, which may solve the problem.  Nonetheless, we want to work with
> you to get your problem solved.
>
> Updating 1e5 sketches in a system is not a problem in Java or C++.   We
> have real-time systems today that generate and process over 1e9 sketches
> every day.  Unfortunately our experience tells us that looping in Python
> code will be 10 to 100 times slower than Java or C++.  This is because the
> code would have to switch from Python to C++ for every vector element.
>
> By comparison, the streaming-quantiles code could be easily modified to
> use Numpy arrays and operate on vectors.
>
>
> I would like to understand more about what you have in mind that would be
> "easily modified".
>
> NumPy achieves its speed performance by doing all of the matrix operations
> in pre-compiled C++ code.  To achieve best performance, we would want to
> read and loop through the NumPy data structure on the C++ side leveraging
> the C++ DataSketches library directly.  I am not sure what would be
> involved to actually accomplish that.
>
> But first we need to get your Python + NumPy code working correctly with
> our library so we can find out what its actual performance is.
>
> Cheers,
>
> Lee.
>
>
>
>
>
> On Wed, May 6, 2020 at 12:10 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Edo, Lee,
>
> Thanks for the prompt response.  I looked at the datasketches library, and
> while it seems to have a lot more features, it looks like it'll be a lot
> more difficult to get it to work for my desired use case.
>
> My problem is that I need quantiles for each element of a vector (length
> on the order of 1e4 -- 1e5), for some finite stream of vectors (on the
> order of 1e6 -- 1e8).  I tried using datasketches's KLL with Numpy arrays,
> but it throws an error, so it doesn't seem like datasketches handles this
> situation currently.
>
> To use datasketches, I think I would need to instantiate 1 object per
> vector element, and I suspect this will slow things down considerably due
> to iterating over the objects when each vector is processed.  By
> comparison, the streaming-quantiles code could be easily modified to use
> Numpy arrays and operate on vectors.  I ran a few unit tests on both codes
> and found equivalent behavior, as expected.
>
> Do you have any recommendation(s) for this situation?  Are there known
> limitations of the streaming-quantiles code that would cause issues for my
> use case?  Are the other methods offered in datasketches 'better' than the
> KLL implemented in streaming-quantiles?  I'm quite out of my area of
> expertise, so I appreciate any advice you can offer, and I will of course
> acknowledge it in the publication.
>
> Best,
> Michael
>
> ------------------------------
> *From:* Edo Liberty <ed...@gmail.com>
> *Sent:* Tuesday, May 5, 2020 8:09 PM
> *To:* Lee Rhodes <lr...@verizonmedia.com>; Michael Himes <
> mhimes@knights.ucf.edu>
> *Cc:* edo@edoliberty.com <ed...@edoliberty.com>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> +Lee
>
> Hi Michael, Thanks for reaching out.
> While you can certainly do that, I recommend using the python-Binded
> datasketches library. It will be more robust, faster, and bug free than my
> code :)
>
> On Tue, May 5, 2020 at 14:11 Michael Himes <mh...@knights.ucf.edu> wrote:
>
> Hi Edo,
>
> I'm currently working on a Python package for machine-learning-accelerated
> exoplanet modeling.  It is free and open source (see here if you're curious
> https://github.com/exosports/HOMER), and it's meant purely for
> reproducible academic research.
>
> I'm adding some new features to the software, and one of them requires
> computing quantiles for a data set that cannot fit into memory.  After
> searching around for different methods to do this, your KLL method seemed
> to be a good option in terms of speed and space requirements.
>
> Rather than reinvent the wheel and code my own implementation of the
> method from scratch, I was wondering if you'd be willing to allow me to use
> your code?  I don't see a license, so I wanted to make sure you're okay
> with this.  I could implement it as a submodule within my repo, or I could
> only include the kll.py file and add some additional comments pointing to
> your repo and such, whichever you prefer.
>
> Best,
> Michael
>
> --
> From my cell phone.
>
>

Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Posted by Michael Himes <mh...@knights.ucf.edu>.
Re: testing, so far I've just done glorified unit tests for uniform and normal distributions of varying sizes.  I plan to do some timing tests vs the existing single-sketch Python class to see how it compares for 1, 10, and 100 streams.

1. That makes sense.  One option to allow full Numpy compatibility without requiring a Python user to use Numpy would be to return everything as lists, rather than Numpy arrays.  Numpy users could then convert those lists into arrays, and non-Numpy users would be unaffected (aside from needing the pybind11/numpy.h header).  Alternatively, some flag could be set when instantiating the object that would control whether things are returned as lists or arrays, though this still requires the numpy.h header file.

2. I didn't change the kll_sketch code; I only defined a new (wrapper) class called kll_sketches, which spawns a user-specified number of sketches.  Each of those sketches is a kll_sketch object and uses all of the existing code for that.  For fast execution in Python, the parallel sketches must be spawned in C++, but the existing Python object can only spawn a single sketch since it wraps the kll_sketch class.  Perhaps the kll_sketches class would be better placed in the python/src/kll_wrapper.cpp file?  I suppose you wouldn't need this class if you weren't using Python.
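For illustration only, and continuing the assumption-laden kll_sketches sketch given earlier in this thread, the thin binding that might live in python/src/kll_wrapper.cpp could look roughly like this; the module name and keyword arguments are made up.

// Illustrative pybind11 binding for the hypothetical kll_sketches wrapper class.
PYBIND11_MODULE(kll_sketches_example, m) {
  py::class_<kll_sketches>(m, "kll_sketches")
    .def(py::init<uint32_t, uint16_t>(), py::arg("num_dims"), py::arg("k") = 200)
    .def("update", &kll_sketches::update, py::arg("values"))
    .def("get_quantiles", &kll_sketches::get_quantiles, py::arg("rank"));
}

From Python, usage would then be along the lines of s = kll_sketches(100000); s.update(vec); q = s.get_quantiles(0.5), with the per-element loop staying on the C++ side.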

3. Yes, SerDe is very straightforward here.  I've marked some items as TODOs, and that is one of them -- the plan is to do as you described: call the relevant kll_sketch method on each of the sketches and return that to Python in a sensible format.  For deserialization, it would just iterate through them and load them into the kll_sketches object.  I don't require it for my project, so I didn't bother to wrap that yet -- I'll take a look sometime this week after I finish my work for the day; it shouldn't take long to do.
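A non-authoritative sketch of what that SerDe iteration could look like, packaging each per-dimension sketch's existing binary image into a Python list of bytes objects; the helper names here are invented, and only kll_sketch's own serialize()/deserialize() are assumed.

// Illustrative helpers only; they assume a std::vector<kll_sketch<float>> as sketched above.
#include <string>
#include <vector>
#include <pybind11/pybind11.h>
#include "kll_sketch.hpp"

namespace py = pybind11;
using datasketches::kll_sketch;

// Serialize every per-dimension sketch into its own bytes object.
py::list serialize_all(const std::vector<kll_sketch<float>>& sketches) {
  py::list out;
  for (const auto& sk : sketches) {
    auto bytes = sk.serialize();  // each sketch keeps its standard binary format
    out.append(py::bytes(reinterpret_cast<const char*>(bytes.data()), bytes.size()));
  }
  return out;
}

// Rebuild the vector of sketches from a list of bytes objects.
std::vector<kll_sketch<float>> deserialize_all(const py::list& serialized) {
  std::vector<kll_sketch<float>> sketches;
  for (auto item : serialized) {
    std::string buf = item.cast<std::string>();  // py::bytes copies into a std::string
    sketches.push_back(kll_sketch<float>::deserialize(buf.data(), buf.size()));
  }
  return sketches;
}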

4. That makes sense.  Does using Numpy complicate that at all?  My thought is that since under the hood everything is using the existing kll_sketch class, it would have full compatibility with the rest of the library (once SerDe is added in).

Michael
________________________________
From: leerho <le...@gmail.com>
Sent: Sunday, May 10, 2020 8:42 PM
To: dev@datasketches.apache.org <de...@datasketches.apache.org>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Thanks for the link to your code.  My colleagues, Jon and Alex, will take a closer look this next week.  They wrote this code so they are much closer to it than I.

What you have done so far makes sense for you as you want to get this working in the NumPy environment as quickly as possible.  As soon as we start thinking about incorporating this into our library other concerns become important.

1. Adding API calls is the recommended way to add functionality (like NumPy) to a library.  We cannot change API calls in a way that is only useful with NumPy, because it would seriously impact other users of the library that don't need NumPy.  If both sets of calls cannot simultaneously exist in the same sketch API, then we need to consider other alternatives.

2.  Based on our previous discussions, I didn't envision that you would have to change the kll_sketch code itself other than perhaps a "wrapper" class that enables vectorized input to a vector of sketches and a vectorized get result that creates a vector result from a vector of sketches.  This would isolate the changes you need for NumPy from the sketch itself.  This is also much easier to support, maintain and debug.

3. If you don't change the internals of the sketch then SerDe becomes pretty straightforward. I don't know if you need a single serialization that represents a full vector of sketches,  but if you do, then I would just iterate over the individual serdes and figure out how to package it.  I really don't think you want to have to rewrite this low-level stuff.

4. Binary compatibility is critically important for us and I think will be important for you as well.  There are two dimensions of binary compatibility: history and language.  This means that a kll sketch serialized from Java can be successfully read by C++ and vice versa.  Similarly, a kll sketch serialized today will be able to be read many years from now.  Another aspect of this would mean being able to collect, say, 100 sketches that were not created using the NumPy version, and being able to put them together in a NumPy vector; and vice versa.

I hope all of this makes sense to you.

Cheers,

Lee.



On Sun, May 10, 2020 at 4:21 PM leerho <le...@gmail.com>> wrote:
Michael,
This is great!  What testing have you been able to do so far?


On Sun, May 10, 2020 at 3:31 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Lee,

Thanks for all of that information, it's quite helpful to get a better understanding of things.

I've put the code on Github if you'd like to take a look: https://github.com/mdhimes/incubator-datasketches-cpp

Changes are
- new class in kll/include/kll_sketch.hpp, w/ associated constructor in kll/include/kll_sketch_impl.hpp.  This class spawns a specified number of sketches.
- new Python interface functions in python/src/kll_wrapper.cpp

The only new dependency introduced is the pybind11/numpy.h header file.  The new Numpy-compatible Python classes retain identical functionality to the existing classes (with minor changes to method names, e.g., get_min_value --> get_min_values), except that I have not yet implemented merging or (de)serialization.  These would be straight-forward to implement, if needed.

Re: characterization tests, I'll take a look at those tests you linked to and see about running them, time and compute resources permitting.

Michael
________________________________
From: leerho <le...@gmail.com>>
Sent: Sunday, May 10, 2020 5:32 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Michael,

Is there a place on GitHub somewhere where I could look at your code so far?  The reason I ask, is before you do a PR, we would like to determine where a contribution such as this should be placed.

Our library is split up among different repositories, determined by language and dependencies.  This keeps the user downloads smaller and more focused.   We have two library repos for the core sketch algorithms, one for Java and one for C++/Python, where the dependencies are very lean, which simplifies integration into other systems.  We have separate repos for adaptors, which depend on one of the core repos. On the Java side, we have separate repos for adaptors for Apache Hive and Apache Pig, as the dependencies for each of these are quite large.  For C++, we have a dedicated repo for the adaptors for PostgreSQL.

Some of our adaptors are hosted with the target system.  For example, our Druid adaptors were contributed directly into Apache Druid.

I assume your code has dependencies on Python, NumPy and DataSketches-cpp. It is not clear to me at the moment whether we should create a separate repo for this or have a separate group of directories in our cpp repo.

****
We have a separate repo for our characterization code, which is not formally "released" as an Apache release.  It exists because we want others to be able to reproduce (or challenge) our claims of speed performance or accuracy.  It is the one repo where we have all languages and many different dependencies.  The coding style is not as rigorous or as well documented as our repos that do have formal releases.

Characterization testing is distinctly different from Unit Tests, which basically check all the main code paths and make sure that the program works as it should.  The key metric is code coverage, and Unit Tests should be fast as they are run on every check-in of new code.  Characterization is also different from Integration Testing, which is testing how well the code works when integrated into larger systems.

Characterization tests are unique to our kind of library. Because our algorithms are probabilistic in nature, in order to verify accuracy or speed performance we need to run many thousands of trials to eliminate statistical noise in the results.  And when the data is large, this can take a long time.  You can peruse our website for many examples as all the plots result from various characterization studies.  What appears on the website is but a small fraction of all the testing we have done.

There are no "standard" tests as every sketch is different so we have to decide what is important to measure for a particular sketch, but the basic groups are speed and accuracy.

For speed there are many possible measurements, but the basic ones are update speed, merge speed, Serialization / Deserialization speed, get estimate or get result speeds.

For accuracy we want to validate that the sketch is performing within the bounds of the theoretical error distribution.  We want to measure this accuracy in the context of a stand-alone, purely streaming sketch and also in the context of merging many sketches together.

We also try to do these same tests comparing the results against other alternatives users might have.  We have performed these same characterizations on other publicly available sketches as well as against traditional, brute-force approaches to solving the same problem.

For the solution you have developed, we would depend on you to decide what properties would be most important to characterize for users of this solution.  It should be very similar to what you would write in a paper describing this solution;  you want to convince the reader that this is very useful and why.

Since the first sketch you have leveraged is the KLL quantiles sketch, I would think you would want some characterizations similar to what we did for our studies <https://datasketches.apache.org/docs/Quantiles/KLLSketch.html> comparing our older quantiles sketch and the KLL sketch.

****
For the Java characterization tests, we have "standardized" on having small configuration files which define the key parameters of the test.  These are simple text files <https://github.com/apache/incubator-datasketches-characterization/tree/master/src/main/resources> of key-value pairs.  We don't have any centralized definition of these pairs, just that they are human readable and intelligible.  They are different for each type of sketch.
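Purely as a made-up illustration -- the actual keys differ for every sketch and are not spelled out in this thread -- such a key-value file might look like:

trials=10000
k=200
streamLength=1000000
distribution=uniform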

For the C++ tests, we don't have a collection of config files yet (this is one of our TODOs), but the same kind of parameters are set in the code itself.

We will likely want to set up a separate directory for your characterization tests.

I hope you find this helpful.

Cheers,

Lee.

On Sun, May 10, 2020 at 10:05 AM Michael Himes <mh...@knights.ucf.edu>> wrote:
The code is in a good state now.  It can take individual values, lists, or Numpy arrays as input, and it returns back Numpy arrays.  There are some additional features, like being able to specify which sketches the user wants to, e.g., get quantiles for.

But, I have only done minor testing with uniform and normal distributions.  I'd like to put it through more extensive testing (and some documentation) before releasing it, and it sounds like your characterization tests are the way to go -- it's not science if it's not reproducible!  Is there a standard set of tests for this purpose?  If not, are there standard tests that have been used for the existing codebase?

Michael
________________________________
From: leerho <le...@gmail.com>>
Sent: Saturday, May 9, 2020 7:21 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

This is great.  The first step is to get your project working!  Once you think you are ready, it would be really useful if you could do some characterization testing in the NumPy environment. Characterization tests are what we run to fully understand how a sketch performs over a range of parameters and using thousands to millions of trials.  You can see some of the accuracy and speed performance plots of various sketches on our website.  Sometimes these can take hours to run.  We typically use synthetic data to drive our characterization tests to make them reproducible.

Real data can also be used and one comparison test I would recommend is comparing how long it takes to get approximate results using sketches versus how long it would take to get exact results using brute force methods.  The bigger the data set is the better :)

We don't have much experience with NumPy so this will be a new environment for us.  But before you get too deep into this please get us involved.  We have been characterizing these streaming algorithms for a number of years, and would like to help you.

Cheers,

Lee.

On Sat, May 9, 2020 at 2:18 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
I'm not quite sure what being a committer entails, but yeah I'm happy to contribute.  I can't commit a lot of time to working on it, but with how things went for KLL I don't think it will take a lot of time for the other sketches if they are formatted in a similar manner.  Getting this library integrated into numpy/scipy would be awesome, I'm sure I could get some others in my field to begin using it.

Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Saturday, May 9, 2020 5:06 PM
To: Michael Himes <mh...@knights.ucf.edu>>; dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

This is just awesome!   Would you be interested in becoming a committer on our project?  It is not automatic, but we could work with you to bring you up to speed on the other sketches in the library.  If you could help us integrate DataSketches into NumPy and possibly SciPy (not sure if this is necessary) it would be a very significant contribution and we would definitely want you to be part of our community!

Thanks,

Lee.

On Sat, May 9, 2020 at 1:41 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Lee,

Thanks for the notice, I went ahead and subscribed to the list.

As for Jon's email, this is actually what I have currently implemented!  Once I finish ironing out a couple improvements, I'm going to move some code around to follow the existing coding style, put it on Github, and submit a pull request.

Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Saturday, May 9, 2020 4:22 PM
To: Michael Himes <mh...@knights.ucf.edu>>
Subject: Fwd: Permission to use KLL streaming-quantiles code in free open-source academic software

Hi Michael,
I don't think you saw this email as I doubt you are subscribed to our dev@datasketches.apache.org<ma...@datasketches.apache.org> email list.

We would like to have you as part of our larger community, as others might also have suggestions on how to move your project forward.
You can subscribe by sending an empty email to dev-subscribe@datasketches.apache.org<ma...@datasketches.apache.org>.

Lee.

---------- Forwarded message ---------
From: Jon Malkin <jo...@gmail.com>>
Date: Thu, May 7, 2020 at 4:11 PM
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software
To: <de...@datasketches.apache.org>>
Cc: Lee Rhodes <lr...@verizonmedia.com>>, Edo Liberty <ed...@gmail.com>>, edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>


We're using pybind11 to get a C++ interface with python (vs raw C). The wrappers themselves are quite thin, but they do have examples of calling functions defined in the wrapper as opposed to only the sketch object.

I believe the easiest way to do this will be to define a pretty simple C++ object and create a pybind wrapper for it.  That object would contain a std::vector<kll_sketch>.  Then you'd define an update method for your custom object that iterates through a numpy array and calls update() on the appropriate sketch. You'd also want to define something similar for get_quantile() or whatever other methods you need that iterates through that vector of sketches and returns the result in a numpy array.

That's a pretty lightweight object. And then you'd use a similar thin pybind wrapper around it to make it play nicely with python. Since our C++ library is just templates, you'd end up with a free-standing library, with no requirement that the base datasketches library be involved.

  jon

On Thu, May 7, 2020 at 1:08 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
I would be happy to share whatever I come up with (if anything).  The lack of a Numpy/Scipy implementation is what led me to the DataSketches library, it would be very useful to myself and others if it were a part of Numpy/Scipy.

For what it's worth, passing in a Numpy array and manipulating it from the C++ side is quite easy.  On the other hand, figuring out how to spawn m sketches and pass the values along to that looks like it'll be more challenging, there is a lot of code here and it'll take some time for me to familiarize myself with it.

Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Thursday, May 7, 2020 12:00 PM
To: Michael Himes <mh...@knights.ucf.edu>>
Cc: Edo Liberty <ed...@gmail.com>>; dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>; edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

If you do figure out how to do this, it would be great if you could share it with us.  We would like to extend it to other sketches and submit it as an added functionality to NumPy.  I have been looking at the NumPy and SciPy libraries and have not found anything close to what we have.

Lee.


On Thu, May 7, 2020 at 7:08 AM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Lee, Jon,

Thanks for the information.  I tried to vectorize things this morning and ran into that exact problem -- since the offsets can differ, it leads to slices of different lengths, which wouldn't be possible to store as a single Numpy array.

Lee, your understanding of my problem is spot on.  n vectors of size m, where all m elements of each vector are a float (no NaNs or missing values).  I am interested in quantiles at rank r for each of the m streams.  Only 1 sketch will operate simultaneously, saving/loading the sketch is not required (though it would be a nice feature), and sketches would not need to be merged (no serialization/deserialization).

Not surprisingly, it looks like your original suggestion of handling this on the C++ side is the way to go.  Once I have time to dive into the code, my plan is to write something that implements what you described in the earlier email.

Thanks,
Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Wednesday, May 6, 2020 10:43 PM
To: Michael Himes <mh...@knights.ucf.edu>>
Cc: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>; Edo Liberty <ed...@gmail.com>>; edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>

Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Michael,

One of my colleagues, Jon Malkin, pointed out that the vector-KLL will not work for another reason and that is for each dimension, choosing whether to delete the odd or even values in the compactor must be random and independent of the other dimensions.  Otherwise you might get unwanted correlation effects between the dimensions.

This is another argument that you should have independent compactors for each dimension.  So you might as well stick with individual sketches for each dimension.

Lee.

On Wed, May 6, 2020 at 4:39 PM Lee Rhodes <lr...@verizonmedia.com>> wrote:
Michael,

Allow me to back up for a moment to make sure I understand your problem.

You have a large number of large vectors of the form V_n = {x_i}:  n vectors of size m, where x is a number and x_i is the ith element, or equivalently, the ith dimension.

Assumptions:

  *   All vectors, V, are of the same size m.
  *   All elements, x_i, are valid numbers of the same type. No missing values, and if you are using floats, this means no NaNs.

In aggregate, the n vectors represent m independent distributions of values.

Your task is to be able to obtain m quantiles at rank r in a single query.

****
To do this, using your idea, would require vectorization of the entire sketch and not just the compactors.  The inputs are vectors, the result of operations such as getQuantile(r), getQuantileUpperBound(r), getQuantileLowerBound(r), are also vectors.

This sketch will be a large data structure, which leads to more questions ...

  *   Do you anticipate having many of these vectorized sketches operating simultaneously?
  *   Is there any requirement to store and later retrieve this sketch?
  *   Or, the nearly equivalent question: Do you require merging of these sketches (across clusters, for example)?  Which also means serialization and deserialization.

I am concerned that this vector-quantiles sketch would be limited in the sense that it may not be as widely applicable as it could be.

Our experience with real data is that it is ugly with missing values, NaN, nulls, etc.  Which means we would not be able to vectorize the compactor.  Each dimension i would need a separate independent compactor because the compaction times will vary depending on missing values or NaNs in the data.

Spacewise, I don't think having separate independent sketches for each dimension would be much smaller than vectorizing the entire sketch, because the internals of the existing sketch are already quite space efficient leveraging compact arrays, etc.

As a first step I would favor figuring out how to access the NumPy data structure on the C++ side, having individual sketches for each dimension, and doing the iterations updating the sketches in C++.   It also has the advantage of leveraging code that exists and it would automatically be able to leverage any improvements to the sketch code over time.  In addition, it could be a prototype of how to integrate other sketches into the NumPy ecosystem.

A fully vectorized sketch would be a separate implementation and would not be able to take advantage of these points.

Lee.

On Wed, May 6, 2020 at 2:47 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Lee,

I don't think there is a problem with the DataSketches library, just that it doesn't support what I am trying to do -- looking in the documentation, it only supports streams of ints or floats, and those situations work fine for me.  Here's what I did:
- began with the KLL test .py file: https://github.com/apache/incubator-datasketches-cpp/blob/master/python/tests/kll_test.py
- replaced line 30 with kll.update(np.ones(10) * randn())  to have a Numpy array of 10 identical values.
- ran the code

This leads to the following error, as expected:
TypeError: update(): incompatible function arguments. The following argument types are supported:
    1. (self: datasketches.kll_floats_sketch, item: float) -> None

Invoked with: <datasketches.kll_floats_sketch object at 0x7f1e128989d0>, array([-1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424,
       -1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424])

It's not coded to support Numpy arrays, therefore it complains.  What I would ideally like to have happen in this scenario is it would treat each element in the array as a separate stream.  Then, later when getting a given quantile, it would give 10 values, one for each stream.  I don't see an easy approach to implementing this on the Python side besides a very slow iterative approach, and admittedly my C++ is quite rusty so I haven't looked into the codebase to see how I might modify things there to support this functionality.

Re: the streaming-quantiles code being easily modified, I believe the only necessary changes would be changing the Compactor class to be a subclass of numpy.ndarray, rather than list, and implementing methods for the list-specific methods that are used, like .append().  Then, it isn't necessary to loop over the streams since we can make use of Numpy's broadcasting, which will handle the looping in its C++ code, as you mentioned.  I'll work on this and see if it really is as straight-forward as it seems.

If you have any advice on how to use DataSketches for my problem, I'm certainly open to that.

Thanks,
Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Wednesday, May 6, 2020 4:37 PM
To: Michael Himes <mh...@knights.ucf.edu>>; dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Cc: Edo Liberty <ed...@gmail.com>>; edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Michael,

Thank you for considering the DataSketches library.   I am adding this thread to our dev@datasketches.apache.org<ma...@datasketches.apache.org> so that our whole team can contribute to finding a solution for you.

WRT the error you experienced, please help us help you by sharing with us what the exact error was.

We are about to release a major upgrade to the DataSketches C++/Python product in the next few weeks.  We have fixed a number of stability issues and bugs, which may solve the problem.  Nonetheless, we want to work with you to get your problem solved.

Updating 1e5 sketches in a system is not a problem in Java or C++.   We have real-time systems today that generate and process over 1e9 sketches every day.  Unfortunately our experience tells us that looping in Python code will be 10 to 100 times slower than Java or C++.  This is because the code would have to switch from Python to C++ for every vector element.

By comparison, the streaming-quantiles code could be easily modified to use Numpy arrays and operate on vectors.

I would like to understand more about what you have in mind that would be "easily modified".

NumPy achieves its speed performance by doing all of the matrix operations in pre-compiled C++ code.  To achieve best performance, we would want to read and loop through the NumPy data structure on the C++ side leveraging the C++ DataSketches library directly.  I am not sure what would be involved to actually accomplish that.

But first we need to get your Python + NumPy code working correctly with our library so we can find out what its actual performance is.

Cheers,

Lee.





On Wed, May 6, 2020 at 12:10 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Edo, Lee,

Thanks for the prompt response.  I looked at the datasketches library, and while it seems to have a lot more features, it looks like it'll be a lot more difficult to get it to work for my desired use case.

My problem is that I need quantiles for each element of a vector (length on the order of 1e4 -- 1e5), for some finite stream of vectors (on the order of 1e6 -- 1e8).  I tried using datasketches's KLL with Numpy arrays, but it throws an error, so it doesn't seem like datasketches handles this situation currently.

To use datasketches, I think I would need to instantiate 1 object per vector element, and I suspect this will slow things down considerably due to iterating over the objects when each vector is processed.  By comparison, the streaming-quantiles code could be easily modified to use Numpy arrays and operate on vectors.  I ran a few unit tests on both codes and found equivalent behavior, as expected.

Do you have any recommendation(s) for this situation?  Are there known limitations of the streaming-quantiles code that would cause issues for my use case?  Are the other methods offered in datasketches 'better' than the KLL implemented in streaming-quantiles?  I'm quite out of my area of expertise, so I appreciate any advice you can offer, and I will of course acknowledge it in the publication.

Best,
Michael

________________________________
From: Edo Liberty <ed...@gmail.com>>
Sent: Tuesday, May 5, 2020 8:09 PM
To: Lee Rhodes <lr...@verizonmedia.com>>; Michael Himes <mh...@knights.ucf.edu>>
Cc: edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

+Lee

Hi Michael, Thanks for reaching out.
While you can certainly do that, I recommend using the python-Binded datasketches library. It will be more robust, faster, and bug free than my code :)

On Tue, May 5, 2020 at 14:11 Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Edo,

I'm currently working on a Python package for machine-learning-accelerated exoplanet modeling.  It is free and open source (see here if you're curious https://github.com/exosports/HOMER), and it's meant purely for reproducible academic research.

I'm adding some new features to the software, and one of them requires computing quantiles for a data set that cannot fit into memory.  After searching around for different methods to do this, your KLL method seemed to be a good option in terms of speed and space requirements.

Rather than reinvent the wheel and code my own implementation of the method from scratch, I was wondering if you'd be willing to allow me to use your code?  I don't see a license, so I wanted to make sure you're okay with this.  I could implement it as a submodule within my repo, or I could only include the kll.py file and add some additional comments pointing to your repo and such, whichever you prefer.

Best,
Michael
--
From my cell phone.

Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Posted by leerho <le...@gmail.com>.
Thanks for the link to your code.  My colleagues, Jon and Alex, will take a
closer look this next week.  They wrote this code so they are much closer
to it than I.

What you have done so far makes sense for you as you want to get this
working in the NumPy environment as quickly as possible.  As soon as we
start thinking about incorporating this into our library other concerns
become important.

1. Adding API calls is the recommended way to add functionality (like
NumPy) to a library.  We cannot change API calls in a way that is only
useful with NumPy, because it would seriously impact other users of the
library that don't need NumPy.  If both sets of calls cannot simultaneously
exist in the same sketch API, then we need to consider other alternatives.

2.  Based on our previous discussions, I didn't envision that you would
have to change the kll_sketch code itself other than perhaps a "wrapper"
class that enables vectorized input to a vector of sketches and a
vectorized get result that creates a vector result from a vector of
sketches.  This would isolate the changes you need for NumPy from the
sketch itself.  This is also much easier to support, maintain and debug.

3. If you don't change the internals of the sketch then SerDe becomes
pretty straightforward. I don't know if you need a single serialization
that represents a full vector of sketches,  but if you do, then I would
just iterate over the individual serdes and figure out how to package it.
I really don't think you want to have to rewrite this low-level stuff.

4. Binary compatibility is critically important for us and I think will be
important for you as well.  There are two dimensions of binary
compatibility: history and language.  This means that a kll sketch
serialized from Java can be successfully read by C++ and vice versa.
Similarly, a kll sketch serialized today will be able to be read many years
from now.  Another aspect of this would mean being able to collect, say,
100 sketches that were not created using the NumPy version, and being able
to put them together in a NumPy vector; and vice versa.

I hope all of this makes sense to you.

Cheers,

Lee.



On Sun, May 10, 2020 at 4:21 PM leerho <le...@gmail.com> wrote:

> Michael,
> This is great!  What testing have you been able to do so far?
>
>
> On Sun, May 10, 2020 at 3:31 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
>> Lee,
>>
>> Thanks for all of that information, it's quite helpful to get a better
>> understanding of things.
>>
>> I've put the code on Github if you'd like to take a look:
>> https://github.com/mdhimes/incubator-datasketches-cpp
>>
>> Changes are
>> - new class in kll/include/kll_sketch.hpp, w/ associated constructor in
>> kll/include/kll_sketch_impl.hpp.  This class spawns a specified number of
>> sketches.
>> - new Python interface functions in python/src/kll_wrapper.cpp
>>
>> The only new dependency introduced is the pybind11/numpy.h header file.
>> The new Numpy-compatible Python classes retain identical functionality to
>> the existing classes (with minor changes to method names, e.g.,
>> get_min_value --> get_min_values), except that I have not yet implemented
>> merging or (de)serialization.  These would be straight-forward to
>> implement, if needed.
>>
>> Re: characterization tests, I'll take a look at those tests you linked to
>> and see about running them, time and compute resources permitting.
>>
>> Michael
>> ------------------------------
>> *From:* leerho <le...@gmail.com>
>> *Sent:* Sunday, May 10, 2020 5:32 PM
>> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> Michael,
>>
>> Is there a place on GitHub somewhere where I could look at your code so
>> far?  The reason I ask, is before you do a PR, we would like to determine
>> where a contribution such as this should be placed.
>>
>> Our library is split up among different repositories, determined by
>> language and dependencies.  This keeps the user downloads smaller and more
>> focused.   We have two library repos for the core sketch algorithms, one
>> for Java and one for C++/Python, where the dependencies are very lean,
>> which simplifies integration into other systems.  We have separate repos
>> for adaptors, which depend on one of the core repos. On the Java side, we
>> have separate repos for adaptors for Apache Hive and Apache Pig, as the
>> dependencies for each of these are quite large.  For C++, we have a
>> dedicated repo for the adaptors for PostgreSQL.
>>
>> Some of our adaptors are hosted with the target system.  For example, our
>> Druid adaptors were contributed directly into Apache Druid.
>>
>> I assume your code has dependencies on Python, NumPy and
>> DataSketches-cpp. It is not clear to me at the moment whether we should
>> create a separate repo for this or have a separate group of directories in
>> our cpp repo.
>>
>> ****
>> We have a separate repo for our characterization code, which is not
>> formally "released" as an Apache release.  It exists because we want others
>> to be able to reproduce (or challenge) our claims of speed performance or
>> accuracy.  It is the one repo where we have all languages and many
>> different dependencies.  The coding style is not as rigorous or as well
>> documented as our repos that do have formal releases.
>>
>> Characterization testing is distinctly different from Unit Tests, which
>> basically check all the main code paths and make sure that the program
>> works as it should.  The key metric is code coverage and Unit Tests should
>> be fast as they are run on every check-in of new code.  Characterization is
>> also different from Integration Testing, which is testing how well the code
>> works when integrated into larger systems.
>>
>> Characterization tests are unique to our kind of library. Because our
>> algorithms are probabilistic in nature, in order to verify accuracy or
>> speed performance we need to run many thousands of trials to eliminate
>> statistical noise in the results.  And when the data is large, this can
>> take a long time.  You can peruse our website for many examples as all the
>> plots result from various characterization studies.  What appears on the
>> website is but a small fraction of all the testing we have done.
>>
>> There are no "standard" tests as every sketch is different so we have to
>> decide what is important to measure for a particular sketch, but the basic
>> groups are *speed* and *accuracy*.
>>
>> For speed there are many possible measurements, but the basic ones are
>> update speed, merge speed, Serialization / Deserialization speed, get
>> estimate or get result speeds.
>>
>> For accuracy we want to validate that the sketch is performing within the
>> bounds of the theoretical error distribution.  We want to measure this
>> accuracy in the context of a stand-alone, purely streaming sketch and also
>> in the context of merging many sketches together.
>>
>> We also try to do these same tests comparing the results against other
>> alternatives users might have.  We have performed these same
>> characterizations on other publicly available sketches as well as against
>> traditional, brute-force approaches to solving the same problem.
>>
>> For the solution you have developed, we would depend on you to decide
>> what properties would be most important to characterize for users of this
>> solution.  It should be very similar to what you would write in a paper
>> describing this solution;  you want to convince the reader that this is
>> very useful and why.
>>
>> Since the first sketch you have leveraged is the KLL quantiles sketch, I
>> would think you would want some characterizations similar to what we did
>> for our studies <https://datasketches.apache.org/docs/Quantiles/KLLSketch.html>
>> comparing our older quantiles sketch and the KLL sketch.
>>
>> ****
>> For the Java characterization tests, we have "standardized" on having
>> small configuration files which define the key parameters of the test.
>> These are simple text files
>> <https://github.com/apache/incubator-datasketches-characterization/tree/master/src/main/resources>
>> of key-value pairs.  We don't have any centralized definition of these
>> pairs, just that they are human readable and intelligible.  They are
>> different for each type of sketch.
>>
>> For the C++ tests, we don't have a collection of config files yet (this
>> is one of our TODOs), but the same kind of parameters are set in the code
>> itself.
>>
>> We will likely want to set up a separate directory for your
>> characterization tests.
>>
>> I hope you find this helpful.
>>
>> Cheers,
>>
>> Lee.
>>
>> On Sun, May 10, 2020 at 10:05 AM Michael Himes <mh...@knights.ucf.edu>
>> wrote:
>>
>> The code is in a good state now.  It can take individual values, lists,
>> or Numpy arrays as input, and it returns back Numpy arrays.  There are some
>> additional features, like being able to specify which sketches the user
>> wants to, e.g., get quantiles for.
>>
>> But, I have only done minor testing with uniform and normal
>> distributions.  I'd like to put it through more extensive testing (and some
>> documentation) before releasing it, and it sounds like your
>> characterization tests are the way to go -- it's not science if it's not
>> reproducible!  Is there a standard set of tests for this purpose?  If not,
>> are there standard tests that have been used for the existing codebase?
>>
>> Michael
>> ------------------------------
>> *From:* leerho <le...@gmail.com>
>> *Sent:* Saturday, May 9, 2020 7:21 PM
>> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> This is great.  The first step is to get your project working!  Once you
>> think you are ready, it would be really useful if you could do some
>> characterization testing in the NumPy environment. Characterization tests
>> are what we run to fully understand how a sketch performs over a range of
>> parameters and using thousands to millions of trials.  You can see some of
>> the accuracy and speed performance plots of various sketches on our
>> website.  Sometimes these can take hours to run.  We typically use
>> synthetic data to drive our characterization tests to make them
>> reproducible.
>>
>> Real data can also be used and one comparison test I would recommend is
>> comparing how long it takes to get approximate results using sketches
>> versus how long it would take to get exact results using brute force
>> methods.  The bigger the data set is the better :)
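>>
>> A minimal version of that comparison, in Python and with made-up sizes, might
>> look like the following; note that the per-element Python loop is exactly the
>> slow path discussed elsewhere in this thread, so it understates what a
>> C++-side loop could do:
>>
>>     import time
>>     import numpy as np
>>     from datasketches import kll_floats_sketch
>>
>>     data = np.random.rand(10_000_000).astype(np.float32)  # stand-in for a real data set
>>
>>     t0 = time.perf_counter()
>>     exact = np.quantile(data, 0.99)      # brute force: needs the full data in memory
>>     t1 = time.perf_counter()
>>
>>     sk = kll_floats_sketch(200)
>>     for x in data:                       # streaming: one pass, bounded memory
>>         sk.update(float(x))
>>     approx = sk.get_quantile(0.99)
>>     t2 = time.perf_counter()
>>
>>     print(f"exact {exact:.4f} in {t1 - t0:.2f}s; sketch {approx:.4f} in {t2 - t1:.2f}s")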
>>
>> We don't have much experience with NumPy so this will be a new
>> environment for us.  But before you get too deep into this please get us
>> involved.  We have been characterizing these streaming algorithms for a
>> number of years, and would like to help you.
>>
>> Cheers,
>>
>> Lee.
>>
>> On Sat, May 9, 2020 at 2:18 PM Michael Himes <mh...@knights.ucf.edu>
>> wrote:
>>
>> I'm not quite sure what being a committer entails, but yeah I'm happy to
>> contribute.  I can't commit a lot of time to working on it, but with how
>> things went for KLL I don't think it will take a lot of time for the other
>> sketches if they are formatted in a similar manner.  Getting this library
>> integrated into numpy/scipy would be awesome, I'm sure I could get some
>> others in my field to begin using it.
>>
>> Michael
>> ------------------------------
>> *From:* Lee Rhodes <lr...@verizonmedia.com>
>> *Sent:* Saturday, May 9, 2020 5:06 PM
>> *To:* Michael Himes <mh...@knights.ucf.edu>; dev@datasketches.apache.org
>> <de...@datasketches.apache.org>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> This is just awesome!   Would you be interested in becoming a committer
>> on our project?  It is not automatic, but we could work with you to bring
>> you up to speed on the other sketches in the library.  If you could help us
>> integrate DataSketches into NumPy and possibly SciPy (not sure if this is
>> necessary) it would be a very significant contribution and we would
>> definitely want you to be part of our community!
>>
>> Thanks,
>>
>> Lee.
>>
>> On Sat, May 9, 2020 at 1:41 PM Michael Himes <mh...@knights.ucf.edu>
>> wrote:
>>
>> Hi Lee,
>>
>> Thanks for the notice, I went ahead and subscribed to the list.
>>
>> As for Jon's email, this is actually what I have currently implemented!
>> Once I finish ironing out a couple improvements, I'm going to move some
>> code around to follow the existing coding style, put it on Github, and
>> submit a pull request.
>>
>> Michael
>> ------------------------------
>> *From:* Lee Rhodes <lr...@verizonmedia.com>
>> *Sent:* Saturday, May 9, 2020 4:22 PM
>> *To:* Michael Himes <mh...@knights.ucf.edu>
>> *Subject:* Fwd: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> Hi Michael,
>> I don't think you saw this email as I doubt you are subscribed to our
>> dev@datasketches.apache.org email list.
>>
>> We would like to have you as part of our larger community, as others
>> might also have suggestions on how to move your project forward.
>> You can subscribe by sending an empty email to
>> dev-subscribe@datasketches.apache.org.
>>
>> Lee.
>>
>> ---------- Forwarded message ---------
>> From: *Jon Malkin* <jo...@gmail.com>
>> Date: Thu, May 7, 2020 at 4:11 PM
>> Subject: Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>> To: <de...@datasketches.apache.org>
>> Cc: Lee Rhodes <lr...@verizonmedia.com>, Edo Liberty <
>> edo.liberty@gmail.com>, edo@edoliberty.com <ed...@edoliberty.com>
>>
>>
>> We're using pybind11 to get a C++ interface with python (vs raw C). The
>> wrappers themselves are quite thin, but they do have examples of calling
>> functions defined in the wrapper as opposed to only the sketch object.
>>
>> I believe the easiest way to do this will be to define a pretty simple
>> C++ object and create a pybind wrapper for it.  That object would contain a
>> std::vector<kll_sketch>.  Then you'd define an update method for your
>> custom object that iterates through a numpy array and calls update() on the
>> appropriate sketch. You'd also want to define something similar for
>> get_quantile() or whatever other methods you need that iterates through
>> that vector of sketches and returns the result in a numpy array.
>>
>> That's a pretty lightweight object. And then you'd use a similar thin
>> pybind wrapper around it to make it play nicely with python. Since our C++
>> library is just templates, you'd end up with a free-standing library, with
>> no requirement that the base datasketches library be involved.
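>>
>> To make the shape of that interface concrete, here is a pure-Python mock-up
>> with invented names (my illustration, not the proposed C++ code; the real
>> object would live in C++ behind a pybind11 wrapper, which is what makes the
>> per-element loop fast):
>>
>>     import numpy as np
>>     from datasketches import kll_floats_sketch
>>
>>     class VectorOfKllSketches:          # hypothetical name
>>         def __init__(self, num_dims, k=200):
>>             self.sketches = [kll_floats_sketch(k) for _ in range(num_dims)]
>>
>>         def update(self, values):       # values: 1-D numpy array, one entry per dimension
>>             for sk, v in zip(self.sketches, values):
>>                 sk.update(float(v))     # this inner loop is what would move into C++
>>
>>         def get_quantiles(self, rank):
>>             return np.array([sk.get_quantile(rank) for sk in self.sketches])
>>
>> Usage would then be something like vs = VectorOfKllSketches(100_000), calling
>> vs.update(row) once per incoming vector and vs.get_quantiles(0.5) to get a
>> numpy array of per-dimension medians.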
>>
>>   jon
>>
>> On Thu, May 7, 2020 at 1:08 PM Michael Himes <mh...@knights.ucf.edu>
>> wrote:
>>
>> I would be happy to share whatever I come up with (if anything).  The
>> lack of a Numpy/Scipy implementation is what led me to the DataSketches
>> library, it would be very useful to myself and others if it were a part of
>> Numpy/Scipy.
>>
>> For what it's worth, passing in a Numpy array and manipulating it from
>> the C++ side is quite easy.  On the other hand, figuring out how to spawn m
>> sketches and pass the values along to that looks like it'll be more
>> challenging, there is a lot of code here and it'll take some time for me to
>> familiarize myself with it.
>>
>> Michael
>> ------------------------------
>> *From:* Lee Rhodes <lr...@verizonmedia.com>
>> *Sent:* Thursday, May 7, 2020 12:00 PM
>> *To:* Michael Himes <mh...@knights.ucf.edu>
>> *Cc:* Edo Liberty <ed...@gmail.com>; dev@datasketches.apache.org <
>> dev@datasketches.apache.org>; edo@edoliberty.com <ed...@edoliberty.com>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> If you do figure out how to do this, it would be great if you could share
>> it with us.  We would like to extend it to other sketches and submit it as
>> an added functionality to NumPy.  I have been looking at the NumPy and
>> SciPy libraries and have not found anything close to what we have.
>>
>> Lee.
>>
>>
>> On Thu, May 7, 2020 at 7:08 AM Michael Himes <mh...@knights.ucf.edu>
>> wrote:
>>
>> Hi Lee, Jon,
>>
>> Thanks for the information.  I tried to vectorize things this morning and
>> ran into that exact problem -- since the offsets can differ, it leads to
>> slices of different lengths, which wouldn't be possible to store as a
>> single Numpy array.
>>
>> Lee, your understanding of my problem is spot on.  n vectors of size m,
>> where all m elements of each vector are a float (no NaNs or missing
>> values).  I am interested in quantiles at rank r for each of the m
>> streams.  Only 1 sketch will operate simultaneously, saving/loading the
>> sketch is not required (though it would be a nice feature), and sketches
>> would not need to be merged (no serialization/deserialization).
>>
>> Not surprisingly, it looks like your original suggestion of handling this
>> on the C++ side is the way to go.  Once I have time to dive into the code,
>> my plan is to write something that implements what you described in the
>> earlier email.
>>
>> Thanks,
>> Michael
>> ------------------------------
>> *From:* Lee Rhodes <lr...@verizonmedia.com>
>> *Sent:* Wednesday, May 6, 2020 10:43 PM
>> *To:* Michael Himes <mh...@knights.ucf.edu>
>> *Cc:* dev@datasketches.apache.org <de...@datasketches.apache.org>; Edo
>> Liberty <ed...@gmail.com>; edo@edoliberty.com <ed...@edoliberty.com>
>>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> Michael,
>>
>> One of my colleagues, Jon Malkin, pointed out that the vector-KLL will
>> not work for another reason and that is for each dimension, choosing
>> whether to delete the odd or even values in the compactor must be random
>> and independent of the other dimensions.  Otherwise you might get unwanted
>> correlation effects between the dimensions.
>>
>> This is another argument that you should have independent compactors for
>> each dimension.  So you might as well stick with individual sketches for
>> each dimension.
>>
>> Lee.
>>
>> On Wed, May 6, 2020 at 4:39 PM Lee Rhodes <lr...@verizonmedia.com>
>> wrote:
>>
>> Michael,
>>
>> Allow me to back up for a moment to make sure I understand your problem.
>>
>> You have a large number of large vectors of the form *V_n = {x_i}*:  *n*
>> vectors of size *m*, where *x* is a *number* and *x_i* is the *i*th
>> element, or equivalently, the *i*th dimension.
>>
>> Assumptions:
>>
>>    - All vectors, *V*, are of the same size *m.*
>>    - All elements, *x_i*, are valid numbers of the same type. No missing
>>    values, and if you are using *floats*, this means no *NaN*s.
>>
>> In aggregate, the *n* vectors represent *m* *independent* distributions
>> of values.
>>
>> Your task is to be able to obtain *m* quantiles at rank *r* in a single
>> query.
>>
>> ****
>> To do this, using your idea, would require vectorization of the entire
>> sketch and not just the compactors.  The inputs are vectors, the result of
>> operations such as getQuantile(r), getQuantileUpperBound(r),
>> getQuantileLowerBound(r), are also vectors.
>>
>> This sketch will be a large data structure, which leads to more questions
>> ...
>>
>>    - Do you anticipate having many of these vectorized sketches
>>    operating simultaneously?
>>    - Is there any requirement to store and later retrieve this sketch?
>>    - Or, the nearly equivalent question: Do you require merging of these
>>    sketches (across clusters, for example)?  Which also means serialization
>>    and deserialization.
>>
>> I am concerned that this vector-quantiles sketch would be limited in the
>> sense that it may not be as widely applicable as it could be.
>>
>> Our experience with real data is that it is ugly with missing values,
>> NaN, nulls, etc.  Which means we would not be able to vectorize the
>> compactor.  Each dimension *i* would need a separate independent
>> compactor because the compaction times will vary depending on missing
>> values or NaNs in the data.
>>
>> Spacewise, I don't think having separate independent sketches for each
>> dimension would be much smaller than vectorizing the entire sketch, because
>> the internals of the existing sketch are already quite space efficient
>> leveraging compact arrays, etc.
>>
>> As a first step I would favor figuring out how to access the NumPy data
>> structure on the C++ side, having individual sketches for each
>> dimension, and doing the iterations updating the sketches in C++.   It also
>> has the advantage of leveraging code that exists and it would automatically
>> be able to leverage any improvements to the sketch code over time.  In
>> addition, it could be a prototype of how to integrate other sketches into
>> the NumPy ecosystem.
>>
>> A fully vectorized sketch would be a separate implementation and would
>> not be able to take advantage of these points.
>>
>> Lee.
>>
>> On Wed, May 6, 2020 at 2:47 PM Michael Himes <mh...@knights.ucf.edu>
>> wrote:
>>
>> Hi Lee,
>>
>> I don't think there is a problem with the DataSketches library, just that
>> it doesn't support what I am trying to do -- looking in the documentation,
>> it only supports streams of ints or floats, and those situations work fine
>> for me.  Here's what I did:
>> - began with the KLL test .py file:
>> https://github.com/apache/incubator-datasketches-cpp/blob/master/python/tests/kll_test.py
>> - replaced line 30 with kll.update(np.ones(10) * randn())  to have a
>> Numpy array of 10 identical values.
>> - ran the code
>>
>> This leads to the following error, as expected:
>> TypeError: update(): incompatible function arguments. The following
>> argument types are supported:
>>     1. (self: datasketches.kll_floats_sketch, item: float) -> None
>>
>> Invoked with: <datasketches.kll_floats_sketch object at 0x7f1e128989d0>,
>> array([-1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424,
>>        -1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424])
>>
>> It's not coded to support Numpy arrays, so it complains.  What I
>> would ideally like to have happen in this scenario is it would treat each
>> element in the array as a separate stream.  Then, later when getting a
>> given quantile, it would give 10 values, one for each stream.  I don't see
>> an easy approach to implementing this on the Python side besides a very
>> slow iterative approach, and admittedly my C++ is quite rusty so I haven't
>> looked into the codebase to see how I might modify things there to support
>> this functionality.
>>
>> Re: the streaming-quantiles code being easily modified, I believe the
>> only necessary changes would be changing the Compactor class to be a
>> subclass of numpy.ndarray, rather than list, and implementing methods for
>> the list-specific methods that are used, like .append().  Then, it isn't
>> necessary to loop over the streams since we can make use of Numpy's
>> broadcasting, which will handle the looping in its C++ code, as you
>> mentioned.  I'll work on this and see if it really is as straight-forward
>> as it seems.
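>>
>> As a very rough sketch of that vectorized-compaction idea (my illustration,
>> not Edo's code), the core step for m parallel streams might look like the
>> snippet below; note that it draws the odd/even choice independently per
>> column, which, as pointed out elsewhere in this thread, is needed to avoid
>> correlation between dimensions:
>>
>>     import numpy as np
>>
>>     def compact_columns(buffer, rng=None):
>>         # buffer: (h, m) array, one column per stream; h must be even.
>>         # Returns the (h/2, m) surviving half, to be promoted with doubled weight.
>>         if rng is None:
>>             rng = np.random.default_rng()
>>         h, m = buffer.shape
>>         buffer = np.sort(buffer, axis=0)       # sort each stream's buffer independently
>>         offsets = rng.integers(0, 2, size=m)   # independent odd/even choice per stream
>>         rows = offsets[None, :] + 2 * np.arange(h // 2)[:, None]
>>         return np.take_along_axis(buffer, rows, axis=0)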
>>
>> If you have any advice on how to use DataSketches for my problem, I'm
>> certainly open to that.
>>
>> Thanks,
>> Michael
>> ------------------------------
>> *From:* Lee Rhodes <lr...@verizonmedia.com>
>> *Sent:* Wednesday, May 6, 2020 4:37 PM
>> *To:* Michael Himes <mh...@knights.ucf.edu>; dev@datasketches.apache.org
>> <de...@datasketches.apache.org>
>> *Cc:* Edo Liberty <ed...@gmail.com>; edo@edoliberty.com <
>> edo@edoliberty.com>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> Michael,
>>
>> Thank you for considering the DataSketches library.   I am adding this
>> thread to our dev@datasketches.apache.org so that our whole team can
>> contribute to finding a solution for you.
>>
>> WRT the error you experienced, please help us help you by sharing with us
>> what the exact error was.
>>
>> We are about to release a major upgrade to the DataSketches C++/Python
>> product in the next few weeks.  We have fixed a number of stability issues
>> and bugs, which may solve the problem.  Nonetheless, we want to work with
>> you to get your problem solved.
>>
>> Updating 1e5 sketches in a system is not a problem in Java or C++.   We
>> have real-time systems today that generate and process over 1e9 sketches
>> every day.  Unfortunately our experience tells us that looping in Python
>> code will be 10 to 100 times slower than Java or C++.  This is because the
>> code would have to switch from Python to C++ for every vector element.
>>
>> By comparison, the streaming-quantiles code could be easily modified to
>> use Numpy arrays and operate on vectors.
>>
>>
>> I would like to understand more about what you have in mind that would be
>> "easily modified".
>>
>> NumPy achieves its speed performance by doing all of the matrix
>> operations in pre-compiled C++ code.  To achieve best performance, we would
>> want to read and loop through the NumPy data structure on the C++ side
>> leveraging the C++ DataSketches library directly.  I am not sure what would
>> be involved to actually accomplish that.
>>
>> But first we need to get your Python + NumPy code working correctly with
>> our library so we can find out what its actual performance is.
>>
>> Cheers,
>>
>> Lee.
>>
>>
>>
>>
>>
>> On Wed, May 6, 2020 at 12:10 PM Michael Himes <mh...@knights.ucf.edu>
>> wrote:
>>
>> Hi Edo, Lee,
>>
>> Thanks for the prompt response.  I looked at the datasketches library,
>> and while it seems to have a lot more features, it looks like it'll be a
>> lot more difficult to get it to work for my desired use case.
>>
>> My problem is that I need quantiles for each element of a vector (length
>> on the order of 1e4 -- 1e5), for some finite stream of vectors (on the
>> order of 1e6 -- 1e8).  I tried using datasketches's KLL with Numpy arrays,
>> but it throws an error, so it doesn't seem like datasketches handles this
>> situation currently.
>>
>> To use datasketches, I think I would need to instantiate 1 object per
>> vector element, and I suspect this will slow things down considerably due
>> to iterating over the objects when each vector is processed.  By
>> comparison, the streaming-quantiles code could be easily modified to use
>> Numpy arrays and operate on vectors.  I ran a few unit tests on both codes
>> and found equivalent behavior, as expected.
>>
>> Do you have any recommendation(s) for this situation?  Are there known
>> limitations of the streaming-quantiles code that would cause issues for my
>> use case?  Are the other methods offered in datasketches 'better' than the
>> KLL implemented in streaming-quantiles?  I'm quite out of my area of
>> expertise, so I appreciate any advice you can offer, and I will of course
>> acknowledge it in the publication.
>>
>> Best,
>> Michael
>>
>> ------------------------------
>> *From:* Edo Liberty <ed...@gmail.com>
>> *Sent:* Tuesday, May 5, 2020 8:09 PM
>> *To:* Lee Rhodes <lr...@verizonmedia.com>; Michael Himes <
>> mhimes@knights.ucf.edu>
>> *Cc:* edo@edoliberty.com <ed...@edoliberty.com>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> +Lee
>>
>> Hi Michael, Thanks for reaching out.
>> While you can certainly do that, I recommend using the Python-bound
>> datasketches library. It will be more robust, faster, and less buggy than my
>> code :)
>>
>> On Tue, May 5, 2020 at 14:11 Michael Himes <mh...@knights.ucf.edu>
>> wrote:
>>
>> Hi Edo,
>>
>> I'm currently working on a Python package for
>> machine-learning-accelerated exoplanet modeling.  It is free and open
>> source (see here if you're curious https://github.com/exosports/HOMER),
>> and it's meant purely for reproducible academic research.
>>
>> I'm adding some new features to the software, and one of them requires
>> computing quantiles for a data set that cannot fit into memory.  After
>> searching around for different methods to do this, your KLL method seemed
>> to be a good option in terms of speed and space requirements.
>>
>> Rather than reinvent the wheel and code my own implementation of the
>> method from scratch, I was wondering if you'd be willing to allow me to use
>> your code?  I don't see a license, so I wanted to make sure you're okay
>> with this.  I could implement it as a submodule within my repo, or I could
>> only include the kll.py file and add some additional comments pointing to
>> your repo and such, whichever you prefer.
>>
>> Best,
>> Michael
>>
>> --
>> From my cell phone.
>>
>>

Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Posted by leerho <le...@gmail.com>.
Michael,
This is great!  What testing have you been able to do so far?


On Sun, May 10, 2020 at 3:31 PM Michael Himes <mh...@knights.ucf.edu>
wrote:

> Lee,
>
> Thanks for all of that information, it's quite helpful to get a better
> understanding of things.
>
> I've put the code on Github if you'd like to take a look:
> https://github.com/mdhimes/incubator-datasketches-cpp
>
> Changes are
> - new class in kll/include/kll_sketch.hpp, w/ associated constructor in
> kll/include/kll_sketch_impl.hpp.  This class spawns a specified number of
> sketches.
> - new Python interface functions in python/src/kll_wrapper.cpp
>
> The only new dependency introduced is the pybind11/numpy.h header file.
> The new Numpy-compatible Python classes retain identical functionality to
> the existing classes (with minor changes to method names, e.g.,
> get_min_value --> get_min_values), except that I have not yet implemented
> merging or (de)serialization.  These would be straight-forward to
> implement, if needed.
>
> Re: characterization tests, I'll take a look at those tests you linked to
> and see about running them, time and compute resources permitting.
>
> Michael
> ------------------------------
> *From:* leerho <le...@gmail.com>
> *Sent:* Sunday, May 10, 2020 5:32 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Michael,
>
> Is there a place on GitHub somewhere where I could look at your code so
> far?  The reason I ask, is before you do a PR, we would like to determine
> where a contribution such as this should be placed.
>
> Our library is split up among different repositories, determined by
> language and dependencies.  This keeps the user downloads smaller and more
> focused.   We have two library repos for the core sketch algorithms, one
> for Java and one for C++/Python, where the dependencies are very lean,
> which simplifies integration into other systems.  We have separate repos
> for adaptors, which depend on one of the core repos. On the Java side, we
> have separate repos for adaptors for Apache Hive and Apache Pig, as the
> dependencies for each of these are quite large.  For C++, we have a
> dedicated repo for the adaptors for PostgreSQL.
>
> Some of our adaptors are hosted with the target system.  For example, our
> Druid adaptors were contributed directly into Apache Druid.
>
> I assume your code has dependencies on Python, NumPy and DataSketches-cpp.
> It is not clear to me at the moment whether we should create a separate
> repo for this or have a separate group of directories in our cpp repo.
>
> ****
> We have a separate repo for our characterization code, which is not
> formally "released" as an Apache release.  It exists because we want others
> to be able to reproduce (or challenge) our claims of speed performance or
> accuracy.  It is the one repo where we have all languages and many
> different dependencies.  The coding style is not as rigorous or as well
> documented as our repos that do have formal releases.
>
> Characterization testing is distinctly different from Unit Tests, which
> basically checks all the main code paths and makes sure that the program
> works as it should.  The key metric is code coverage and Unit Tests should
> be fast as it is run on every check-in of new code.  Characterization is
> also different from Integration Testing, which is testing how well the code
> works when integrated into larger systems.
>
> Characterization tests are unique to our kind of library. Because our
> algorithms are probabilistic in nature, in order to verify accuracy or
> speed performance we need to run many thousands of trials to eliminate
> statistical noise in the results.  And when the data is large, this can
> take a long time.  You can peruse our website for many examples as all the
> plots result from various characterization studies.  What appears on the
> website is but a small fraction of all the testing we have done.
>
> There are no "standard" tests as every sketch is different so we have to
> decide what is important to measure for a particular sketch, but the basic
> groups are *speed* and *accuracy*.
>
> For speed there are many possible measurements, but the basic ones are
> update speed, merge speed, Serialization / Deserialization speed, get
> estimate or get result speeds.
>
> For accuracy we want to validate that the sketch is performing within the
> bounds of the theoretical error distribution.  We want to measure this
> accuracy in the context of a stand-alone, purely streaming sketch and also
> in the context of merging many sketches together.
>
> We also try to do these same tests comparing the results against other
> alternatives users might have.  We have performed these same
> characterizations on other publicly available sketches as well as against
> traditional, brute-force approaches to solving the same problem.
>
> For the solution you have developed, we would depend on you to decide what
> properties would be most important to characterize for users of this
> solution.  It should be very similar to what you would write in a paper
> describing this solution;  you want to convince the reader that this is
> very useful and why.
>
> Since the first sketch you have leveraged is the KLL quantiles sketch, I
> would think you would want some characterizations similar to what we did
> for our studies
> <https://datasketches.apache.org/docs/Quantiles/KLLSketch.html>
> comparing our older quantiles sketch and the KLL sketch.
>
> ****
> For the Java characterization tests, we have "standardized" on having
> small configuration files which define the key parameters of the test.
> These are simple text files
> <https://github.com/apache/incubator-datasketches-characterization/tree/master/src/main/resources>
> of key-value pairs.  We don't have any centralized definition of these
> pairs, just that they are human readable and intelligible.  They are
> different for each type of sketch.
>
> For the C++ tests, we don't have a collection of config files yet (this is
> one of our TODOs), but the same kind of parameters are set in the code
> itself.
>
> We will likely want to set up a separate directory for your
> characterization tests.
>
> I hope you find this helpful.
>
> Cheers,
>
> Lee.
>
> On Sun, May 10, 2020 at 10:05 AM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> The code is in a good state now.  It can take individual values, lists, or
> Numpy arrays as input, and it returns back Numpy arrays.  There are some
> additional features, like being able to specify which sketches the user
> wants to, e.g., get quantiles for.
>
> But, I have only done minor testing with uniform and normal
> distributions.  I'd like to put it through more extensive testing (and some
> documentation) before releasing it, and it sounds like your
> characterization tests are the way to go -- it's not science if it's not
> reproducible!  Is there a standard set of tests for this purpose?  If not,
> are there standard tests that have been used for the existing codebase?
>
> Michael
> ------------------------------
> *From:* leerho <le...@gmail.com>
> *Sent:* Saturday, May 9, 2020 7:21 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> This is great.  The first step is to get your project working!  Once you
> think you are ready, it would be really useful if you could do some
> characterization testing in the NumPy environment. Characterization tests
> are what we run to fully understand how a sketch performs over a range of
> parameters and using thousands to millions of trials.  You can see some of
> the accuracy and speed performance plots of various sketches on our
> website.  Sometimes these can take hours to run.  We typically use
> synthetic data to drive our characterization tests to make them
> reproducible.
>
> Real data can also be used and one comparison test I would recommend is
> comparing how long it takes to get approximate results using sketches
> versus how long it would take to get exact results using brute force
> methods.  The bigger the data set is the better :)
>
> We don't have much experience with NumPy so this will be a new environment
> for us.  But before you get too deep into this please get us involved.  We
> have been characterizing these streaming algorithms for a number of years,
> and would like to help you.
>
> Cheers,
>
> Lee.
>
> On Sat, May 9, 2020 at 2:18 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> I'm not quite sure what being a committer entails, but yeah I'm happy to
> contribute.  I can't commit a lot of time to working on it, but with how
> things went for KLL I don't think it will take a lot of time for the other
> sketches if they are formatted in a similar manner.  Getting this library
> integrated into numpy/scipy would be awesome, I'm sure I could get some
> others in my field to begin using it.
>
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Saturday, May 9, 2020 5:06 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>; dev@datasketches.apache.org
> <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> This is just awesome!   Would you be interested in becoming a committer on
> our project?  It is not automatic, but we could work with you to bring you
> up to speed on the other sketches in the library.  If you could help us
> integrate DataSketches into NumPy and possibly SciPy (not sure if this is
> necessary) it would be a very significant contribution and we would
> definitely want you to be part of our community!
>
> Thanks,
>
> Lee.
>
> On Sat, May 9, 2020 at 1:41 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Lee,
>
> Thanks for the notice, I went ahead and subscribed to the list.
>
> As for Jon's email, this is actually what I have currently implemented!
> Once I finish ironing out a couple improvements, I'm going to move some
> code around to follow the existing coding style, put it on Github, and
> submit a pull request.
>
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Saturday, May 9, 2020 4:22 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>
> *Subject:* Fwd: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Hi Michael,
> I don't think you saw this email as I doubt you are subscribed to our
> dev@datasketches.apache.org email list.
>
> We would like to have you as part of our larger community, as others might
> also have suggestions on how to move your project forward.
> You can subscribe by sending an empty email to
> dev-subscribe@datasketches.apache.org.
>
> Lee.
>
> ---------- Forwarded message ---------
> From: *Jon Malkin* <jo...@gmail.com>
> Date: Thu, May 7, 2020 at 4:11 PM
> Subject: Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
> To: <de...@datasketches.apache.org>
> Cc: Lee Rhodes <lr...@verizonmedia.com>, Edo Liberty <
> edo.liberty@gmail.com>, edo@edoliberty.com <ed...@edoliberty.com>
>
>
> We're using pybind11 to get a C++ interface with python (vs raw C). The
> wrappers themselves are quite thin, but they do have examples of calling
> functions defined in the wrapper as opposed to only the sketch object.
>
> I believe the easiest way to do this will be to define a pretty simple C++
> object and create a pybind wrapper for it.  That object would contain a
> std::vector<kll_sketch>.  Then you'd define an update method for your
> custom object that iterates through a numpy array and calls update() on the
> appropriate sketch. You'd also want to define something similar for
> get_quantile() or whatever other methods you need that iterates through
> that vector of sketches and returns the result in a numpy array.
>
> That's a pretty lightweight object. And then you'd use a similar thin
> pybind wrapper around it to make it play nicely with python. Since our C++
> library is just templates, you'd end up with a free-standing library, with
> no requirement that the base datasketches library be involved.
>
>   jon
>
> On Thu, May 7, 2020 at 1:08 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> I would be happy to share whatever I come up with (if anything).  The lack
> of a Numpy/Scipy implementation is what led me to the DataSketches library,
> it would be very useful to myself and others if it were a part of
> Numpy/Scipy.
>
> For what it's worth, passing in a Numpy array and manipulating it from the
> C++ side is quite easy.  On the other hand, figuring out how to spawn m
> sketches and pass the values along to that looks like it'll be more
> challenging, there is a lot of code here and it'll take some time for me to
> familiarize myself with it.
>
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Thursday, May 7, 2020 12:00 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>
> *Cc:* Edo Liberty <ed...@gmail.com>; dev@datasketches.apache.org <
> dev@datasketches.apache.org>; edo@edoliberty.com <ed...@edoliberty.com>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> If you do figure out how to do this, it would be great if you could share
> it with us.  We would like to extend it to other sketches and submit it as
> an added functionality to NumPy.  I have been looking at the NumPy and
> SciPy libraries and have not found anything close to what we have.
>
> Lee.
>
>
> On Thu, May 7, 2020 at 7:08 AM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Lee, Jon,
>
> Thanks for the information.  I tried to vectorize things this morning and
> ran into that exact problem -- since the offsets can differ, it leads to
> slices of different lengths, which wouldn't be possible to store as a
> single Numpy array.
>
> Lee, your understanding of my problem is spot on.  n vectors of size m,
> where all m elements of each vector are a float (no NaNs or missing
> values).  I am interested in quantiles at rank r for each of the m
> streams.  Only 1 sketch will operate simultaneously, saving/loading the
> sketch is not required (though it would be a nice feature), and sketches
> would not need to be merged (no serialization/deserialization).
>
> Not surprisingly, it looks like your original suggestion of handling this
> on the C++ side is the way to go.  Once I have time to dive into the code,
> my plan is to write something that implements what you described in the
> earlier email.
>
> Thanks,
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Wednesday, May 6, 2020 10:43 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>
> *Cc:* dev@datasketches.apache.org <de...@datasketches.apache.org>; Edo
> Liberty <ed...@gmail.com>; edo@edoliberty.com <ed...@edoliberty.com>
>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Michael,
>
> One of my colleagues, Jon Malkin, pointed out that the vector-KLL will not
> work for another reason and that is for each dimension, choosing whether to
> delete the odd or even values in the compactor must be random and
> independent of the other dimensions.  Otherwise you might get unwanted
> correlation effects between the dimensions.
>
> This is another argument that you should have independent compactors for
> each dimension.  So you might as well stick with individual sketches for
> each dimension.
>
> Lee.
>
> On Wed, May 6, 2020 at 4:39 PM Lee Rhodes <lr...@verizonmedia.com>
> wrote:
>
> Michael,
>
> Allow me to back up for a moment to make sure I understand your problem.
>
> You have a large number of large vectors of the form *V_n = {x_i}*:  *n*
> vectors of size *m*, where *x* is a *number* and *x_i* is the *i*th
> element, or equivalently, the *i*th dimension.
>
> Assumptions:
>
>    - All vectors, *V*, are of the same size *m.*
>    - All elements, *x_i*, are valid numbers of the same type. No missing
>    values, and if you are using *floats*, this means no *NaN*s.
>
> In aggregate, the *n* vectors represent *m* *independent* distributions
> of values.
>
> Your task is to be able to obtain *m* quantiles at rank *r* in a single
> query.
>
> ****
> To do this, using your idea, would require vectorization of the entire
> sketch and not just the compactors.  The inputs are vectors, the result of
> operations such as getQuantile(r), getQuantileUpperBound(r),
> getQuantileLowerBound(r), are also vectors.
>
> This sketch will be a large data structure, which leads to more questions
> ...
>
>    - Do you anticipate having many of these vectorized sketches operating
>    simultaneously?
>    - Is there any requirement to store and later retrieve this sketch?
>    - Or, the nearly equivalent question: Do you require merging of these
>    sketches (across clusters, for example)?  Which also means serialization
>    and deserialization.
>
> I am concerned that this vector-quantiles sketch would be limited in the
> sense that it may not be as widely applicable as it could be.
>
> Our experience with real data is that it is ugly with missing values, NaN,
> nulls, etc.  Which means we would not be able to vectorize the compactor.
> Each dimension *i* would need a separate independent compactor because
> the compaction times will vary depending on missing values or NaNs in the
> data.
>
> Spacewise, I don't think having separate independent sketches for each
> dimension would be much smaller than vectorizing the entire sketch, because
> the internals of the existing sketch are already quite space efficient
> leveraging compact arrays, etc.
>
> As a first step I would favor figuring out how to access the NumPy data
> structure on the C++ side, having individual sketches for each
> dimension, and doing the iterations updating the sketches in C++.   It also
> has the advantage of leveraging code that exists and it would automatically
> be able to leverage any improvements to the sketch code over time.  In
> addition, it could be a prototype of how to integrate other sketches into
> the NumPy ecosystem.
>
> A fully vectorized sketch would be a separate implementation and would not
> be able to take advantage of these points.
>
> Lee.
>
> On Wed, May 6, 2020 at 2:47 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Lee,
>
> I don't think there is a problem with the DataSketches library, just that
> it doesn't support what I am trying to do -- looking in the documentation,
> it only supports streams of ints or floats, and those situations work fine
> for me.  Here's what I did:
> - began with the KLL test .py file:
> https://github.com/apache/incubator-datasketches-cpp/blob/master/python/tests/kll_test.py
> - replaced line 30 with kll.update(np.ones(10) * randn())  to have a Numpy
> array of 10 identical values.
> - ran the code
>
> This leads to the following error, as expected:
> TypeError: update(): incompatible function arguments. The following
> argument types are supported:
>     1. (self: datasketches.kll_floats_sketch, item: float) -> None
>
> Invoked with: <datasketches.kll_floats_sketch object at 0x7f1e128989d0>,
> array([-1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424,
>        -1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424])
>
> It's not coded to support Numpy arrays, so it complains.  What I
> would ideally like to have happen in this scenario is it would treat each
> element in the array as a separate stream.  Then, later when getting a
> given quantile, it would give 10 values, one for each stream.  I don't see
> an easy approach to implementing this on the Python side besides a very
> slow iterative approach, and admittedly my C++ is quite rusty so I haven't
> looked into the codebase to see how I might modify things there to support
> this functionality.
>
> Re: the streaming-quantiles code being easily modified, I believe the only
> necessary changes would be changing the Compactor class to be a subclass of
> numpy.ndarray, rather than list, and implementing methods for the
> list-specific methods that are used, like .append().  Then, it isn't
> necessary to loop over the streams since we can make use of Numpy's
> broadcasting, which will handle the looping in its C++ code, as you
> mentioned.  I'll work on this and see if it really is as straight-forward
> as it seems.
>
> If you have any advice on how to use DataSketches for my problem, I'm
> certainly open to that.
>
> Thanks,
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Wednesday, May 6, 2020 4:37 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>; dev@datasketches.apache.org
> <de...@datasketches.apache.org>
> *Cc:* Edo Liberty <ed...@gmail.com>; edo@edoliberty.com <
> edo@edoliberty.com>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Michael,
>
> Thank you for considering the DataSketches library.   I am adding this
> thread to our dev@datasketches.apache.org so that our whole team can
> contribute to finding a solution for you.
>
> WRT the error you experienced, please help us help you by sharing with us
> what the exact error was.
>
> We are about to release a major upgrade to the DataSketches C++/Python
> product in the next few weeks.  We have fixed a number of stability issues
> and bugs, which may solve the problem.  Nonetheless, we want to work with
> you to get your problem solved.
>
> Updating 1e5 sketches in a system is not a problem in Java or C++.   We
> have real-time systems today that generate and process over 1e9 sketches
> every day.  Unfortunately our experience tells us that looping in Python
> code will be 10 to 100 times slower than Java or C++.  This is because the
> code would have to switch from Python to C++ for every vector element.
>
> By comparison, the streaming-quantiles code could be easily modified to
> use Numpy arrays and operate on vectors.
>
>
> I would like to understand more about what you have in mind that would be
> "easily modified".
>
> NumPy achieves its speed performance by doing all of the matrix operations
> in pre-compiled C++ code.  To achieve best performance, we would want to
> read and loop through the NumPy data structure on the C++ side leveraging
> the C++ DataSketches library directly.  I am not sure what would be
> involved to actually accomplish that.
>
> But first we need to get your Python + NumPy code working correctly with
> our library so we can find out what its actual performance is.
>
> Cheers,
>
> Lee.
>
>
>
>
>
> On Wed, May 6, 2020 at 12:10 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Edo, Lee,
>
> Thanks for the prompt response.  I looked at the datasketches library, and
> while it seems to have a lot more features, it looks like it'll be a lot
> more difficult to get it to work for my desired use case.
>
> My problem is that I need quantiles for each element of a vector (length
> on the order of 1e4 -- 1e5), for some finite stream of vectors (on the
> order of 1e6 -- 1e8).  I tried using datasketches's KLL with Numpy arrays,
> but it throws an error, so it doesn't seem like datasketches handles this
> situation currently.
>
> To use datasketches, I think I would need to instantiate 1 object per
> vector element, and I suspect this will slow things down considerably due
> to iterating over the objects when each vector is processed.  By
> comparison, the streaming-quantiles code could be easily modified to use
> Numpy arrays and operate on vectors.  I ran a few unit tests on both codes
> and found equivalent behavior, as expected.
>
> Do you have any recommendation(s) for this situation?  Are there known
> limitations of the streaming-quantiles code that would cause issues for my
> use case?  Are the other methods offered in datasketches 'better' than the
> KLL implemented in streaming-quantiles?  I'm quite out of my area of
> expertise, so I appreciate any advice you can offer, and I will of course
> acknowledge it in the publication.
>
> Best,
> Michael
>
> ------------------------------
> *From:* Edo Liberty <ed...@gmail.com>
> *Sent:* Tuesday, May 5, 2020 8:09 PM
> *To:* Lee Rhodes <lr...@verizonmedia.com>; Michael Himes <
> mhimes@knights.ucf.edu>
> *Cc:* edo@edoliberty.com <ed...@edoliberty.com>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> +Lee
>
> Hi Michael, Thanks for reaching out.
> While you can certainly do that, I recommend using the Python-bound
> datasketches library. It will be more robust, faster, and less buggy than my
> code :)
>
> On Tue, May 5, 2020 at 14:11 Michael Himes <mh...@knights.ucf.edu> wrote:
>
> Hi Edo,
>
> I'm currently working on a Python package for machine-learning-accelerated
> exoplanet modeling.  It is free and open source (see here if you're curious
> https://github.com/exosports/HOMER),
> and it's meant purely for reproducible academic research.
>
> I'm adding some new features to the software, and one of them requires
> computing quantiles for a data set that cannot fit into memory.  After
> searching around for different methods to do this, your KLL method seemed
> to be a good option in terms of speed and space requirements.
>
> Rather than reinvent the wheel and code my own implementation of the
> method from scratch, I was wondering if you'd be willing to allow me to use
> your code?  I don't see a license, so I wanted to make sure you're okay
> with this.  I could implement it as a submodule within my repo, or I could
> only include the kll.py file and add some additional comments pointing to
> your repo and such, whichever you prefer.
>
> Best,
> Michael
>
> --
> From my cell phone.
>
>

Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Posted by Michael Himes <mh...@knights.ucf.edu>.
Lee,

Thanks for all of that information, it's quite helpful to get a better understanding of things.

I've put the code on Github if you'd like to take a look: https://github.com/mdhimes/incubator-datasketches-cpp

Changes are
- new class in kll/include/kll_sketch.hpp, w/ associated constructor in kll/include/kll_sketch_impl.hpp.  This class spawns a specified number of sketches.
- new Python interface functions in python/src/kll_wrapper.cpp

The only new dependency introduced is the pybind11/numpy.h header file.  The new Numpy-compatible Python classes retain identical functionality to the existing classes (with minor changes to method names, e.g., get_min_value --> get_min_values), except that I have not yet implemented merging or (de)serialization.  These would be straight-forward to implement, if needed.
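
For readers following along, usage of the new classes should look roughly like the sketch below; the class name and the update/get_quantiles calls are my guesses from the description above (only get_min_values is named explicitly there), so check the branch for the actual interface:

    import numpy as np
    from datasketches import vector_of_kll_floats_sketches   # hypothetical name

    sks = vector_of_kll_floats_sketches(k=200, d=10_000)       # one KLL sketch per vector element
    for _ in range(1_000_000):                                  # finite stream of input vectors
        sks.update(np.random.rand(10_000).astype(np.float32))
    medians = sks.get_quantiles(0.5)                            # numpy array, one value per element
    mins = sks.get_min_values()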

Re: characterization tests, I'll take a look at those tests you linked to and see about running them, time and compute resources permitting.

Michael
________________________________
From: leerho <le...@gmail.com>
Sent: Sunday, May 10, 2020 5:32 PM
To: dev@datasketches.apache.org <de...@datasketches.apache.org>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Michael,

Is there a place on GitHub somewhere where I could look at your code so far?  The reason I ask, is before you do a PR, we would like to determine where a contribution such as this should be placed.

Our library is split up among different repositories, determined by language and dependencies.  This keeps the user downloads smaller and more focused.   We have two library repos for the core sketch algorithms, one for Java and one for C++/Python, where the dependencies are very lean, which simplifies integration into other systems.  We have separate repos for adaptors, which depend on one of the core repos. On the Java side, we have separate repos for adaptors for Apache Hive and Apache Pig, as the dependencies for each of these are quite large.  For C++, we have a dedicated repo for the adaptors for PostgreSQL.

Some of our adaptors are hosted with the target system.  For example, our Druid adaptors were contributed directly into Apache Druid.

I assume your code has dependencies on Python, NumPy and DataSketches-cpp. It is not clear to me at the moment whether we should create a separate repo for this or have a separate group of directories in our cpp repo.

****
We have a separate repo for our characterization code, which is not formally "released" as an Apache release.  It exists because we want others to be able to reproduce (or challenge) our claims of speed performance or accuracy.  It is the one repo where we have all languages and many different dependencies.  The coding style is not as rigorous or as well documented as our repos that do have formal releases.

Characterization testing is distinctly different from Unit Tests, which basically checks all the main code paths and makes sure that the program works as it should.  The key metric is code coverage and Unit Tests should be fast as it is run on every check-in of new code.  Characterization is also different from Integration Testing, which is testing how well the code works when integrated into larger systems.

Characterization tests are unique to our kind of library. Because our algorithms are probabilistic in nature, in order to verify accuracy or speed performance we need to run many thousands of trials to eliminate statistical noise in the results.  And when the data is large, this can take a long time.  You can peruse our website for many examples as all the plots result from various characterization studies.  What appears on the website is but a small fraction of all the testing we have done.

There are no "standard" tests as every sketch is different so we have to decide what is important to measure for a particular sketch, but the basic groups are speed and accuracy.

For speed there are many possible measurements, but the basic ones are update speed, merge speed, Serialization / Deserialization speed, get estimate or get result speeds.

For accuracy we want to validate that the sketch is performing within the bounds of the theoretical error distribution.  We want to measure this accuracy in the context of a stand-alone, purely streaming sketch and also in the context of merging many sketches together.

We also try to do these same tests comparing the results against other alternatives users might have.  We have performed these same characterizations on other publicly available sketches as well as against traditional, brute-force approaches to solving the same problem.

For the solution you have developed, we would depend on you to decide what properties would be most important to characterize for users of this solution.  It should be very similar to what you would write in a paper describing this solution;  you want to convince the reader that this is very useful and why.

Since the first sketch you have leveraged is the KLL quantiles sketch, I would think you would want some characterizations similar to what we did for our studies <https://datasketches.apache.org/docs/Quantiles/KLLSketch.html> comparing our older quantiles sketch and the KLL sketch.

****
For the Java characterization tests, we have "standardized" on having small configuration files which define the key parameters of the test.  These are simple text files <https://github.com/apache/incubator-datasketches-characterization/tree/master/src/main/resources> of key-value pairs.  We don't have any centralized definition of these pairs, just that they are human readable and intelligible.  They are different for each type of sketch.

For the C++ tests, we don't have a collection of config files yet (this is one of our TODOs), but the same kind of parameters are set in the code itself.

We will likely want to set up a separate directory for your characterization tests.

I hope you find this helpful.

Cheers,

Lee.

On Sun, May 10, 2020 at 10:05 AM Michael Himes <mh...@knights.ucf.edu>> wrote:
The code is in a good state now.  It can take individual values, lists, or Numpy arrays as input, and it returns back Numpy arrays.  There are some additional features, like being able to specify which sketches the user wants to, e.g., get quantiles for.

But, I have only done minor testing with uniform and normal distributions.  I'd like to put it through more extensive testing (and some documentation) before releasing it, and it sounds like your characterization tests are the way to go -- it's not science if it's not reproducible!  Is there a standard set of tests for this purpose?  If not, are there standard tests that have been used for the existing codebase?

Michael
________________________________
From: leerho <le...@gmail.com>>
Sent: Saturday, May 9, 2020 7:21 PM
To: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

This is great.  The first step is to get your project working!  Once you think you are ready, it would be really useful if you could do some characterization testing in the NumPy environment. Characterization tests are what we run to fully understand how a sketch performs over a range of parameters and using thousands to millions of trials.  You can see some of the accuracy and speed performance plots of various sketches on our website.  Sometimes these can take hours to run.  We typically use synthetic data to drive our characterization tests to make them reproducible.

Real data can also be used and one comparison test I would recommend is comparing how long it takes to get approximate results using sketches versus how long it would take to get exact results using brute force methods.  The bigger the data set is the better :)

We don't have much experience with NumPy so this will be a new environment for us.  But before you get too deep into this please get us involved.  We have been characterizing these streaming algorithms for a number of years, and would like to help you.

Cheers,

Lee.

On Sat, May 9, 2020 at 2:18 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
I'm not quite sure what being a committer entails, but yeah I'm happy to contribute.  I can't commit a lot of time to working on it, but with how things went for KLL I don't think it will take a lot of time for the other sketches if they are formatted in a similar manner.  Getting this library integrated into numpy/scipy would be awesome, I'm sure I could get some others in my field to begin using it.

Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Saturday, May 9, 2020 5:06 PM
To: Michael Himes <mh...@knights.ucf.edu>>; dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

This is just awesome!   Would you be interested in becoming a committer on our project?  It is not automatic, but we could work with you to bring you up to speed on the other sketches in the library.  If you could help us integrate DataSketches into NumPy and possibly SciPy (not sure if this is necessary) it would be a very significant contribution and we would definitely want you to be part of our community!

Thanks,

Lee.

On Sat, May 9, 2020 at 1:41 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Lee,

Thanks for the notice, I went ahead and subscribed to the list.

As for Jon's email, this is actually what I have currently implemented!  Once I finish ironing out a couple improvements, I'm going to move some code around to follow the existing coding style, put it on Github, and submit a pull request.

Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Saturday, May 9, 2020 4:22 PM
To: Michael Himes <mh...@knights.ucf.edu>>
Subject: Fwd: Permission to use KLL streaming-quantiles code in free open-source academic software

Hi Michael,
I don't think you saw this email as I doubt you are subscribed to our dev@datasketches.apache.org<ma...@datasketches.apache.org> email list.

We would like to have you as part of our larger community, as others might also have suggestions on how to move your project forward.
You can subscribe by sending an empty email to dev-subscribe@datasketches.apache.org<ma...@datasketches.apache.org>.

Lee.

---------- Forwarded message ---------
From: Jon Malkin <jo...@gmail.com>>
Date: Thu, May 7, 2020 at 4:11 PM
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software
To: <de...@datasketches.apache.org>>
Cc: Lee Rhodes <lr...@verizonmedia.com>>, Edo Liberty <ed...@gmail.com>>, edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>


We're using pybind11 to get a C++ interface with python (vs raw C). The wrappers themselves are quite thin, but they do have examples of calling functions defined in the wrapper as opposed to only the sketch object.

I believe the easiest way to do this will be to define a pretty simple C++ object and create a pybind wrapper for it.  That object would contain a std::vector<kll_sketch>.  Then you'd define an update method for your custom object that iterates through a numpy array and calls update() on the appropriate sketch. You'd also want to define something similar for get_quantile() or whatever other methods you need that iterates through that vector of sketches and returns the result in a numpy array.

That's a pretty lightweight object. And then you'd use a similar thin pybind wrapper around it to make it play nicely with python. Since our C++ library is just templates, you'd end up with a free-standing library, with no requirement that the base datasketches library be involved.
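For concreteness, here is a rough pure-Python mock of the interface described above (the class and method names are made up purely for illustration, and this version keeps the loop in Python; the real thing would hold the std::vector<kll_sketch> and do the per-element loop on the C++ side behind the pybind11 wrapper, which is where the speedup comes from):

import numpy as np
from datasketches import kll_floats_sketch

class VectorOfKllSketches:  # hypothetical name, for illustration only
    """One independent KLL sketch per vector element (dimension)."""

    def __init__(self, k, num_dims):
        self._sketches = [kll_floats_sketch(k) for _ in range(num_dims)]

    def update(self, vector):
        # In the proposed design this loop lives in C++, not Python.
        for sk, value in zip(self._sketches, np.asarray(vector, dtype=np.float32)):
            sk.update(float(value))

    def get_quantiles(self, rank):
        # One quantile per dimension, returned as a NumPy array.
        return np.array([sk.get_quantile(rank) for sk in self._sketches])

# Example usage:
vks = VectorOfKllSketches(k=200, num_dims=10)
for _ in range(1000):
    vks.update(np.random.randn(10))
print(vks.get_quantiles(0.5))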

  jon

On Thu, May 7, 2020 at 1:08 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
I would be happy to share whatever I come up with (if anything).  The lack of a Numpy/Scipy implementation is what led me to the DataSketches library, it would be very useful to myself and others if it were a part of Numpy/Scipy.

For what it's worth, passing in a Numpy array and manipulating it from the C++ side is quite easy.  On the other hand, figuring out how to spawn m sketches and pass the values along to them looks like it'll be more challenging; there is a lot of code here, and it'll take some time for me to familiarize myself with it.

Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Thursday, May 7, 2020 12:00 PM
To: Michael Himes <mh...@knights.ucf.edu>>
Cc: Edo Liberty <ed...@gmail.com>>; dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>; edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

If you do figure out how to do this, it would be great if you could share it with us.  We would like to extend it to other sketches and submit it as an added functionality to NumPy.  I have been looking at the NumPy and SciPy libraries and have not found anything close to what we have.

Lee.


On Thu, May 7, 2020 at 7:08 AM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Lee, Jon,

Thanks for the information.  I tried to vectorize things this morning and ran into that exact problem -- since the offsets can differ, it leads to slices of different lengths, which wouldn't be possible to store as a single Numpy array.

Lee, your understanding of my problem is spot on.  n vectors of size m, where all m elements of each vector are a float (no NaNs or missing values).  I am interested in quantiles at rank r for each of the m streams.  Only 1 sketch will operate simultaneously, saving/loading the sketch is not required (though it would be a nice feature), and sketches would not need to be merged (no serialization/deserialization).

Not surprisingly, it looks like your original suggestion of handling this on the C++ side is the way to go.  Once I have time to dive into the code, my plan is to write something that implements what you described in the earlier email.

Thanks,
Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Wednesday, May 6, 2020 10:43 PM
To: Michael Himes <mh...@knights.ucf.edu>>
Cc: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>; Edo Liberty <ed...@gmail.com>>; edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>

Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Michael,

One of my colleagues, Jon Malkin, pointed out that the vector-KLL will not work for another reason: for each dimension, choosing whether to delete the odd or even values in the compactor must be random and independent of the other dimensions.  Otherwise you might get unwanted correlation effects between the dimensions.
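As a tiny illustration of this point in plain NumPy (purely illustrative; the variable names are made up), the keep-odd/keep-even choice has to be a fresh, independent draw per dimension, not one draw shared across the whole vector:

import numpy as np

rng = np.random.default_rng()
m = 8  # number of dimensions

# Naively vectorized compactor: one shared coin reused for every dimension,
# so all dimensions compact the same way and become correlated.
shared = bool(rng.integers(0, 2))
keep_odd_shared = np.full(m, shared)

# Independent sketches: an independent coin per dimension.
keep_odd_independent = rng.integers(0, 2, size=m).astype(bool)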

This is another argument that you should have independent compactors for each dimension.  So you might as well stick with individual sketches for each dimension.

Lee.

On Wed, May 6, 2020 at 4:39 PM Lee Rhodes <lr...@verizonmedia.com>> wrote:
Michael,

Allow me to back up for a moment to make sure I understand your problem.

You have a large number of large vectors of the form V_n = {x_i}:  n vectors of size m, where x is a number and x_i is the ith element, or equivalently, the ith dimension.

Assumptions:

  *   All vectors, V, are of the same size m.
  *   All elements, x_i, are valid numbers of the same type. No missing values, and if you are using floats, this means no NaNs.

In aggregate, the n vectors represent m independent distributions of values.

Your task is to be able to obtain m quantiles at rank r in a single query.
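For reference, the exact, in-memory version of this task is a one-liner in NumPy; a toy example, purely to pin down the semantics (ignoring the fact that the real data would not fit in memory):

import numpy as np

n, m, r = 1000, 5, 0.9          # toy sizes; the real n is on the order of 1e6 -- 1e8
X = np.random.randn(n, m)       # n vectors, each of dimension m
q = np.quantile(X, r, axis=0)   # m quantiles, one per dimension
print(q.shape)                  # (m,)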

****
To do this, using your idea, would require vectorization of the entire sketch and not just the compactors.  The inputs are vectors, and the results of operations such as getQuantile(r), getQuantileUpperBound(r), and getQuantileLowerBound(r) are also vectors.

This sketch will be a large data structure, which leads to more questions ...

  *   Do you anticipate having many of these vectorized sketches operating simultaneously?
  *   Is there any requirement to store and later retrieve this sketch?
  *   Or, the nearly equivalent question: Do you require merging of these sketches (across clusters, for example)?  Which also means serialization and deserialization.

I am concerned that this vector-quantiles sketch would be limited in the sense that it may not be as widely applicable as it could be.

Our experience with real data is that it is ugly with missing values, NaN, nulls, etc.  Which means we would not be able to vectorize the compactor.  Each dimension i would need a separate independent compactor because the compaction times will vary depending on missing values or NaNs in the data.

Spacewise, I don't think having separate independent sketches for each dimension would be much smaller than vectorizing the entire sketch, because the internals of the existing sketch are already quite space efficient leveraging compact arrays, etc.

As a first step I would favor figuring out how to access the NumPy data structure on the C++ side, having individual sketches for each dimension, and doing the iterations that update the sketches in C++.  This approach has the advantage of leveraging code that already exists, and it would automatically benefit from any improvements to the sketch code over time.  In addition, it could be a prototype of how to integrate other sketches into the NumPy ecosystem.

A fully vectorized sketch would be a separate implementation and would not be able to take advantage of these points.

Lee.

On Wed, May 6, 2020 at 2:47 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Lee,

I don't think there is a problem with the DataSketches library, just that it doesn't support what I am trying to do -- looking in the documentation, it only supports streams of ints or floats, and those situations work fine for me.  Here's what I did:
- began with the KLL test .py file: https://github.com/apache/incubator-datasketches-cpp/blob/master/python/tests/kll_test.py
- replaced line 30 with kll.update(np.ones(10) * randn())  to have a Numpy array of 10 identical values.
- ran the code

This leads to the following error, as expected:
TypeError: update(): incompatible function arguments. The following argument types are supported:
    1. (self: datasketches.kll_floats_sketch, item: float) -> None

Invoked with: <datasketches.kll_floats_sketch object at 0x7f1e128989d0>, array([-1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424,
       -1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424])

It's not coded to support Numpy arrays, so it complains.  What I would ideally like to happen in this scenario is for it to treat each element in the array as a separate stream.  Then, later when getting a given quantile, it would give 10 values, one for each stream.  I don't see an easy approach to implementing this on the Python side besides a very slow iterative approach, and admittedly my C++ is quite rusty, so I haven't looked into the codebase to see how I might modify things there to support this functionality.

Re: the streaming-quantiles code being easily modified, I believe the only necessary changes would be making the Compactor class a subclass of numpy.ndarray, rather than list, and implementing the list-specific methods that are used, like .append().  Then, it isn't necessary to loop over the streams since we can make use of Numpy's broadcasting, which will handle the looping in its C++ code, as you mentioned.  I'll work on this and see if it really is as straightforward as it seems.

If you have any advice on how to use DataSketches for my problem, I'm certainly open to that.

Thanks,
Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Wednesday, May 6, 2020 4:37 PM
To: Michael Himes <mh...@knights.ucf.edu>>; dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Cc: Edo Liberty <ed...@gmail.com>>; edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Michael,

Thank you for considering the DataSketches library.   I am adding this thread to our dev@datasketches.apache.org<ma...@datasketches.apache.org> so that our whole team can contribute to finding a solution for you.

WRT the error you experienced, please help us help you by sharing with us what the exact error was.

We are about to release a major upgrade to the DataSketches C++/Python product in the next few weeks.  We have fixed a number of stability issues and bugs, which may solve the problem.  Nonetheless, we want to work with you to get your problem solved.

Updating 1e5 sketches in a system is not a problem in Java or C++.   We have real-time systems today that generate and process over 1e9 sketches every day.  Unfortunately our experience tells us that looping in Python code will be 10 to 100 times slower than Java or C++.  This is because the code would have to switch from Python to C++ for every vector element.

By comparison, the streaming-quantiles code could be easily modified to use Numpy arrays and operate on vectors.

I would like to understand more about what you have in mind that would be "easily modified".

NumPy achieves its speed performance by doing all of the matrix operations in pre-compiled C++ code.  To achieve best performance, we would want to read and loop through the NumPy data structure on the C++ side leveraging the C++ DataSketches library directly.  I am not sure what would be involved to actually accomplish that.

But first we need to get your Python + NumPy code working correctly with our library so we can find out what its actual performance is.

Cheers,

Lee.





On Wed, May 6, 2020 at 12:10 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Edo, Lee,

Thanks for the prompt response.  I looked at the datasketches library, and while it seems to have a lot more features, it looks like it'll be a lot more difficult to get it to work for my desired use case.

My problem is that I need quantiles for each element of a vector (length on the order of 1e4 -- 1e5), for some finite stream of vectors (on the order of 1e6 -- 1e8).  I tried using datasketches's KLL with Numpy arrays, but it throws an error, so it doesn't seem like datasketches handles this situation currently.

To use datasketches, I think I would need to instantiate 1 object per vector element, and I suspect this will slow things down considerably due to iterating over the objects when each vector is processed.  By comparison, the streaming-quantiles code could be easily modified to use Numpy arrays and operate on vectors.  I ran a few unit tests on both codes and found equivalent behavior, as expected.

Do you have any recommendation(s) for this situation?  Are there known limitations of the streaming-quantiles code that would cause issues for my use case?  Are the other methods offered in datasketches 'better' than the KLL implemented in streaming-quantiles?  I'm quite out of my area of expertise, so I appreciate any advice you can offer, and I will of course acknowledge it in the publication.

Best,
Michael

________________________________
From: Edo Liberty <ed...@gmail.com>>
Sent: Tuesday, May 5, 2020 8:09 PM
To: Lee Rhodes <lr...@verizonmedia.com>>; Michael Himes <mh...@knights.ucf.edu>>
Cc: edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

+Lee

Hi Michael, Thanks for reaching out.
While you can certainly do that, I recommend using the Python-bound datasketches library. It will be more robust, faster, and more bug-free than my code :)

On Tue, May 5, 2020 at 14:11 Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Edo,

I'm currently working on a Python package for machine-learning-accelerated exoplanet modeling.  It is free and open source (see here if you're curious https://github.com/exosports/HOMER), and it's meant purely for reproducible academic research.

I'm adding some new features to the software, and one of them requires computing quantiles for a data set that cannot fit into memory.  After searching around for different methods to do this, your KLL method seemed to be a good option in terms of speed and space requirements.

Rather than reinvent the wheel and code my own implementation of the method from scratch, I was wondering if you'd be willing to allow me to use your code?  I don't see a license, so I wanted to make sure you're okay with this.  I could implement it as a submodule within my repo, or I could only include the kll.py file and add some additional comments pointing to your repo and such, whichever you prefer.

Best,
Michael
--
From my cell phone.

Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Posted by leerho <le...@gmail.com>.
Michael,

Is there a place on GitHub somewhere where I could look at your code so
far?  The reason I ask, is before you do a PR, we would like to determine
where a contribution such as this should be placed.

Our library is split up among different repositories, determined by
language and dependencies.  This keeps the user downloads smaller and more
focused.   We have two library repos for the core sketch algorithms, one
for Java and one for C++/Python, where the dependencies are very lean,
which simplifies integration into other systems.  We have separate repos
for adaptors, which depend on one of the core repos. On the Java side, we
have separate repos for adaptors for Apache Hive and Apache Pig, as the
dependencies for each of these are quite large.  For C++, we have a
dedicated repo for the adaptors for PostgreSQL.

Some of our adaptors are hosted with the target system.  For example, our
Druid adaptors were contributed directly into Apache Druid.

I assume your code has dependencies on Python, NumPy and DataSketches-cpp.
It is not clear to me at the moment whether we should create a separate
repo for this or have a separate group of directories in our cpp repo.

****
We have a separate repo for our characterization code, which is not
formally "released" as an Apache release.  It exists because we want others
to be able to reproduce (or challenge) our claims of speed performance or
accuracy.  It is the one repo where we have all languages and many
different dependencies.  The coding style is not as rigorous or as well
documented as our repos that do have formal releases.

Characterization testing is distinctly different from Unit Tests, which
basically checks all the main code paths and makes sure that the program
works as it should.  The key metric is code coverage and Unit Tests should
be fast as it is run on every check-in of new code.  Characterization is
also different from Integration Testing, which is testing how well the code
works when integrated into larger systems.

Characterization tests are unique to our kind of library. Because our
algorithms are probabilistic in nature, in order to verify accuracy or
speed performance we need to run many thousands of trials to eliminate
statistical noise in the results.  And when the data is large, this can
take a long time.  You can peruse our website for many examples as all the
plots result from various characterization studies.  What appears on the
website is but a small fraction of all the testing we have done.

There are no "standard" tests as every sketch is different so we have to
decide what is important to measure for a particular sketch, but the basic
groups are *speed* and *accuracy*.

For speed there are many possible measurements, but the basic ones are
update speed, merge speed, Serialization / Deserialization speed, get
estimate or get result speeds.
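As a very rough sketch of what an update-speed characterization could look like
in the Python/NumPy setting (the trial count, stream length, and k below are
placeholders; a real characterization would sweep these parameters and run far
more trials):

import time
import numpy as np
from datasketches import kll_floats_sketch

def update_speed_ns(k, stream_len, trials):
    """Average nanoseconds per update, averaged over several trials."""
    rng = np.random.default_rng(0)
    total = 0.0
    for _ in range(trials):
        data = rng.standard_normal(stream_len).astype(np.float32)
        sk = kll_floats_sketch(k)
        t0 = time.perf_counter()
        for x in data:
            sk.update(float(x))
        total += time.perf_counter() - t0
    return 1e9 * total / (trials * stream_len)

print(update_speed_ns(k=200, stream_len=100_000, trials=10), "ns/update")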

For accuracy we want to validate that the sketch is performing within the
bounds of the theoretical error distribution.  We want to measure this
accuracy in the context of a stand-alone, purely streaming sketch and also
in the context of merging many sketches together.

We also try to do these same tests comparing the results against other
alternatives users might have.  We have performed these same
characterizations on other publically available sketches as well as against
traditional, brute-force approaches to solving the same problem.

For the solution you have developed, we would depend on you to decide what
properties would be most important to characterize for users of this
solution.  It should be very similar to what you would write in a paper
describing this solution;  you want to convince the reader that this is
very useful and why.

Since the first sketch you have leveraged is the KLL quantiles sketch, I
would think you would want some characterizations similar to what we did
for our studies
<https://datasketches.apache.org/docs/Quantiles/KLLSketch.html> comparing
our older quantiles sketch and the KLL sketch.

****
For the Java characterization tests, we have "standardized" on having small
configuration files which define the key parameters of the test.
These are simple
text files
<https://github.com/apache/incubator-datasketches-characterization/tree/master/src/main/resources>
of key-value pairs.  We don't have any centralized definition of these
pairs, just that they are human readable and intelligible.  They are
different for each type of sketch.

For the C++ tests, we don't have a collection of config files yet (this is
one of our TODOs), but the same kind of parameters are set in the code
itself.

We will likely want to set up a separate directory for your
characterization tests.

I hope you find this helpful.

Cheers,

Lee.

On Sun, May 10, 2020 at 10:05 AM Michael Himes <mh...@knights.ucf.edu>
wrote:

> The code is in a good state now.  It can take individual values, lists, or
> Numpy arrays as input, and it returns back Numpy arrays.  There are some
> additional features, like being able to specify which sketches the user
> wants to, e.g., get quantiles for.
>
> But, I have only done minor testing with uniform and normal
> distributions.  I'd like to put it through more extensive testing (and some
> documentation) before releasing it, and it sounds like your
> characterization tests are the way to go -- it's not science if it's not
> reproducible!  Is there a standard set of tests for this purpose?  If not,
> are there standard tests that have been used for the existing codebase?
>
> Michael
> ------------------------------
> *From:* leerho <le...@gmail.com>
> *Sent:* Saturday, May 9, 2020 7:21 PM
> *To:* dev@datasketches.apache.org <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> This is great.  The first step is to get your project working!  Once you
> think you are ready, it would be really useful if you could do some
> characterization testing in the NumPy environment. Characterization tests
> are what we run to fully understand how a sketch performs over a range of
> parameters and using thousands to millions of trials.  You can see some of
> the accuracy and speed performance plots of various sketches on our
> website.  Sometimes these can take hours to run.  We typically use
> synthetic data to drive our characterization tests to make them
> reproducible.
>
> Real data can also be used and one comparison test I would recommend is
> comparing how long it takes to get approximate results using sketches
> versus how long it would take to get exact results using brute force
> methods.  The bigger the data set is the better :)
>
> We don't have much experience with NumPy so this will be a new environment
> for us.  But before you get too deep into this please get us involved.  We
> have been characterizing these streaming algorithms for a number of years,
> and would like to help you.
>
> Cheers,
>
> Lee.
>
> On Sat, May 9, 2020 at 2:18 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> I'm not quite sure what being a committer entails, but yeah I'm happy to
> contribute.  I can't commit a lot of time to working on it, but with how
> things went for KLL I don't think it will take a lot of time for the other
> sketches if they are formatted in a similar manner.  Getting this library
> integrated into numpy/scipy would be awesome, I'm sure I could get some
> others in my field to begin using it.
>
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Saturday, May 9, 2020 5:06 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>; dev@datasketches.apache.org
> <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> This is just awesome!   Would you be interested in becoming a committer on
> our project?  It is not automatic, but we could work with you to bring you
> up to speed on the other sketches in the library.  If you could help us
> integrate DataSketches into NumPy and possibly SciPy (not sure if this is
> necessary) it would be a very significant contribution and we would
> definitely want you to be part of our community!
>
> Thanks,
>
> Lee.
>
> On Sat, May 9, 2020 at 1:41 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Lee,
>
> Thanks for the notice, I went ahead and subscribed to the list.
>
> As for Jon's email, this is actually what I have currently implemented!
> Once I finish ironing out a couple improvements, I'm going to move some
> code around to follow the existing coding style, put it on Github, and
> submit a pull request.
>
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Saturday, May 9, 2020 4:22 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>
> *Subject:* Fwd: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Hi Michael,
> I don't think you saw this email as I doubt you are subscribed to our
> dev@datasketches.apache.org email list.
>
> We would like to have you as part of our larger community, as others might
> also have suggestions on how to move your project forward.
> You can subscribe by sending an empty email to
> dev-subscribe@datasketches.apache.org.
>
> Lee.
>
> ---------- Forwarded message ---------
> From: *Jon Malkin* <jo...@gmail.com>
> Date: Thu, May 7, 2020 at 4:11 PM
> Subject: Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
> To: <de...@datasketches.apache.org>
> Cc: Lee Rhodes <lr...@verizonmedia.com>, Edo Liberty <
> edo.liberty@gmail.com>, edo@edoliberty.com <ed...@edoliberty.com>
>
>
> We're using pybind11 to get a C++ interface with python (vs raw C). The
> wrappers themselves are quite thin, but they do have examples of calling
> functions defined in the wrapper as opposed to only the sketch object.
>
> I believe the easiest way to do this will be to define a pretty simple C++
> object and create a pybind wrapper for it.  That object would contain a
> std::vector<kll_sketch>.  Then you'd define an update method for your
> custom object that iterates through a numpy array and calls update() on the
> appropriate sketch. You'd also want to define something similar for
> get_quantile() or whatever other methods you need that iterates through
> that vector of sketches and returns the result in a numpy array.
>
> That's a pretty lightweight object. And then you'd use a similar thin
> pybind wrapper around it to make it play nicely with python. Since our C++
> library is just templates, you'd end up with a free-standing library, with
> no requirement that the base datasketches library be involved.
>
>   jon
>
> On Thu, May 7, 2020 at 1:08 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> I would be happy to share whatever I come up with (if anything).  The lack
> of a Numpy/Scipy implementation is what led me to the DataSketches library,
> it would be very useful to myself and others if it were a part of
> Numpy/Scipy.
>
> For what it's worth, passing in a Numpy array and manipulating it from the
> C++ side is quite easy.  On the other hand, figuring out how to spawn m
> sketches and pass the values along to that looks like it'll be more
> challenging, there is a lot of code here and it'll take some time for me to
> familiarize myself with it.
>
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Thursday, May 7, 2020 12:00 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>
> *Cc:* Edo Liberty <ed...@gmail.com>; dev@datasketches.apache.org <
> dev@datasketches.apache.org>; edo@edoliberty.com <ed...@edoliberty.com>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> If you do figure out how to do this, it would be great if you could share
> it with us.  We would like to extend  it to other sketches and submit it as
> an added functionality to NumPy.  I have been looking at the NumPy and
> SciPy libraries and have not found anything close to what we have.
>
> Lee.
>
>
> On Thu, May 7, 2020 at 7:08 AM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Lee, Jon,
>
> Thanks for the information.  I tried to vectorize things this morning and
> ran into that exact problem -- since the offsets can differ, it leads to
> slices of different lengths, which wouldn't be possible to store as a
> single Numpy array.
>
> Lee, your understanding of my problem is spot on.  n vectors of size m,
> where all m elements of each vector are a float (no NaNs or missing
> values).  I am interested in quantiles at rank r for each of the m
> streams.  Only 1 sketch will operate simultaneously, saving/loading the
> sketch is not required (though it would be a nice feature), and sketches
> would not need to be merged (no serialization/deserialization).
>
> Not surprisingly, it looks like your original suggestion of handling this
> on the C++ side is the way to go.  Once I have time to dive into the code,
> my plan is to write something that implements what you described in the
> earlier email.
>
> Thanks,
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Wednesday, May 6, 2020 10:43 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>
> *Cc:* dev@datasketches.apache.org <de...@datasketches.apache.org>; Edo
> Liberty <ed...@gmail.com>; edo@edoliberty.com <ed...@edoliberty.com>
>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Michael,
>
> One of my colleagues, Jon Malkin, pointed out that the vector-KLL will not
> work for another reason and that is for each dimension, choosing whether to
> delete the odd or even values in the compactor must be random and
> independent of the other dimensions.  Otherwise you might get unwanted
> correlation effects between the dimensions.
>
> This is another argument that you should have independent compactors for
> each dimension.  So you might as well stick with individual sketches for
> each dimension.
>
> Lee.
>
> On Wed, May 6, 2020 at 4:39 PM Lee Rhodes <lr...@verizonmedia.com>
> wrote:
>
> Michael,
>
> Allow me to back up for a moment to make sure I understand your problem.
>
> You have a large number of large vectors of the form *V_n = {x_i}:*  *n*
> vectors of size *m, *where *x* is a *number* and *x_i* is the *i*th
> element, or equivalently, the *i*th dimension.
>
> Assumptions:
>
>    - All vectors, *V*, are of the same size *m.*
>    - All elements, *x_i*, are valid numbers of the same type. No missing
>    values, and if you are using *floats*, this means no *NaN*s.
>
> In aggregate, the *n* vectors represent *m* *independent* distributions
> of values.
>
> Your task is to be able to obtain *m* quantiles at rank *r* in a single
> query.
>
> ****
> To do this, using your idea, would require vectorization of the entire
> sketch and not just the compactors.  The inputs are vectors, the result of
> operations such as getQuantile(r), getQuantileUpperBound(r),
> getQuantileLowerBound(r), are also vectors.
>
> This sketch will be a large data structure, which leads to more questions
> ...
>
>    - Do you anticipate having many of these vectorized sketches operating
>    simultaneously?
>    - Is there any requirement to store and later retrieve this sketch?
>    - Or, the nearly equivalent question: Do you require merging of these
>    sketches (across clusters, for example)?  Which also means serialization
>    and deserialization.
>
> I am concerned that this vector-quantiles sketch would be limited in the
> sense that it may not be as widely applicable as it could be.
>
> Our experience with real data is that it is ugly with missing values, NaN,
> nulls, etc.  Which means we would not be able to vectorize the compactor.
> Each dimension *i* would need a separate independent compactor because
> the compaction times will vary depending on missing values or NaNs in the
> data.
>
> Spacewise, I don't think having separate independent sketches for each
> dimension would be much smaller than vectorizing the entire sketch, because
> the internals of the existing sketch are already quite space efficient
> leveraging compact arrays, etc.
>
> As a first step I would favor figuring out how to access the NumPy data
> structure on the C++ side, having individual sketches for each
> dimension, and doing the iterations updating the sketches in C++.   It also
> has the advantage of leveraging code that exists and it would automatically
> be able to leverage any improvements to the sketch code over time.  In
> addition, it could be a prototype of how to integrate other sketches into
> the NumPy ecosystem.
>
> A fully vectorized sketch would be a separate implementation and would not
> be able to take advantage of these points.
>
> Lee.
>
> On Wed, May 6, 2020 at 2:47 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Lee,
>
> I don't think there is a problem with the DataSketches library, just that
> it doesn't support what I am trying to do -- looking in the documentation,
> it only supports streams of ints or floats, and those situations work fine
> for me.  Here's what I did:
> - began with the KLL test .py file:
> https://github.com/apache/incubator-datasketches-cpp/blob/master/python/tests/kll_test.py
> - replaced line 30 with kll.update(np.ones(10) * randn())  to have a Numpy
> array of 10 identical values.
> - ran the code
>
> This leads to the following error, as expected:
> TypeError: update(): incompatible function arguments. The following
> argument types are supported:
>     1. (self: datasketches.kll_floats_sketch, item: float) -> None
>
> Invoked with: <datasketches.kll_floats_sketch object at 0x7f1e128989d0>,
> array([-1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424,
>        -1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424])
>
> It's not coded to support Numpy arrays, therefore it complains.  What I
> would ideally like to have happen in this scenario is it would treat each
> element in the array as a separate stream.  Then, later when getting a
> given quantile, it would give 10 values, one for each stream.  I don't see
> an easy approach to implementing this on the Python side besides a very
> slow iterative approach, and admittedly my C++ is quite rusty so I haven't
> looked into the codebase to see how I might modify things there to support
> this functionality.
>
> Re: the streaming-quantiles code being easily modified, I believe the only
> necessary changes would be changing the Compactor class to be a subclass of
> numpy.ndarray, rather than list, and implementing methods for the
> list-specific methods that are used, like .append().  Then, it isn't
> necessary to loop over the streams since we can make use of Numpy's
> broadcasting, which will handle the looping in its C++ code, as you
> mentioned.  I'll work on this and see if it really is as straight-forward
> as it seems.
>
> If you have any advice on how to use DataSketches for my problem, I'm
> certainly open to that.
>
> Thanks,
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Wednesday, May 6, 2020 4:37 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>; dev@datasketches.apache.org
> <de...@datasketches.apache.org>
> *Cc:* Edo Liberty <ed...@gmail.com>; edo@edoliberty.com <
> edo@edoliberty.com>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Michael,
>
> Thank you for considering the DataSketches library.   I am adding this
> thread to our dev@datasketches.apache.org so that our whole team can
> contribute to finding a solution for you.
>
> WRT the error you experienced, please help us help you by sharing with us
> what the exact error was.
>
> We are about to release a major upgrade to the DataSketches C++/Python
> product in the next few weeks.  We have fixed a number of stability issues
> and bugs, which may solve the problem.  Nonetheless, we want to work with
> you to get your problem solved.
>
> Updating 1e5 sketches in a system is not a problem in Java or C++.   We
> have real-time systems today that generate and process over 1e9 sketches
> every day.  Unfortunately our experience tells us that looping in Python
> code will be 10 to 100 times slower than Java or C++.  This is because the
> code would have to switch from Python to C++ for every vector element.
>
> By comparison, the streaming-quantiles code could be easily modified to
> use Numpy arrays and operate on vectors.
>
>
> I would like to understand more about what you have in mind that would be
> "easily modified".
>
> NumPy achieves its speed performance by doing all of the matrix operations
> in pre-compiled C++ code.  To achieve best performance, we would want to
> read and loop through the NumPy data structure on the C++ side leveraging
> the C++ DataSketches library directly.  I am not sure what would be
> involved to actually accomplish that.
>
> But first we need to get your Python + NumPy code working correctly with
> our library so we can find out what its actual performance is.
>
> Cheers,
>
> Lee.
>
>
>
>
>
> On Wed, May 6, 2020 at 12:10 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Edo, Lee,
>
> Thanks for the prompt response.  I looked at the datasketches library, and
> while it seems to have a lot more features, it looks like it'll be a lot
> more difficult to get it to work for my desired use case.
>
> My problem is that I need quantiles for each element of a vector (length
> on the order of 1e4 -- 1e5), for some finite stream of vectors (on the
> order of 1e6 -- 1e8).  I tried using datasketches's KLL with Numpy arrays,
> but it throws an error, so it doesn't seem like datasketches handles this
> situation currently.
>
> To use datasketches, I think I would need to instantiate 1 object per
> vector element, and I suspect this will slow things down considerably due
> to iterating over the objects when each vector is processed.  By
> comparison, the streaming-quantiles code could be easily modified to use
> Numpy arrays and operate on vectors.  I ran a few unit tests on both codes
> and found equivalent behavior, as expected.
>
> Do you have any recommendation(s) for this situation?  Are there known
> limitations of the streaming-quantiles code that would cause issues for my
> use case?  Are the other methods offered in datasketches 'better' than the
> KLL implemented in streaming-quantiles?  I'm quite out of my area of
> expertise, so I appreciate any advice you can offer, and I will of course
> acknowledge it in the publication.
>
> Best,
> Michael
>
> ------------------------------
> *From:* Edo Liberty <ed...@gmail.com>
> *Sent:* Tuesday, May 5, 2020 8:09 PM
> *To:* Lee Rhodes <lr...@verizonmedia.com>; Michael Himes <
> mhimes@knights.ucf.edu>
> *Cc:* edo@edoliberty.com <ed...@edoliberty.com>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> +Lee
>
> Hi Michael, Thanks for reaching out.
> While you can certainly do that, I recommend using the python-Binded
> datasketches library. It will be more robust, faster, and bug free than my
> code :)
>
> On Tue, May 5, 2020 at 14:11 Michael Himes <mh...@knights.ucf.edu> wrote:
>
> Hi Edo,
>
> I'm currently working on a Python package for machine-learning-accelerated
> exoplanet modeling.  It is free and open source (see here if you're curious
> https://github.com/exosports/HOMER),
> and it's meant purely for reproducible academic research.
>
> I'm adding some new features to the software, and one of them requires
> computing quantiles for a data set that cannot fit into memory.  After
> searching around for different methods to do this, your KLL method seemed
> to be a good option in terms of speed and space requirements.
>
> Rather than reinvent the wheel and code my own implementation of the
> method from scratch, I was wondering if you'd be willing to allow me to use
> your code?  I don't see a license, so I wanted to make sure you're okay
> with this.  I could implement it as a submodule within my repo, or I could
> only include the kll.py file and add some additional comments pointing to
> your repo and such, whichever you prefer.
>
> Best,
> Michael
>
> --
> From my cell phone.
>
>

Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Posted by Michael Himes <mh...@knights.ucf.edu>.
The code is in a good state now.  It can take individual values, lists, or Numpy arrays as input, and it returns back Numpy arrays.  There are some additional features, like being able to specify which sketches the user wants to, e.g., get quantiles for.

But, I have only done minor testing with uniform and normal distributions.  I'd like to put it through more extensive testing (and some documentation) before releasing it, and it sounds like your characterization tests are the way to go -- it's not science if it's not reproducible!  Is there a standard set of tests for this purpose?  If not, are there standard tests that have been used for the existing codebase?

Michael
________________________________
From: leerho <le...@gmail.com>
Sent: Saturday, May 9, 2020 7:21 PM
To: dev@datasketches.apache.org <de...@datasketches.apache.org>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

This is great.  The first step is to get your project working!  Once you think you are ready, it would be really useful if you could do some characterization testing in the NumPy environment. Characterization tests are what we run to fully understand how a sketch performs over a range of parameters and using thousands to millions of trials.  You can see some of the accuracy and speed performance plots of various sketches on our website.  Sometimes these can take hours to run.  We typically use synthetic data to drive our characterization tests to make them reproducible.

Real data can also be used and one comparison test I would recommend is comparing how long it takes to get approximate results using sketches versus how long it would take to get exact results using brute force methods.  The bigger the data set is the better :)

We don't have much experience with NumPy so this will be a new environment for us.  But before you get too deep into this please get us involved.  We have been characterizing these streaming algorithms for a number of years, and would like to help you.

Cheers,

Lee.

On Sat, May 9, 2020 at 2:18 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
I'm not quite sure what being a committer entails, but yeah I'm happy to contribute.  I can't commit a lot of time to working on it, but with how things went for KLL I don't think it will take a lot of time for the other sketches if they are formatted in a similar manner.  Getting this library integrated into numpy/scipy would be awesome, I'm sure I could get some others in my field to begin using it.

Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Saturday, May 9, 2020 5:06 PM
To: Michael Himes <mh...@knights.ucf.edu>>; dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

This is just awesome!   Would you be interested in becoming a committer on our project?  It is not automatic, but we could work with you to bring you up to speed on the other sketches in the library.  If you could help us integrate DataSketches into NumPy and possibly SciPy (not sure if this is necessary) it would be a very significant contribution and we would definitely want you to be part of our community!

Thanks,

Lee.

On Sat, May 9, 2020 at 1:41 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Lee,

Thanks for the notice, I went ahead and subscribed to the list.

As for Jon's email, this is actually what I have currently implemented!  Once I finish ironing out a couple improvements, I'm going to move some code around to follow the existing coding style, put it on Github, and submit a pull request.

Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Saturday, May 9, 2020 4:22 PM
To: Michael Himes <mh...@knights.ucf.edu>>
Subject: Fwd: Permission to use KLL streaming-quantiles code in free open-source academic software

Hi Michael,
I don't think you saw this email as I doubt you are subscribed to our dev@datasketches.apache.org<ma...@datasketches.apache.org> email list.

We would like to have you as part of our larger community, as others might also have suggestions on how to move your project forward.
You can subscribe by sending an empty email to dev-subscribe@datasketches.apache.org<ma...@datasketches.apache.org>.

Lee.

---------- Forwarded message ---------
From: Jon Malkin <jo...@gmail.com>>
Date: Thu, May 7, 2020 at 4:11 PM
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software
To: <de...@datasketches.apache.org>>
Cc: Lee Rhodes <lr...@verizonmedia.com>>, Edo Liberty <ed...@gmail.com>>, edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>


We're using pybind11 to get a C++ interface with python (vs raw C). The wrappers themselves are quite thin, but they do have examples of calling functions defined in the wrapper as opposed to only the sketch object.

I believe the easiest way to do this will be to define a pretty simple C++ object and create a pybind wrapper for it.  That object would contain a std::vector<kll_sketch>.  Then you'd define an update method for your custom object that iterates through a numpy array and calls update() on the appropriate sketch. You'd also want to define something similar for get_quantile() or whatever other methods you need that iterates through that vector of sketches and returns the result in a numpy array.

That's a pretty lightweight object. And then you'd use a similar thin pybind wrapper around it to make it play nicely with python. Since our C++ library is just templates, you'd end up with a free-standing library, with no requirement that the base datasketches library be involved.

  jon

On Thu, May 7, 2020 at 1:08 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
I would be happy to share whatever I come up with (if anything).  The lack of a Numpy/Scipy implementation is what led me to the DataSketches library, it would be very useful to myself and others if it were a part of Numpy/Scipy.

For what it's worth, passing in a Numpy array and manipulating it from the C++ side is quite easy.  On the other hand, figuring out how to spawn m sketches and pass the values along to that looks like it'll be more challenging, there is a lot of code here and it'll take some time for me to familiarize myself with it.

Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Thursday, May 7, 2020 12:00 PM
To: Michael Himes <mh...@knights.ucf.edu>>
Cc: Edo Liberty <ed...@gmail.com>>; dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>; edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

If you do figure out how to do this, it would be great if you could share it with us.  We would like to extend it to other sketches and submit it as an added functionality to NumPy.  I have been looking at the NumPy and SciPy libraries and have not found anything close to what we have.

Lee.


On Thu, May 7, 2020 at 7:08 AM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Lee, Jon,

Thanks for the information.  I tried to vectorize things this morning and ran into that exact problem -- since the offsets can differ, it leads to slices of different lengths, which wouldn't be possible to store as a single Numpy array.

Lee, your understanding of my problem is spot on.  n vectors of size m, where all m elements of each vector are a float (no NaNs or missing values).  I am interested in quantiles at rank r for each of the m streams.  Only 1 sketch will operate simultaneously, saving/loading the sketch is not required (though it would be a nice feature), and sketches would not need to be merged (no serialization/deserialization).

Not surprisingly, it looks like your original suggestion of handling this on the C++ side is the way to go.  Once I have time to dive into the code, my plan is to write something that implements what you described in the earlier email.

Thanks,
Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Wednesday, May 6, 2020 10:43 PM
To: Michael Himes <mh...@knights.ucf.edu>>
Cc: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>; Edo Liberty <ed...@gmail.com>>; edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>

Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Michael,

One of my colleagues, Jon Malkin, pointed out that the vector-KLL will not work for another reason and that is for each dimension, choosing whether to delete the odd or even values in the compactor must be random and independent of the other dimensions.  Otherwise you might get unwanted correlation effects between the dimensions.

This is another argument that you should have independent compactors for each dimension.  So you might as well stick with individual sketches for each dimension.

Lee.

On Wed, May 6, 2020 at 4:39 PM Lee Rhodes <lr...@verizonmedia.com>> wrote:
Michael,

Allow me to back up for a moment to make sure I understand your problem.

You have a large number of large vectors of the form V_n = {x_i}:  n vectors of size m, where x is a number and x_i is the ith element, or equivalently, the ith dimension.

Assumptions:

  *   All vectors, V, are of the same size m.
  *   All elements, x_i, are valid numbers of the same type. No missing values, and if you are using floats, this means no NaNs.

In aggregate, the n vectors represent m independent distributions of values.

Your task is to be able to obtain m quantiles at rank r in a single query.

****
To do this, using your idea, would require vectorization of the entire sketch and not just the compactors.  The inputs are vectors, the result of operations such as getQuantile(r), getQuantileUpperBound(r), getQuantileLowerBound(r), are also vectors.

This sketch will be a large data structure, which leads to more questions ...

  *   Do you anticipate having many of these vectorized sketches operating simultaneously?
  *   Is there any requirement to store and later retrieve this sketch?
  *   Or, the nearly equivalent question: Do you require merging of these sketches (across clusters, for example)?  Which also means serialization and deserialization.

I am concerned that this vector-quantiles sketch would be limited in the sense that it may not be as widely applicable as it could be.

Our experience with real data is that it is ugly with missing values, NaN, nulls, etc.  Which means we would not be able to vectorize the compactor.  Each dimension i would need a separate independent compactor because the compaction times will vary depending on missing values or NaNs in the data.

Spacewise, I don't think having separate independent sketches for each dimension would be much smaller than vectorizing the entire sketch, because the internals of the existing sketch are already quite space efficient leveraging compact arrays, etc.

As a first step I would favor figuring out how to access the NumPy data structure on the C++ side, having individual sketches for each dimension, and doing the iterations updating the sketches in C++.   It also has the advantage of leveraging code that exists and it would automatically be able to leverage any improvements to the sketch code over time.  In addition, it could be a prototype of how to integrate other sketches into the NumPy ecosystem.

A fully vectorized sketch would be a separate implementation and would not be able to take advantage of these points.

Lee.

On Wed, May 6, 2020 at 2:47 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Lee,

I don't think there is a problem with the DataSketches library, just that it doesn't support what I am trying to do -- looking in the documentation, it only supports streams of ints or floats, and those situations work fine for me.  Here's what I did:
- began with the KLL test .py file: https://github.com/apache/incubator-datasketches-cpp/blob/master/python/tests/kll_test.py
- replaced line 30 with kll.update(np.ones(10) * randn())  to have a Numpy array of 10 identical values.
- ran the code

This leads to the following error, as expected:
TypeError: update(): incompatible function arguments. The following argument types are supported:
    1. (self: datasketches.kll_floats_sketch, item: float) -> None

Invoked with: <datasketches.kll_floats_sketch object at 0x7f1e128989d0>, array([-1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424,
       -1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424])

It's not coded to support Numpy arrays, therefore it complains.  What I would ideally like to have happen in this scenario is it would treat each element in the array as a separate stream.  Then, later when getting a given quantile, it would give 10 values, one for each stream.  I don't see an easy approach to implementing this on the Python side besides a very slow iterative approach, and admittedly my C++ is quite rusty so I haven't looked into the codebase to see how I might modify things there to support this functionality.

Re: the streaming-quantiles code being easily modified, I believe the only necessary changes would be making the Compactor class a subclass of numpy.ndarray rather than list, and implementing the list-specific methods that are used, like .append().  Then it isn't necessary to loop over the streams, since we can make use of Numpy's broadcasting, which handles the looping in its C++ code, as you mentioned.  I'll work on this and see if it really is as straightforward as it seems.
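
As a toy illustration of the broadcasting idea (not the actual Compactor code), here is the "keep every other item" step done once per stream versus once for all streams, assuming all streams share the same buffer length:

import numpy as np

m = 5                                   # toy number of streams
buffer = np.random.randn(8, m)          # 8 buffered items per stream (one column each)

# list-style: one Python-level slice per stream
kept_loop = np.array([buffer[::2, i] for i in range(m)]).T

# broadcast-style: every other row across all streams in a single C-level operation
kept_vec = buffer[::2, :]

assert np.allclose(kept_loop, kept_vec)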

If you have any advice on how to use DataSketches for my problem, I'm certainly open to that.

Thanks,
Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Wednesday, May 6, 2020 4:37 PM
To: Michael Himes <mh...@knights.ucf.edu>>; dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Cc: Edo Liberty <ed...@gmail.com>>; edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Michael,

Thank you for considering the DataSketches library.   I am adding this thread to our dev@datasketches.apache.org<ma...@datasketches.apache.org> so that our whole team can contribute to finding a solution for you.

WRT the error you experienced, please help us help you by sharing with us what the exact error was.

We are about to release a major upgrade to the DataSketches C++/Python product in the next few weeks.  We have fixed a number of stability issues and bugs, which may solve the problem.  Nonetheless, we want to work with you to get your problem solved.

Updating 1e5 sketches in a system is not a problem in Java or C++.   We have real-time systems today that generate and process over 1e9 sketches every day.  Unfortunately our experience tells us that looping in Python code will be 10 to 100 times slower than Java or C++.  This is because the code would have to switch from Python to C++ for every vector element.

By comparison, the streaming-quantiles code could be easily modified to use Numpy arrays and operate on vectors.

I would like to understand more about what you have in mind that would be "easily modified".

NumPy achieves its speed performance by doing all of the matrix operations in pre-compiled C++ code.  To achieve best performance, we would want to read and loop through the NumPy data structure on the C++ side leveraging the C++ DataSketches library directly.  I am not sure what would be involved to actually accomplish that.

But first we need to get your Python + NumPy code working correctly with our library so we can find out what its actual performance is.

Cheers,

Lee.





On Wed, May 6, 2020 at 12:10 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Edo, Lee,

Thanks for the prompt response.  I looked at the datasketches library, and while it seems to have a lot more features, it looks like it'll be a lot more difficult to get it to work for my desired use case.

My problem is that I need quantiles for each element of a vector (length on the order of 1e4 -- 1e5), for some finite stream of vectors (on the order of 1e6 -- 1e8).  I tried using datasketches's KLL with Numpy arrays, but it throws an error, so it doesn't seem like datasketches handles this situation currently.

To use datasketches, I think I would need to instantiate 1 object per vector element, and I suspect this will slow things down considerably due to iterating over the objects when each vector is processed.  By comparison, the streaming-quantiles code could be easily modified to use Numpy arrays and operate on vectors.  I ran a few unit tests on both codes and found equivalent behavior, as expected.

Do you have any recommendation(s) for this situation?  Are there known limitations of the streaming-quantiles code that would cause issues for my use case?  Are the other methods offered in datasketches 'better' than the KLL implemented in streaming-quantiles?  I'm quite out of my area of expertise, so I appreciate any advice you can offer, and I will of course acknowledge it in the publication.

Best,
Michael

________________________________
From: Edo Liberty <ed...@gmail.com>>
Sent: Tuesday, May 5, 2020 8:09 PM
To: Lee Rhodes <lr...@verizonmedia.com>>; Michael Himes <mh...@knights.ucf.edu>>
Cc: edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

+Lee

Hi Michael, Thanks for reaching out.
While you can certainly do that, I recommend using the Python-bound datasketches library. It will be more robust, faster, and less buggy than my code :)

On Tue, May 5, 2020 at 14:11 Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Edo,

I'm currently working on a Python package for machine-learning-accelerated exoplanet modeling.  It is free and open source (see here if you're curious https://github.com/exosports/HOMER), and it's meant purely for reproducible academic research.

I'm adding some new features to the software, and one of them requires computing quantiles for a data set that cannot fit into memory.  After searching around for different methods to do this, your KLL method seemed to be a good option in terms of speed and space requirements.

Rather than reinvent the wheel and code my own implementation of the method from scratch, I was wondering if you'd be willing to allow me to use your code?  I don't see a license, so I wanted to make sure you're okay with this.  I could implement it as a submodule within my repo, or I could only include the kll.py file and add some additional comments pointing to your repo and such, whichever you prefer.

Best,
Michael
--
From my cell phone.

Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Posted by leerho <le...@gmail.com>.
This is great.  The first step is to get your project working!  Once you
think you are ready, it would be really useful if you could do some
characterization testing in the NumPy environment. Characterization tests
are what we run to fully understand how a sketch performs over a range of
parameters and using thousands to millions of trials.  You can see some of
the accuracy and speed performance plots of various sketches on our
website.  Sometimes these can take hours to run.  We typically use
synthetic data to drive our characterization tests to make them
reproducible.

Real data can also be used and one comparison test I would recommend is
comparing how long it takes to get approximate results using sketches
versus how long it would take to get exact results using brute force
methods.  The bigger the data set is the better :)
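
As a very rough template for that kind of comparison (a sketch only; method names follow the datasketches Python bindings, and the Python-level update loop will dominate the timing until a C++ path exists):

import time
import numpy as np
from datasketches import kll_floats_sketch

rng = np.random.default_rng(0)
data = rng.standard_normal(1_000_000).astype(np.float32)

# brute force: exact answer, but requires holding (and sorting) the full data set
t0 = time.perf_counter()
exact = np.quantile(data, 0.99)
brute_seconds = time.perf_counter() - t0

# sketch: one pass, bounded memory, approximate answer
sk = kll_floats_sketch(200)
t0 = time.perf_counter()
for x in data:                          # Python loop; this is the slow part today
    sk.update(float(x))
approx = sk.get_quantile(0.99)
sketch_seconds = time.perf_counter() - t0

print(f"exact={exact:.4f} ({brute_seconds:.2f}s)  "
      f"approx={approx:.4f} ({sketch_seconds:.2f}s)")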

We don't have much experience with NumPy so this will be a new environment
for us.  But before you get too deep into this please get us involved.  We
have been characterizing these streaming algorithms for a number of years,
and would like to help you.

Cheers,

Lee.

On Sat, May 9, 2020 at 2:18 PM Michael Himes <mh...@knights.ucf.edu> wrote:

> I'm not quite sure what being a committer entails, but yeah I'm happy to
> contribute.  I can't commit a lot of time to working on it, but with how
> things went for KLL I don't think it will take a lot of time for the other
> sketches if they are formatted in a similar manner.  Getting this library
> integrated into numpy/scipy would be awesome, I'm sure I could get some
> others in my field to begin using it.
>
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Saturday, May 9, 2020 5:06 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>; dev@datasketches.apache.org
> <de...@datasketches.apache.org>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> This is just awesome!   Would you be interested in becoming a committer on
> our project?  It is not automatic, but we could work with you to bring you
> up to speed on the other sketches in the library.  If you could help us
> integrate DataSketches into NumPy and possibly SciPy (not sure if this is
> necessary) it would be a very significant contribution and we would
> definitely want you to be part of our community!
>
> Thanks,
>
> Lee.
>
> On Sat, May 9, 2020 at 1:41 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Lee,
>
> Thanks for the notice, I went ahead and subscribed to the list.
>
> As for Jon's email, this is actually what I have currently implemented!
> Once I finish ironing out a couple improvements, I'm going to move some
> code around to follow the existing coding style, put it on Github, and
> submit a pull request.
>
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Saturday, May 9, 2020 4:22 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>
> *Subject:* Fwd: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Hi Michael,
> I don't think you saw this email as I doubt you are subscribed to our
> dev@datasketches.apache.org email list.
>
> We would like to have you as part of our larger community, as others might
> also have suggestions on how to move your project forward.
> You can subscribe by sending an empty email to
> dev-subscribe@datasketches.apache.org.
>
> Lee.
>
> ---------- Forwarded message ---------
> From: *Jon Malkin* <jo...@gmail.com>
> Date: Thu, May 7, 2020 at 4:11 PM
> Subject: Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
> To: <de...@datasketches.apache.org>
> Cc: Lee Rhodes <lr...@verizonmedia.com>, Edo Liberty <
> edo.liberty@gmail.com>, edo@edoliberty.com <ed...@edoliberty.com>
>
>
> We're using pybind11 to get a C++ interface with python (vs raw C). The
> wrappers themselves are quite thin, but they do have examples of calling
> functions defined in the wrapper as opposed to only the sketch object.
>
> I believe the easiest way to do this will be to define a pretty simple C++
> object and create a pybind wrapper for it.  That object would contain a
> std::vector<kll_sketch>.  Then you'd define an update method for your
> custom object that iterates through a numpy array and calls update() on the
> appropriate sketch. You'd also want to define something similar for
> get_quantile() or whatever other methods you need that iterates through
> that vector of sketches and returns the result in a numpy array.
>
> That's a pretty lightweight object. And then you'd use a similar thin
> pybind wrapper around it to make it play nicely with python. Since our C++
> library is just templates, you'd end up with a free-standing library, with
> no requirement that the base datasketches library be involved.
>
>   jon
>
> On Thu, May 7, 2020 at 1:08 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> I would be happy to share whatever I come up with (if anything).  The lack
> of a Numpy/Scipy implementation is what led me to the DataSketches library,
> it would be very useful to myself and others if it were a part of
> Numpy/Scipy.
>
> For what it's worth, passing in a Numpy array and manipulating it from the
> C++ side is quite easy.  On the other hand, figuring out how to spawn m
> sketches and pass the values along to that looks like it'll be more
> challenging, there is a lot of code here and it'll take some time for me to
> familiarize myself with it.
>
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Thursday, May 7, 2020 12:00 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>
> *Cc:* Edo Liberty <ed...@gmail.com>; dev@datasketches.apache.org <
> dev@datasketches.apache.org>; edo@edoliberty.com <ed...@edoliberty.com>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> If you do figure out how to do this, it would be great if you could share
> it with us.  We would like to extend  it to other sketches and submit it as
> an added functionality to NumPy.  I have been looking at the NumPy and
> SciPy libraries and have not found anything close to what we have.
>
> Lee.
>
>
> On Thu, May 7, 2020 at 7:08 AM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Lee, Jon,
>
> Thanks for the information.  I tried to vectorize things this morning and
> ran into that exact problem -- since the offsets can differ, it leads to
> slices of different lengths, which wouldn't be possible to store as a
> single Numpy array.
>
> Lee, your understanding of my problem is spot on.  n vectors of size m,
> where all m elements of each vector are a float (no NaNs or missing
> values).  I am interested in quantiles at rank r for each of the m
> streams.  Only 1 sketch will operate simultaneously, saving/loading the
> sketch is not required (though it would be a nice feature), and sketches
> would not need to be merged (no serialization/deserialization).
>
> Not surprisingly, it looks like your original suggestion of handling this
> on the C++ side is the way to go.  Once I have time to dive into the code,
> my plan is to write something that implements what you described in the
> earlier email.
>
> Thanks,
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Wednesday, May 6, 2020 10:43 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>
> *Cc:* dev@datasketches.apache.org <de...@datasketches.apache.org>; Edo
> Liberty <ed...@gmail.com>; edo@edoliberty.com <ed...@edoliberty.com>
>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Michael,
>
> One of my colleagues, Jon Malkin, pointed out that the vector-KLL will not
> work for another reason and that is for each dimension, choosing whether to
> delete the odd or even values in the compactor must be random and
> independent of the other dimensions.  Otherwise you might get unwanted
> correlation effects between the dimensions.
>
> This is another argument that you should have independent compactors for
> each dimension.  So you might as well stick with individual sketches for
> each dimension.
>
> Lee.
>
> On Wed, May 6, 2020 at 4:39 PM Lee Rhodes <lr...@verizonmedia.com>
> wrote:
>
> Michael,
>
> Allow me to back up for a moment to make sure I understand your problem.
>
> You have a large number of large vectors of the form *V_n = {x_i}:*  *n*
> vectors of size *m, *where *x* is a *number* and *x_i* is the *i*th
> element, or equivalently, the *i*th dimension.
>
> Assumptions:
>
>    - All vectors, *V*, are of the same size *m.*
>    - All elements, *x_i*, are valid numbers of the same type. No missing
>    values, and if you are using *floats*, this means no *NaN*s.
>
> In aggregate, the *n* vectors represent *m* *independent* distributions
> of values.
>
> Your task is to be able to obtain *m* quantiles at rank *r* in a single
> query.
>
> ****
> To do this, using your idea, would require vectorization of the entire
> sketch and not just the compactors.  The inputs are vectors, the result of
> operations such as getQuantile(r), getQuantileUpperBound(r),
> getQuantileLowerBound(r), are also vectors.
>
> This sketch will be a large data structure, which leads to more questions
> ...
>
>    - Do you anticipate having many of these vectorized sketches operating
>    simultaneously?
>    - Is there any requirement to store and later retrieve this sketch?
>    - Or, the nearly equivalent question: Do you require merging of these
>    sketches (across clusters, for example)?  Which also means serialization
>    and deserialization.
>
> I am concerned that this vector-quantiles sketch would be limited in the
> sense that it may not be as widely applicable as it could be.
>
> Our experience with real data is that it is ugly with missing values, NaN,
> nulls, etc.  Which means we would not be able to vectorize the compactor.
> Each dimension *i* would need a separate independent compactor because
> the compaction times will vary depending on missing values or NaNs in the
> data.
>
> Spacewise, I don't think having separate independent sketches for each
> dimension would be much smaller than vectorizing the entire sketch, because
> the internals of the existing sketch are already quite space efficient
> leveraging compact arrays, etc.
>
> As a first step I would favor figuring out how to access the NumPy data
> structure on the C++ side, having individual sketches for each
> dimension, and doing the iterations updating the sketches in C++.   It also
> has the advantage of leveraging code that exists and it would automatically
> be able to leverage any improvements to the sketch code over time.  In
> addition, it could be a prototype of how to integrate other sketches into
> the NumPy ecosystem.
>
> A fully vectorized sketch would be a separate implementation and would not
> be able to take advantage of these points.
>
> Lee.
>
> On Wed, May 6, 2020 at 2:47 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Lee,
>
> I don't think there is a problem with the DataSketches library, just that
> it doesn't support what I am trying to do -- looking in the documentation,
> it only supports streams of ints or floats, and those situations work fine
> for me.  Here's what I did:
> - began with the KLL test .py file:
> https://github.com/apache/incubator-datasketches-cpp/blob/master/python/tests/kll_test.py
> - replaced line 30 with kll.update(np.ones(10) * randn())  to have a Numpy
> array of 10 identical values.
> - ran the code
>
> This leads to the following error, as expected:
> TypeError: update(): incompatible function arguments. The following
> argument types are supported:
>     1. (self: datasketches.kll_floats_sketch, item: float) -> None
>
> Invoked with: <datasketches.kll_floats_sketch object at 0x7f1e128989d0>,
> array([-1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424,
>        -1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424])
>
> It's not coded to support Numpy arrays, therefore it complains.  What I
> would ideally like to have happen in this scenario is it would treat each
> element in the array as a separate stream.  Then, later when getting a
> given quantile, it would give 10 values, one for each stream.  I don't see
> an easy approach to implementing this on the Python side besides a very
> slow iterative approach, and admittedly my C++ is quite rusty so I haven't
> looked into the codebase to see how I might modify things there to support
> this functionality.
>
> Re: the streaming-quantiles code being easily modified, I believe the only
> necessary changes would be changing the Compactor class to be a subclass of
> numpy.ndarray, rather than list, and implementing methods for the
> list-specific methods that are used, like .append().  Then, it isn't
> necessary to loop over the streams since we can make use of Numpy's
> broadcasting, which will handle the looping in its C++ code, as you
> mentioned.  I'll work on this and see if it really is as straight-forward
> as it seems.
>
> If you have any advice on how to use DataSketches for my problem, I'm
> certainly open to that.
>
> Thanks,
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Wednesday, May 6, 2020 4:37 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>; dev@datasketches.apache.org
> <de...@datasketches.apache.org>
> *Cc:* Edo Liberty <ed...@gmail.com>; edo@edoliberty.com <
> edo@edoliberty.com>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Michael,
>
> Thank you for considering the DataSketches library.   I am adding this
> thread to our dev@datasketches.apache.org so that our whole team can
> contribute to finding a solution for you.
>
> WRT the error you experienced, please help us help you by sharing with us
> what the exact error was.
>
> We are about to release a major upgrade to the DataSketches C++/Python
> product in the next few weeks.  We have fixed a number of stability issues
> and bugs, which may solve the problem.  Nonetheless, we want to work with
> you to get your problem solved.
>
> Updating 1e5 sketches in a system is not a problem in Java or C++.   We
> have real-time systems today that generate and process over 1e9 sketches
> every day.  Unfortunately our experience tells us that looping in Python
> code will be 10 to 100 times slower than Java or C++.  This is because the
> code would have to switch from Python to C++ for every vector element.
>
> By comparison, the streaming-quantiles code could be easily modified to
> use Numpy arrays and operate on vectors.
>
>
> I would like to understand more about what you have in mind that would be
> "easily modified".
>
> NumPy achieves its speed performance by doing all of the matrix operations
> in pre-compiled C++ code.  To achieve best performance, we would want to
> read and loop through the NumPy data structure on the C++ side leveraging
> the C++ DataSketches library directly.  I am not sure what would be
> involved to actually accomplish that.
>
> But first we need to get your Python + NumPy code working correctly with
> our library so we can find out what its actual performance is.
>
> Cheers,
>
> Lee.
>
>
>
>
>
> On Wed, May 6, 2020 at 12:10 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Edo, Lee,
>
> Thanks for the prompt response.  I looked at the datasketches library, and
> while it seems to have a lot more features, it looks like it'll be a lot
> more difficult to get it to work for my desired use case.
>
> My problem is that I need quantiles for each element of a vector (length
> on the order of 1e4 -- 1e5), for some finite stream of vectors (on the
> order of 1e6 -- 1e8).  I tried using datasketches's KLL with Numpy arrays,
> but it throws an error, so it doesn't seem like datasketches handles this
> situation currently.
>
> To use datasketches, I think I would need to instantiate 1 object per
> vector element, and I suspect this will slow things down considerably due
> to iterating over the objects when each vector is processed.  By
> comparison, the streaming-quantiles code could be easily modified to use
> Numpy arrays and operate on vectors.  I ran a few unit tests on both codes
> and found equivalent behavior, as expected.
>
> Do you have any recommendation(s) for this situation?  Are there known
> limitations of the streaming-quantiles code that would cause issues for my
> use case?  Are the other methods offered in datasketches 'better' than the
> KLL implemented in streaming-quantiles?  I'm quite out of my area of
> expertise, so I appreciate any advice you can offer, and I will of course
> acknowledge it in the publication.
>
> Best,
> Michael
>
> ------------------------------
> *From:* Edo Liberty <ed...@gmail.com>
> *Sent:* Tuesday, May 5, 2020 8:09 PM
> *To:* Lee Rhodes <lr...@verizonmedia.com>; Michael Himes <
> mhimes@knights.ucf.edu>
> *Cc:* edo@edoliberty.com <ed...@edoliberty.com>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> +Lee
>
> Hi Michael, Thanks for reaching out.
> While you can certainly do that, I recommend using the Python-bound
> datasketches library. It will be more robust, faster, and less buggy than my
> code :)
>
> On Tue, May 5, 2020 at 14:11 Michael Himes <mh...@knights.ucf.edu> wrote:
>
> Hi Edo,
>
> I'm currently working on a Python package for machine-learning-accelerated
> exoplanet modeling.  It is free and open source (see here if you're curious
> https://github.com/exosports/HOMER),
> and it's meant purely for reproducible academic research.
>
> I'm adding some new features to the software, and one of them requires
> computing quantiles for a data set that cannot fit into memory.  After
> searching around for different methods to do this, your KLL method seemed
> to be a good option in terms of speed and space requirements.
>
> Rather than reinvent the wheel and code my own implementation of the
> method from scratch, I was wondering if you'd be willing to allow me to use
> your code?  I don't see a license, so I wanted to make sure you're okay
> with this.  I could implement it as a submodule within my repo, or I could
> only include the kll.py file and add some additional comments pointing to
> your repo and such, whichever you prefer.
>
> Best,
> Michael
>
> --
> From my cell phone.
>
>

Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Posted by Michael Himes <mh...@knights.ucf.edu>.
I'm not quite sure what being a committer entails, but yeah I'm happy to contribute.  I can't commit a lot of time to working on it, but with how things went for KLL I don't think it will take a lot of time for the other sketches if they are formatted in a similar manner.  Getting this library integrated into numpy/scipy would be awesome, I'm sure I could get some others in my field to begin using it.

Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>
Sent: Saturday, May 9, 2020 5:06 PM
To: Michael Himes <mh...@knights.ucf.edu>; dev@datasketches.apache.org <de...@datasketches.apache.org>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

This is just awesome!   Would you be interested in becoming a committer on our project?  It is not automatic, but we could work with you to bring you up to speed on the other sketches in the library.  If you could help us integrate DataSketches into NumPy and possibly SciPy (not sure if this is necessary) it would be a very significant contribution and we would definitely want you to be part of our community!

Thanks,

Lee.

On Sat, May 9, 2020 at 1:41 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Lee,

Thanks for the notice, I went ahead and subscribed to the list.

As for Jon's email, this is actually what I have currently implemented!  Once I finish ironing out a couple improvements, I'm going to move some code around to follow the existing coding style, put it on Github, and submit a pull request.

Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Saturday, May 9, 2020 4:22 PM
To: Michael Himes <mh...@knights.ucf.edu>>
Subject: Fwd: Permission to use KLL streaming-quantiles code in free open-source academic software

Hi Michael,
I don't think you saw this email as I doubt you are subscribed to our dev@datasketches.apache.org<ma...@datasketches.apache.org> email list.

We would like to have you as part of our larger community, as others might also have suggestions on how to move your project forward.
You can subscribe by sending an empty email to dev-subscribe@datasketches.apache.org<ma...@datasketches.apache.org>.

Lee.

---------- Forwarded message ---------
From: Jon Malkin <jo...@gmail.com>>
Date: Thu, May 7, 2020 at 4:11 PM
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software
To: <de...@datasketches.apache.org>>
Cc: Lee Rhodes <lr...@verizonmedia.com>>, Edo Liberty <ed...@gmail.com>>, edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>


We're using pybind11 to get a C++ interface with python (vs raw C). The wrappers themselves are quite thin, but they do have examples of calling functions defined in the wrapper as opposed to only the sketch object.

I believe the easiest way to do this will be to define a pretty simple C++ object and create a pybind wrapper for it.  That object would contain a std::vector<kll_sketch>.  Then you'd define an update method for your custom object that iterates through a numpy array and calls update() on the appropriate sketch. You'd also want to define something similar for get_quantile() or whatever other methods you need that iterates through that vector of sketches and returns the result in a numpy array.

That's a pretty lightweight object. And then you'd use a similar thin pybind wrapper around it to make it play nicely with python. Since our C++ library is just templates, you'd end up with a free-standing library, with no requirement that the base datasketches library be involved.
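
As a rough picture of the interface, a pure-Python stand-in (with made-up names) would look something like this; the real wrapper would hold the std::vector<kll_sketch> and run both loops in C++:

import numpy as np
from datasketches import kll_floats_sketch

class VectorOfKllSketches:
    # stand-in for the pybind11-wrapped C++ object described above (names are hypothetical)
    def __init__(self, k, num_dims):
        self._sketches = [kll_floats_sketch(k) for _ in range(num_dims)]

    def update(self, vector):
        # iterate the numpy array, one value per sketch -- the C++ loop in the real wrapper
        for sk, x in zip(self._sketches, vector):
            sk.update(float(x))

    def get_quantile(self, rank):
        # collect one answer per sketch into a numpy array
        return np.array([sk.get_quantile(rank) for sk in self._sketches])

vks = VectorOfKllSketches(k=200, num_dims=100)
for row in np.random.randn(1000, 100).astype(np.float32):
    vks.update(row)
medians = vks.get_quantile(0.5)          # length-100 array, one median per dimension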

  jon

On Thu, May 7, 2020 at 1:08 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
I would be happy to share whatever I come up with (if anything).  The lack of a Numpy/Scipy implementation is what led me to the DataSketches library, it would be very useful to myself and others if it were a part of Numpy/Scipy.

For what it's worth, passing in a Numpy array and manipulating it from the C++ side is quite easy.  On the other hand, figuring out how to spawn m sketches and pass the values along to that looks like it'll be more challenging, there is a lot of code here and it'll take some time for me to familiarize myself with it.

Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Thursday, May 7, 2020 12:00 PM
To: Michael Himes <mh...@knights.ucf.edu>>
Cc: Edo Liberty <ed...@gmail.com>>; dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>; edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

If you do figure out how to do this, it would be great if you could share it with us.  We would like to extend it to other sketches and submit it as added functionality to NumPy.  I have been looking at the NumPy and SciPy libraries and have not found anything close to what we have.

Lee.


On Thu, May 7, 2020 at 7:08 AM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Lee, Jon,

Thanks for the information.  I tried to vectorize things this morning and ran into that exact problem -- since the offsets can differ, it leads to slices of different lengths, which wouldn't be possible to store as a single Numpy array.

Lee, your understanding of my problem is spot on.  n vectors of size m, where all m elements of each vector are a float (no NaNs or missing values).  I am interested in quantiles at rank r for each of the m streams.  Only 1 sketch will operate simultaneously, saving/loading the sketch is not required (though it would be a nice feature), and sketches would not need to be merged (no serialization/deserialization).

Not surprisingly, it looks like your original suggestion of handling this on the C++ side is the way to go.  Once I have time to dive into the code, my plan is to write something that implements what you described in the earlier email.

Thanks,
Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Wednesday, May 6, 2020 10:43 PM
To: Michael Himes <mh...@knights.ucf.edu>>
Cc: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>; Edo Liberty <ed...@gmail.com>>; edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>

Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Michael,

One of my colleagues, Jon Malkin, pointed out that the vector-KLL will not work for another reason and that is for each dimension, choosing whether to delete the odd or even values in the compactor must be random and independent of the other dimensions.  Otherwise you might get unwanted correlation effects between the dimensions.

This is another argument that you should have independent compactors for each dimension.  So you might as well stick with individual sketches for each dimension.

Lee.

On Wed, May 6, 2020 at 4:39 PM Lee Rhodes <lr...@verizonmedia.com>> wrote:
Michael,

Allow me to back up for a moment to make sure I understand your problem.

You have a large number of large vectors of the form V_n = {x_i}:  n vectors of size m, where x is a number and x_i is the ith element, or equivalently, the ith dimension.

Assumptions:

  *   All vectors, V, are of the same size m.
  *   All elements, x_i, are valid numbers of the same type. No missing values, and if you are using floats, this means no NaNs.

In aggregate, the n vectors represent m independent distributions of values.

Your task is to be able to obtain m quantiles at rank r in a single query.

****
To do this, using your idea, would require vectorization of the entire sketch and not just the compactors.  The inputs are vectors, the result of operations such as getQuantile(r), getQuantileUpperBound(r), getQuantileLowerBound(r), are also vectors.
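
A toy, fully in-memory version makes those shapes concrete (exact NumPy computation, feasible only because the toy stream fits in memory):

import numpy as np

n, m = 1_000, 8                          # n input vectors, each of dimension m
stream = np.random.randn(n, m)           # the finite stream of vectors

r = 0.5
result = np.quantile(stream, r, axis=0)  # getQuantile(r): an m-length vector of answers
assert result.shape == (m,)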

This sketch will be a large data structure, which leads to more questions ...

  *   Do you anticipate having many of these vectorized sketches operating simultaneously?
  *   Is there any requirement to store and later retrieve this sketch?
  *   Or, the nearly equivalent question: Do you require merging of these sketches (across clusters, for example)?  Which also means serialization and deserialization.

I am concerned that this vector-quantiles sketch would be limited in the sense that it may not be as widely applicable as it could be.

Our experience with real data is that it is ugly with missing values, NaN, nulls, etc.  Which means we would not be able to vectorize the compactor.  Each dimension i would need a separate independent compactor because the compaction times will vary depending on missing values or NaNs in the data.

Space-wise, I don't think there would be much difference between having separate independent sketches for each dimension and vectorizing the entire sketch, because the internals of the existing sketch are already quite space efficient, leveraging compact arrays, etc.

As a first step I would favor figuring out how to access the NumPy data structure on the C++ side, having individual sketches for each dimension, and doing the iterations updating the sketches in C++.   It also has the advantage of leveraging code that exists and it would automatically be able to leverage any improvements to the sketch code over time.  In addition, it could be a prototype of how to integrate other sketches into the NumPy ecosystem.

A fully vectorized sketch would be a separate implementation and would not be able to take advantage of these points.

Lee.


On Wed, May 6, 2020 at 2:47 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Lee,

I don't think there is a problem with the DataSketches library, just that it doesn't support what I am trying to do -- looking in the documentation, it only supports streams of ints or floats, and those situations work fine for me.  Here's what I did:
- began with the KLL test .py file: https://github.com/apache/incubator-datasketches-cpp/blob/master/python/tests/kll_test.py
- replaced line 30 with kll.update(np.ones(10) * randn())  to have a Numpy array of 10 identical values.
- ran the code

This leads to the following error, as expected:
TypeError: update(): incompatible function arguments. The following argument types are supported:
    1. (self: datasketches.kll_floats_sketch, item: float) -> None

Invoked with: <datasketches.kll_floats_sketch object at 0x7f1e128989d0>, array([-1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424,
       -1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424])

It's not coded to support Numpy arrays, therefore it complains.  What I would ideally like to have happen in this scenario is it would treat each element in the array as a separate stream.  Then, later when getting a given quantile, it would give 10 values, one for each stream.  I don't see an easy approach to implementing this on the Python side besides a very slow iterative approach, and admittedly my C++ is quite rusty so I haven't looked into the codebase to see how I might modify things there to support this functionality.

Re: the streaming-quantiles code being easily modified, I believe the only necessary changes would be making the Compactor class a subclass of numpy.ndarray rather than list, and implementing the list-specific methods that are used, like .append().  Then it isn't necessary to loop over the streams, since we can make use of Numpy's broadcasting, which handles the looping in its C++ code, as you mentioned.  I'll work on this and see if it really is as straightforward as it seems.

If you have any advice on how to use DataSketches for my problem, I'm certainly open to that.

Thanks,
Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Wednesday, May 6, 2020 4:37 PM
To: Michael Himes <mh...@knights.ucf.edu>>; dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Cc: Edo Liberty <ed...@gmail.com>>; edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Michael,

Thank you for considering the DataSketches library.   I am adding this thread to our dev@datasketches.apache.org<ma...@datasketches.apache.org> so that our whole team can contribute to finding a solution for you.

WRT the error you experienced, please help us help you by sharing with us what the exact error was.

We are about to release a major upgrade to the DataSketches C++/Python product in the next few weeks.  We have fixed a number of stability issues and bugs, which may solve the problem.  Nonetheless, we want to work with you to get your problem solved.

Updating 1e5 sketches in a system is not a problem in Java or C++.   We have real-time systems today that generate and process over 1e9 sketches every day.  Unfortunately our experience tells us that looping in Python code will be 10 to 100 times slower than Java or C++.  This is because the code would have to switch from Python to C++ for every vector element.

By comparison, the streaming-quantiles code could be easily modified to use Numpy arrays and operate on vectors.

I would like to understand more about what you have in mind that would be "easily modified".

NumPy achieves its speed performance by doing all of the matrix operations in pre-compiled C++ code.  To achieve best performance, we would want to read and loop through the NumPy data structure on the C++ side leveraging the C++ DataSketches library directly.  I am not sure what would be involved to actually accomplish that.

But first we need to get your Python + NumPy code working correctly with our library so we can find out what its actual performance is.

Cheers,

Lee.





On Wed, May 6, 2020 at 12:10 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Edo, Lee,

Thanks for the prompt response.  I looked at the datasketches library, and while it seems to have a lot more features, it looks like it'll be a lot more difficult to get it to work for my desired use case.

My problem is that I need quantiles for each element of a vector (length on the order of 1e4 -- 1e5), for some finite stream of vectors (on the order of 1e6 -- 1e8).  I tried using datasketches's KLL with Numpy arrays, but it throws an error, so it doesn't seem like datasketches handles this situation currently.

To use datasketches, I think I would need to instantiate 1 object per vector element, and I suspect this will slow things down considerably due to iterating over the objects when each vector is processed.  By comparison, the streaming-quantiles code could be easily modified to use Numpy arrays and operate on vectors.  I ran a few unit tests on both codes and found equivalent behavior, as expected.

Do you have any recommendation(s) for this situation?  Are there known limitations of the streaming-quantiles code that would cause issues for my use case?  Are the other methods offered in datasketches 'better' than the KLL implemented in streaming-quantiles?  I'm quite out of my area of expertise, so I appreciate any advice you can offer, and I will of course acknowledge it in the publication.

Best,
Michael

________________________________
From: Edo Liberty <ed...@gmail.com>>
Sent: Tuesday, May 5, 2020 8:09 PM
To: Lee Rhodes <lr...@verizonmedia.com>>; Michael Himes <mh...@knights.ucf.edu>>
Cc: edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

+Lee

Hi Michael, Thanks for reaching out.
While you can certainly do that, I recommend using the Python-bound datasketches library. It will be more robust, faster, and less buggy than my code :)

On Tue, May 5, 2020 at 14:11 Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Edo,

I'm currently working on a Python package for machine-learning-accelerated exoplanet modeling.  It is free and open source (see here if you're curious https://github.com/exosports/HOMER), and it's meant purely for reproducible academic research.

I'm adding some new features to the software, and one of them requires computing quantiles for a data set that cannot fit into memory.  After searching around for different methods to do this, your KLL method seemed to be a good option in terms of speed and space requirements.

Rather than reinvent the wheel and code my own implementation of the method from scratch, I was wondering if you'd be willing to allow me to use your code?  I don't see a license, so I wanted to make sure you're okay with this.  I could implement it as a submodule within my repo, or I could only include the kll.py file and add some additional comments pointing to your repo and such, whichever you prefer.

Best,
Michael
--
From my cell phone.

Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Posted by Lee Rhodes <lr...@verizonmedia.com.INVALID>.
This is just awesome!   Would you be interested in becoming a committer on
our project?  It is not automatic, but we could work with you to bring you
up to speed on the other sketches in the library.  If you could help us
integrate DataSketches into NumPy and possibly SciPy (not sure if this is
necessary) it would be a very significant contribution and we would
definitely want you to be part of our community!

Thanks,

Lee.

On Sat, May 9, 2020 at 1:41 PM Michael Himes <mh...@knights.ucf.edu> wrote:

> Hi Lee,
>
> Thanks for the notice, I went ahead and subscribed to the list.
>
> As for Jon's email, this is actually what I have currently implemented!
> Once I finish ironing out a couple improvements, I'm going to move some
> code around to follow the existing coding style, put it on Github, and
> submit a pull request.
>
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Saturday, May 9, 2020 4:22 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>
> *Subject:* Fwd: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Hi Michael,
> I don't think you saw this email as I doubt you are subscribed to our
> dev@datasketches.apache.org email list.
>
> We would like to have you as part of our larger community, as others might
> also have suggestions on how to move your project forward.
> You can subscribe by sending an empty email to
> dev-subscribe@datasketches.apache.org.
>
> Lee.
>
> ---------- Forwarded message ---------
> From: *Jon Malkin* <jo...@gmail.com>
> Date: Thu, May 7, 2020 at 4:11 PM
> Subject: Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
> To: <de...@datasketches.apache.org>
> Cc: Lee Rhodes <lr...@verizonmedia.com>, Edo Liberty <
> edo.liberty@gmail.com>, edo@edoliberty.com <ed...@edoliberty.com>
>
>
> We're using pybind11 to get a C++ interface with python (vs raw C). The
> wrappers themselves are quite thin, but they do have examples of calling
> functions defined in the wrapper as opposed to only the sketch object.
>
> I believe the easiest way to do this will be to define a pretty simple C++
> object and create a pybind wrapper for it.  That object would contain a
> std::vector<kll_sketch>.  Then you'd define an update method for your
> custom object that iterates through a numpy array and calls update() on the
> appropriate sketch. You'd also want to define something similar for
> get_quantile() or whatever other methods you need that iterates through
> that vector of sketches and returns the result in a numpy array.
>
> That's a pretty lightweight object. And then you'd use a similar thin
> pybind wrapper around it to make it play nicely with python. Since our C++
> library is just templates, you'd end up with a free-standing library, with
> no requirement that the base datasketches library be involved.
>
>   jon
>
> On Thu, May 7, 2020 at 1:08 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> I would be happy to share whatever I come up with (if anything).  The lack
> of a Numpy/Scipy implementation is what led me to the DataSketches library,
> it would be very useful to myself and others if it were a part of
> Numpy/Scipy.
>
> For what it's worth, passing in a Numpy array and manipulating it from the
> C++ side is quite easy.  On the other hand, figuring out how to spawn m
> sketches and pass the values along to that looks like it'll be more
> challenging, there is a lot of code here and it'll take some time for me to
> familiarize myself with it.
>
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Thursday, May 7, 2020 12:00 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>
> *Cc:* Edo Liberty <ed...@gmail.com>; dev@datasketches.apache.org <
> dev@datasketches.apache.org>; edo@edoliberty.com <ed...@edoliberty.com>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> If you do figure out how to do this, it would be great if you could share
> it with us.  We would like to extend  it to other sketches and submit it as
> an added functionality to NumPy.  I have been looking at the NumPy and
> SciPy libraries and have not found anything close to what we have.
>
> Lee.
>
>
> On Thu, May 7, 2020 at 7:08 AM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Lee, Jon,
>
> Thanks for the information.  I tried to vectorize things this morning and
> ran into that exact problem -- since the offsets can differ, it leads to
> slices of different lengths, which wouldn't be possible to store as a
> single Numpy array.
>
> Lee, your understanding of my problem is spot on.  n vectors of size m,
> where all m elements of each vector are a float (no NaNs or missing
> values).  I am interested in quantiles at rank r for each of the m
> streams.  Only 1 sketch will operate simultaneously, saving/loading the
> sketch is not required (though it would be a nice feature), and sketches
> would not need to be merged (no serialization/deserialization).
>
> Not surprisingly, it looks like your original suggestion of handling this
> on the C++ side is the way to go.  Once I have time to dive into the code,
> my plan is to write something that implements what you described in the
> earlier email.
>
> Thanks,
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Wednesday, May 6, 2020 10:43 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>
> *Cc:* dev@datasketches.apache.org <de...@datasketches.apache.org>; Edo
> Liberty <ed...@gmail.com>; edo@edoliberty.com <ed...@edoliberty.com>
>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Michael,
>
> One of my colleagues, Jon Malkin, pointed out that the vector-KLL will not
> work for another reason and that is for each dimension, choosing whether to
> delete the odd or even values in the compactor must be random and
> independent of the other dimensions.  Otherwise you might get unwanted
> correlation effects between the dimensions.
>
> This is another argument that you should have independent compactors for
> each dimension.  So you might as well stick with individual sketches for
> each dimension.
>
> Lee.
>
> On Wed, May 6, 2020 at 4:39 PM Lee Rhodes <lr...@verizonmedia.com>
> wrote:
>
> Michael,
>
> Allow me to back up for a moment to make sure I understand your problem.
>
> You have a large number of large vectors of the form *V_n = {x_i}:*  *n*
> vectors of size *m, *where *x* is a *number* and *x_i* is the *i*th
> element, or equivalently, the *i*th dimension.
>
> Assumptions:
>
>    - All vectors, *V*, are of the same size *m.*
>    - All elements, *x_i*, are valid numbers of the same type. No missing
>    values, and if you are using *floats*, this means no *NaN*s.
>
> In aggregate, the *n* vectors represent *m* *independent* distributions
> of values.
>
> Your task is to be able to obtain *m* quantiles at rank *r* in a single
> query.
>
> ****
> To do this, using your idea, would require vectorization of the entire
> sketch and not just the compactors.  The inputs are vectors, the result of
> operations such as getQuantile(r), getQuantileUpperBound(r),
> getQuantileLowerBound(r), are also vectors.
>
> This sketch will be a large data structure, which leads to more questions
> ...
>
>    - Do you anticipate having many of these vectorized sketches operating
>    simultaneously?
>    - Is there any requirement to store and later retrieve this sketch?
>    - Or, the nearly equivalent question: Do you require merging of these
>    sketches (across clusters, for example)?  Which also means serialization
>    and deserialization.
>
> I am concerned that this vector-quantiles sketch would be limited in the
> sense that it may not be as widely applicable as it could be.
>
> Our experience with real data is that it is ugly with missing values, NaN,
> nulls, etc.  Which means we would not be able to vectorize the compactor.
> Each dimension *i* would need a separate independent compactor because
> the compaction times will vary depending on missing values or NaNs in the
> data.
>
> Spacewise, I don't think having separate independent sketches for each
> dimension would be much smaller than vectorizing the entire sketch, because
> the internals of the existing sketch are already quite space efficient
> leveraging compact arrays, etc.
>
> As a first step I would favor figuring out how to access the NumPy data
> structure on the C++ side, having individual sketches for each
> dimension, and doing the iterations updating the sketches in C++.   It also
> has the advantage of leveraging code that exists and it would automatically
> be able to leverage any improvements to the sketch code over time.  In
> addition, it could be a prototype of how to integrate other sketches into
> the NumPy ecosystem.
>
> A fully vectorized sketch would be a separate implementation and would not
> be able to take advantage of these points.
>
> Lee.
>
>
> On Wed, May 6, 2020 at 2:47 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Lee,
>
> I don't think there is a problem with the DataSketches library, just that
> it doesn't support what I am trying to do -- looking in the documentation,
> it only supports streams of ints or floats, and those situations work fine
> for me.  Here's what I did:
> - began with the KLL test .py file:
> https://github.com/apache/incubator-datasketches-cpp/blob/master/python/tests/kll_test.py
> - replaced line 30 with kll.update(np.ones(10) * randn())  to have a Numpy
> array of 10 identical values.
> - ran the code
>
> This leads to the following error, as expected:
> TypeError: update(): incompatible function arguments. The following
> argument types are supported:
>     1. (self: datasketches.kll_floats_sketch, item: float) -> None
>
> Invoked with: <datasketches.kll_floats_sketch object at 0x7f1e128989d0>,
> array([-1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424,
>        -1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424])
>
> It's not coded to support Numpy arrays, therefore it complains.  What I
> would ideally like to have happen in this scenario is it would treat each
> element in the array as a separate stream.  Then, later when getting a
> given quantile, it would give 10 values, one for each stream.  I don't see
> an easy approach to implementing this on the Python side besides a very
> slow iterative approach, and admittedly my C++ is quite rusty so I haven't
> looked into the codebase to see how I might modify things there to support
> this functionality.
>
> Re: the streaming-quantiles code being easily modified, I believe the only
> necessary changes would be changing the Compactor class to be a subclass of
> numpy.ndarray, rather than list, and implementing methods for the
> list-specific methods that are used, like .append().  Then, it isn't
> necessary to loop over the streams since we can make use of Numpy's
> broadcasting, which will handle the looping in its C++ code, as you
> mentioned.  I'll work on this and see if it really is as straight-forward
> as it seems.
>
> If you have any advice on how to use DataSketches for my problem, I'm
> certainly open to that.
>
> Thanks,
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Wednesday, May 6, 2020 4:37 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>; dev@datasketches.apache.org
> <de...@datasketches.apache.org>
> *Cc:* Edo Liberty <ed...@gmail.com>; edo@edoliberty.com <
> edo@edoliberty.com>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Michael,
>
> Thank you for considering the DataSketches library.   I am adding this
> thread to our dev@datasketches.apache.org so that our whole team can
> contribute to finding a solution for you.
>
> WRT the error you experienced, please help us help you by sharing with us
> what the exact error was.
>
> We are about to release a major upgrade to the DataSketches C++/Python
> product in the next few weeks.  We have fixed a number of stability issues
> and bugs, which may solve the problem.  Nonetheless, we want to work with
> you to get your problem solved.
>
> Updating 1e5 sketches in a system is not a problem in Java or C++.   We
> have real-time systems today that generate and process over 1e9 sketches
> every day.  Unfortunately our experience tells us that looping in Python
> code will be 10 to 100 times slower than Java or C++.  This is because the
> code would have to switch from Python to C++ for every vector element.
>
> By comparison, the streaming-quantiles code could be easily modified to
> use Numpy arrays and operate on vectors.
>
>
> I would like to understand more about what you have in mind that would be
> "easily modified".
>
> NumPy achieves its speed performance by doing all of the matrix operations
> in pre-compiled C++ code.  To achieve best performance, we would want to
> read and loop through the NumPy data structure on the C++ side leveraging
> the C++ DataSketches library directly.  I am not sure what would be
> involved to actually accomplish that.
>
> But first we need to get your Python + NumPy code working correctly with
> our library so we can find out what its actual performance is.
>
> Cheers,
>
> Lee.
>
>
>
>
>
> On Wed, May 6, 2020 at 12:10 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Edo, Lee,
>
> Thanks for the prompt response.  I looked at the datasketches library, and
> while it seems to have a lot more features, it looks like it'll be a lot
> more difficult to get it to work for my desired use case.
>
> My problem is that I need quantiles for each element of a vector (length
> on the order of 1e4 -- 1e5), for some finite stream of vectors (on the
> order of 1e6 -- 1e8).  I tried using datasketches's KLL with Numpy arrays,
> but it throws an error, so it doesn't seem like datasketches handles this
> situation currently.
>
> To use datasketches, I think I would need to instantiate 1 object per
> vector element, and I suspect this will slow things down considerably due
> to iterating over the objects when each vector is processed.  By
> comparison, the streaming-quantiles code could be easily modified to use
> Numpy arrays and operate on vectors.  I ran a few unit tests on both codes
> and found equivalent behavior, as expected.
>
> Do you have any recommendation(s) for this situation?  Are there known
> limitations of the streaming-quantiles code that would cause issues for my
> use case?  Are the other methods offered in datasketches 'better' than the
> KLL implemented in streaming-quantiles?  I'm quite out of my area of
> expertise, so I appreciate any advice you can offer, and I will of course
> acknowledge it in the publication.
>
> Best,
> Michael
>
> ------------------------------
> *From:* Edo Liberty <ed...@gmail.com>
> *Sent:* Tuesday, May 5, 2020 8:09 PM
> *To:* Lee Rhodes <lr...@verizonmedia.com>; Michael Himes <
> mhimes@knights.ucf.edu>
> *Cc:* edo@edoliberty.com <ed...@edoliberty.com>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> +Lee
>
> Hi Michael, Thanks for reaching out.
> While you can certainly do that, I recommend using the Python-bound
> datasketches library. It will be more robust, faster, and bug free than my
> code :)
>
> On Tue, May 5, 2020 at 14:11 Michael Himes <mh...@knights.ucf.edu> wrote:
>
> Hi Edo,
>
> I'm currently working on a Python package for machine-learning-accelerated
> exoplanet modeling.  It is free and open source (see here if you're curious
> https://github.com/exosports/HOMER),
> and it's meant purely for reproducible academic research.
>
> I'm adding some new features to the software, and one of them requires
> computing quantiles for a data set that cannot fit into memory.  After
> searching around for different methods to do this, your KLL method seemed
> to be a good option in terms of speed and space requirements.
>
> Rather than reinvent the wheel and code my own implementation of the
> method from scratch, I was wondering if you'd be willing to allow me to use
> your code?  I don't see a license, so I wanted to make sure you're okay
> with this.  I could implement it as a submodule within my repo, or I could
> only include the kll.py file and add some additional comments pointing to
> your repo and such, whichever you prefer.
>
> Best,
> Michael
>
> --
> From my cell phone.
>
>

Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Posted by Jon Malkin <jo...@gmail.com>.
We're using pybind11 to get a C++ interface with python (vs raw C). The
wrappers themselves are quite thin, but they do have examples of calling
functions defined in the wrapper as opposed to only the sketch object.

I believe the easiest way to do this will be to define a pretty simple C++
object and create a pybind wrapper for it.  That object would contain a
std::vector<kll_sketch>.  Then you'd define an update method for your
custom object that iterates through a numpy array and calls update() on the
appropriate sketch. You'd also want to define something similar for
get_quantile() or whatever other methods you need that iterates through
that vector of sketches and returns the result in a numpy array.

That's a pretty lightweight object. And then you'd use a similar thin
pybind wrapper around it to make it play nicely with python. Since our C++
library is just templates, you'd end up with a free-standing library, with
no requirement that the base datasketches library be involved.
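
A minimal sketch of the kind of wrapper described above, assuming pybind11 and
the header-only kll_sketch<float> from the C++ library; the class and module
names (vector_kll, vector_kll_module) are purely illustrative:

// Illustrative only: one KLL sketch per dimension, updated from a NumPy array.
#include <vector>
#include <pybind11/pybind11.h>
#include <pybind11/numpy.h>
#include <kll_sketch.hpp>

namespace py = pybind11;

class vector_kll {
 public:
  vector_kll(uint32_t num_dims, uint16_t k)
      : sketches_(num_dims, datasketches::kll_sketch<float>(k)) {}

  // One call per incoming vector; element i of the array feeds sketch i.
  // Assumes values has length num_dims.
  void update(py::array_t<float, py::array::c_style | py::array::forcecast> values) {
    auto v = values.unchecked<1>();
    for (py::ssize_t i = 0; i < v.shape(0); ++i)
      sketches_[i].update(v(i));
  }

  // Quantile at the given rank for every dimension, returned as a NumPy array.
  py::array_t<float> get_quantiles(double rank) const {
    py::array_t<float> result(static_cast<py::ssize_t>(sketches_.size()));
    auto r = result.mutable_unchecked<1>();
    for (size_t i = 0; i < sketches_.size(); ++i)
      r(i) = sketches_[i].get_quantile(rank);
    return result;
  }

 private:
  std::vector<datasketches::kll_sketch<float>> sketches_;
};

PYBIND11_MODULE(vector_kll_module, m) {
  py::class_<vector_kll>(m, "vector_kll")
      .def(py::init<uint32_t, uint16_t>(), py::arg("num_dims"), py::arg("k"))
      .def("update", &vector_kll::update)
      .def("get_quantiles", &vector_kll::get_quantiles);
}

From Python one would then create something like vector_kll(m, 200), call
update(row) once per incoming vector, and call get_quantiles(0.5) to get the
per-dimension medians back as a NumPy array.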

  jon

On Thu, May 7, 2020 at 1:08 PM Michael Himes <mh...@knights.ucf.edu> wrote:

> I would be happy to share whatever I come up with (if anything).  The lack
> of a Numpy/Scipy implementation is what led me to the DataSketches library;
> it would be very useful to me and others if it were part of
> Numpy/Scipy.
>
> For what it's worth, passing in a Numpy array and manipulating it from the
> C++ side is quite easy.  On the other hand, figuring out how to spawn m
> sketches and pass the values along to that looks like it'll be more
> challenging; there is a lot of code here and it'll take some time for me to
> familiarize myself with it.
>
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Thursday, May 7, 2020 12:00 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>
> *Cc:* Edo Liberty <ed...@gmail.com>; dev@datasketches.apache.org <
> dev@datasketches.apache.org>; edo@edoliberty.com <ed...@edoliberty.com>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> If you do figure out how to do this, it would be great if you could share
> it with us.  We would like to extend  it to other sketches and submit it as
> an added functionality to NumPy.  I have been looking at the NumPy and
> SciPy libraries and have not found anything close to what we have.
>
> Lee.
>
>
> On Thu, May 7, 2020 at 7:08 AM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Lee, Jon,
>
> Thanks for the information.  I tried to vectorize things this morning and
> ran into that exact problem -- since the offsets can differ, it leads to
> slices of different lengths, which wouldn't be possible to store as a
> single Numpy array.
>
> Lee, your understanding of my problem is spot on.  n vectors of size m,
> where all m elements of each vector are a float (no NaNs or missing
> values).  I am interested in quantiles at rank r for each of the m
> streams.  Only 1 sketch will operate simultaneously, saving/loading the
> sketch is not required (though it would be a nice feature), and sketches
> would not need to be merged (no serialization/deserialization).
>
> Not surprisingly, it looks like your original suggestion of handling this
> on the C++ side is the way to go.  Once I have time to dive into the code,
> my plan is to write something that implements what you described in the
> earlier email.
>
> Thanks,
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Wednesday, May 6, 2020 10:43 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>
> *Cc:* dev@datasketches.apache.org <de...@datasketches.apache.org>; Edo
> Liberty <ed...@gmail.com>; edo@edoliberty.com <ed...@edoliberty.com>
>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Michael,
>
> One of my colleagues, Jon Malkin, pointed out that the vector-KLL will not
> work for another reason: for each dimension, the choice of whether to
> delete the odd or even values in the compactor must be random and
> independent of the other dimensions.  Otherwise you might get unwanted
> correlation effects between the dimensions.
>
> This is another argument that you should have independent compactors for
> each dimension.  So you might as well stick with individual sketches for
> each dimension.
>
> Lee.
>
> On Wed, May 6, 2020 at 4:39 PM Lee Rhodes <lr...@verizonmedia.com>
> wrote:
>
> Michael,
>
> Allow me to back up for a moment to make sure I understand your problem.
>
> You have a large number of large vectors of the form *V_n = {x_i}:*  *n*
> vectors of size *m, *where *x* is a *number* and *x_i* is the *i*th
> element, or equivalently, the *i*th dimension.
>
> Assumptions:
>
>    - All vectors, *V*, are of the same size *m.*
>    - All elements, *x_i*, are valid numbers of the same type. No missing
>    values, and if you are using *floats*, this means no *NaN*s.
>
> In aggregate, the *n* vectors represent *m* *independent* distributions
> of values.
>
> Your task is to be able to obtain *m* quantiles at rank *r* in a single
> query.
>
> ****
> To do this, using your idea, would require vectorization of the entire
> sketch and not just the compactors.  The inputs are vectors, the result of
> operations such as getQuantile(r), getQuantileUpperBound(r),
> getQuantileLowerBound(r), are also vectors.
>
> This sketch will be a large data structure, which leads to more questions
> ...
>
>    - Do you anticipate having many of these vectorized sketches operating
>    simultaneously?
>    - Is there any requirement to store and later retrieve this sketch?
>    - Or, the nearly equivalent question: Do you require merging of these
>    sketches (across clusters, for example)?  Which also means serialization
>    and deserialization.
>
> I am concerned that this vector-quantiles sketch would be limited in the
> sense that it may not be as widely applicable as it could be.
>
> Our experience with real data is that it is ugly with missing values, NaN,
> nulls, etc.  Which means we would not be able to vectorize the compactor.
> Each dimension *i* would need a separate independent compactor because
> the compaction times will vary depending on missing values or NaNs in the
> data.
>
> Spacewise, I don't think having separate independent sketches for each
> dimension would be much smaller than vectorizing the entire sketch, because
> the internals of the existing sketch are already quite space efficient
> leveraging compact arrays, etc.
>
> As a first step I would favor figuring out how to access the NumPy data
> structure on the C++ side, having individual sketches for each
> dimension, and doing the iterations updating the sketches in C++.   It also
> has the advantage of leveraging code that exists and it would automatically
> be able to leverage any improvements to the sketch code over time.  In
> addition, it could be a prototype of how to integrate other sketches into
> the NumPy ecosystem.
>
> A fully vectorized sketch would be a separate implementation and would not
> be able to take advantage of these points.
>
> Lee.
>
>
> On Wed, May 6, 2020 at 2:47 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Lee,
>
> I don't think there is a problem with the DataSketches library, just that
> it doesn't support what I am trying to do -- looking in the documentation,
> it only supports streams of ints or floats, and those situations work fine
> for me.  Here's what I did:
> - began with the KLL test .py file:
> https://github.com/apache/incubator-datasketches-cpp/blob/master/python/tests/kll_test.py
> - replaced line 30 with kll.update(np.ones(10) * randn())  to have a Numpy
> array of 10 identical values.
> - ran the code
>
> This leads to the following error, as expected:
> TypeError: update(): incompatible function arguments. The following
> argument types are supported:
>     1. (self: datasketches.kll_floats_sketch, item: float) -> None
>
> Invoked with: <datasketches.kll_floats_sketch object at 0x7f1e128989d0>,
> array([-1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424,
>        -1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424])
>
> It's not coded to support Numpy arrays, therefore it complains.  What I
> would ideally like to have happen in this scenario is it would treat each
> element in the array as a separate stream.  Then, later when getting a
> given quantile, it would give 10 values, one for each stream.  I don't see
> an easy approach to implementing this on the Python side besides a very
> slow iterative approach, and admittedly my C++ is quite rusty so I haven't
> looked into the codebase to see how I might modify things there to support
> this functionality.
>
> Re: the streaming-quantiles code being easily modified, I believe the only
> necessary changes would be changing the Compactor class to be a subclass of
> numpy.ndarray, rather than list, and implementing methods for the
> list-specific methods that are used, like .append().  Then, it isn't
> necessary to loop over the streams since we can make use of Numpy's
> broadcasting, which will handle the looping in its C++ code, as you
> mentioned.  I'll work on this and see if it really is as straight-forward
> as it seems.
>
> If you have any advice on how to use DataSketches for my problem, I'm
> certainly open to that.
>
> Thanks,
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Wednesday, May 6, 2020 4:37 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>; dev@datasketches.apache.org
> <de...@datasketches.apache.org>
> *Cc:* Edo Liberty <ed...@gmail.com>; edo@edoliberty.com <
> edo@edoliberty.com>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Michael,
>
> Thank you for considering the DataSketches library.   I am adding this
> thread to our dev@datasketches.apache.org so that our whole team can
> contribute to finding a solution for you.
>
> WRT the error you experienced, please help us help you by sharing with us
> what the exact error was.
>
> We are about to release a major upgrade to the DataSketches C++/Python
> product in the next few weeks.  We have fixed a number of stability issues
> and bugs, which may solve the problem.  Nonetheless, we want to work with
> you to get your problem solved.
>
> Updating 1e5 sketches in a system is not a problem in Java or C++.   We
> have real-time systems today that generate and process over 1e9 sketches
> every day.  Unfortunately our experience tells us that looping in Python
> code will be 10 to 100 times slower than Java or C++.  This is because the
> code would have to switch from Python to C++ for every vector element.
>
> By comparison, the streaming-quantiles code could be easily modified to
> use Numpy arrays and operate on vectors.
>
>
> I would like to understand more about what you have in mind that would be
> "easily modified".
>
> NumPy achieves its speed performance by doing all of the matrix operations
> in pre-compiled C++ code.  To achieve best performance, we would want to
> read and loop through the NumPy data structure on the C++ side leveraging
> the C++ DataSketches library directly.  I am not sure what would be
> involved to actually accomplish that.
>
> But first we need to get your Python + NumPy code working correctly with
> our library so we can find out what its actual performance is.
>
> Cheers,
>
> Lee.
>
>
>
>
>
> On Wed, May 6, 2020 at 12:10 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Edo, Lee,
>
> Thanks for the prompt response.  I looked at the datasketches library, and
> while it seems to have a lot more features, it looks like it'll be a lot
> more difficult to get it to work for my desired use case.
>
> My problem is that I need quantiles for each element of a vector (length
> on the order of 1e4 -- 1e5), for some finite stream of vectors (on the
> order of 1e6 -- 1e8).  I tried using datasketches's KLL with Numpy arrays,
> but it throws an error, so it doesn't seem like datasketches handles this
> situation currently.
>
> To use datasketches, I think I would need to instantiate 1 object per
> vector element, and I suspect this will slow things down considerably due
> to iterating over the objects when each vector is processed.  By
> comparison, the streaming-quantiles code could be easily modified to use
> Numpy arrays and operate on vectors.  I ran a few unit tests on both codes
> and found equivalent behavior, as expected.
>
> Do you have any recommendation(s) for this situation?  Are there known
> limitations of the streaming-quantiles code that would cause issues for my
> use case?  Are the other methods offered in datasketches 'better' than the
> KLL implemented in streaming-quantiles?  I'm quite out of my area of
> expertise, so I appreciate any advice you can offer, and I will of course
> acknowledge it in the publication.
>
> Best,
> Michael
>
> ------------------------------
> *From:* Edo Liberty <ed...@gmail.com>
> *Sent:* Tuesday, May 5, 2020 8:09 PM
> *To:* Lee Rhodes <lr...@verizonmedia.com>; Michael Himes <
> mhimes@knights.ucf.edu>
> *Cc:* edo@edoliberty.com <ed...@edoliberty.com>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> +Lee
>
> Hi Michael, Thanks for reaching out.
> While you can certainly do that, I recommend using the Python-bound
> datasketches library. It will be more robust, faster, and bug free than my
> code :)
>
> On Tue, May 5, 2020 at 14:11 Michael Himes <mh...@knights.ucf.edu> wrote:
>
> Hi Edo,
>
> I'm currently working on a Python package for machine-learning-accelerated
> exoplanet modeling.  It is free and open source (see here if you're curious
> https://github.com/exosports/HOMER),
> and it's meant purely for reproducible academic research.
>
> I'm adding some new features to the software, and one of them requires
> computing quantiles for a data set that cannot fit into memory.  After
> searching around for different methods to do this, your KLL method seemed
> to be a good option in terms of speed and space requirements.
>
> Rather than reinvent the wheel and code my own implementation of the
> method from scratch, I was wondering if you'd be willing to allow me to use
> your code?  I don't see a license, so I wanted to make sure you're okay
> with this.  I could implement it as a submodule within my repo, or I could
> only include the kll.py file and add some additional comments pointing to
> your repo and such, whichever you prefer.
>
> Best,
> Michael
>
> --
> From my cell phone.
>

Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Posted by Michael Himes <mh...@knights.ucf.edu>.
I would be happy to share whatever I come up with (if anything).  The lack of a Numpy/Scipy implementation is what led me to the DataSketches library; it would be very useful to me and others if it were part of Numpy/Scipy.

For what it's worth, passing in a Numpy array and manipulating it from the C++ side is quite easy.  On the other hand, figuring out how to spawn m sketches and pass the values along to that looks like it'll be more challenging; there is a lot of code here and it'll take some time for me to familiarize myself with it.
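
As an illustration of that first point, a minimal pybind11 sketch of reading a
NumPy array on the C++ side can be as small as the following (the module and
function names are invented for illustration, and it does not use datasketches
at all):

// Illustrative only: sums a 1-D NumPy array entirely on the C++ side.
#include <pybind11/pybind11.h>
#include <pybind11/numpy.h>

namespace py = pybind11;

double sum_elements(py::array_t<double, py::array::c_style | py::array::forcecast> arr) {
  auto a = arr.unchecked<1>();        // fast, no-copy 1-D view of the array
  double total = 0.0;
  for (py::ssize_t i = 0; i < a.shape(0); ++i)
    total += a(i);
  return total;
}

PYBIND11_MODULE(numpy_demo, m) {
  m.def("sum_elements", &sum_elements, "Sum a 1-D NumPy array in C++");
}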

Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>
Sent: Thursday, May 7, 2020 12:00 PM
To: Michael Himes <mh...@knights.ucf.edu>
Cc: Edo Liberty <ed...@gmail.com>; dev@datasketches.apache.org <de...@datasketches.apache.org>; edo@edoliberty.com <ed...@edoliberty.com>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

If you do figure out how to do this, it would be great if you could share it with us.  We would like to extend it to other sketches and submit it as an added functionality to NumPy.  I have been looking at the NumPy and SciPy libraries and have not found anything close to what we have.

Lee.


On Thu, May 7, 2020 at 7:08 AM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Lee, Jon,

Thanks for the information.  I tried to vectorize things this morning and ran into that exact problem -- since the offsets can differ, it leads to slices of different lengths, which wouldn't be possible to store as a single Numpy array.

Lee, your understanding of my problem is spot on.  n vectors of size m, where all m elements of each vector are a float (no NaNs or missing values).  I am interested in quantiles at rank r for each of the m streams.  Only 1 sketch will operate simultaneously, saving/loading the sketch is not required (though it would be a nice feature), and sketches would not need to be merged (no serialization/deserialization).

Not surprisingly, it looks like your original suggestion of handling this on the C++ side is the way to go.  Once I have time to dive into the code, my plan is to write something that implements what you described in the earlier email.

Thanks,
Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Wednesday, May 6, 2020 10:43 PM
To: Michael Himes <mh...@knights.ucf.edu>>
Cc: dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>; Edo Liberty <ed...@gmail.com>>; edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>

Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Michael,

One of my colleagues, Jon Malkin, pointed out that the vector-KLL will not work for another reason: for each dimension, the choice of whether to delete the odd or even values in the compactor must be random and independent of the other dimensions.  Otherwise you might get unwanted correlation effects between the dimensions.

This is another argument that you should have independent compactors for each dimension.  So you might as well stick with individual sketches for each dimension.

Lee.

On Wed, May 6, 2020 at 4:39 PM Lee Rhodes <lr...@verizonmedia.com>> wrote:
Michael,

Allow me to back up for a moment to make sure I understand your problem.

You have a large number of large vectors of the form V_n = {x_i}:  n vectors of size m, where x is a number and x_i is the ith element, or equivalently, the ith dimension.

Assumptions:

  *   All vectors, V, are of the same size m.
  *   All elements, x_i, are valid numbers of the same type. No missing values, and if you are using floats, this means no NaNs.

In aggregate, the n vectors represent m independent distributions of values.

Your task is to be able to obtain m quantiles at rank r in a single query.

****
To do this, using your idea, would require vectorization of the entire sketch and not just the compactors.  The inputs are vectors, the result of operations such as getQuantile(r), getQuantileUpperBound(r), getQuantileLowerBound(r), are also vectors.

This sketch will be a large data structure, which leads to more questions ...

  *   Do you anticipate having many of these vectorized sketches operating simultaneously?
  *   Is there any requirement to store and later retrieve this sketch?
  *   Or, the nearly equivalent question: Do you require merging of these sketches (across clusters, for example)?  Which also means serialization and deserialization.

I am concerned that this vector-quantiles sketch would be limited in the sense that it may not be as widely applicable as it could be.

Our experience with real data is that it is ugly with missing values, NaN, nulls, etc.  Which means we would not be able to vectorize the compactor.  Each dimension i would need a separate independent compactor because the compaction times will vary depending on missing values or NaNs in the data.

Spacewise, I don't think having separate independent sketches for each dimension would be much smaller than vectorizing the entire sketch, because the internals of the existing sketch are already quite space efficient leveraging compact arrays, etc.

As a first step I would favor figuring out how to access the NumPy data structure on the C++ side, having individual sketches for each dimension, and doing the iterations updating the sketches in C++.   It also has the advantage of leveraging code that exists and it would automatically be able to leverage any improvements to the sketch code over time.  In addition, it could be a prototype of how to integrate other sketches into the NumPy ecosystem.

A fully vectorized sketch would be a separate implementation and would not be able to take advantage of these points.

Lee.

On Wed, May 6, 2020 at 2:47 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Lee,

I don't think there is a problem with the DataSketches library, just that it doesn't support what I am trying to do -- looking in the documentation, it only supports streams of ints or floats, and those situations work fine for me.  Here's what I did:
- began with the KLL test .py file: https://github.com/apache/incubator-datasketches-cpp/blob/master/python/tests/kll_test.py
- replaced line 30 with kll.update(np.ones(10) * randn())  to have a Numpy array of 10 identical values.
- ran the code

This leads to the following error, as expected:
TypeError: update(): incompatible function arguments. The following argument types are supported:
    1. (self: datasketches.kll_floats_sketch, item: float) -> None

Invoked with: <datasketches.kll_floats_sketch object at 0x7f1e128989d0>, array([-1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424,
       -1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424])

It's not coded to support Numpy arrays, therefore it complains.  What I would ideally like to have happen in this scenario is it would treat each element in the array as a separate stream.  Then, later when getting a given quantile, it would give 10 values, one for each stream.  I don't see an easy approach to implementing this on the Python side besides a very slow iterative approach, and admittedly my C++ is quite rusty so I haven't looked into the codebase to see how I might modify things there to support this functionality.

Re: the streaming-quantiles code being easily modified, I believe the only necessary changes would be changing the Compactor class to be a subclass of numpy.ndarray, rather than list, and implementing methods for the list-specific methods that are used, like .append().  Then, it isn't necessary to loop over the streams since we can make use of Numpy's broadcasting, which will handle the looping in its C++ code, as you mentioned.  I'll work on this and see if it really is as straight-forward as it seems.

If you have any advice on how to use DataSketches for my problem, I'm certainly open to that.

Thanks,
Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Wednesday, May 6, 2020 4:37 PM
To: Michael Himes <mh...@knights.ucf.edu>>; dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Cc: Edo Liberty <ed...@gmail.com>>; edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Michael,

Thank you for considering the DataSketches library.   I am adding this thread to our dev@datasketches.apache.org<ma...@datasketches.apache.org> so that our whole team can contribute to finding a solution for you.

WRT the error you experienced, please help us help you by sharing with us what the exact error was.

We are about to release a major upgrade to the DataSketches C++/Python product in the next few weeks.  We have fixed a number of stability issues and bugs, which may solve the problem.  Nonetheless, we want to work with you to get your problem solved.

Updating 1e5 sketches in a system is not a problem in Java or C++.   We have real-time systems today that generate and process over 1e9 sketches every day.  Unfortunately our experience tells us that looping in Python code will be 10 to 100 times slower than Java or C++.  This is because the code would have to switch from Python to C++ for every vector element.

By comparison, the streaming-quantiles code could be easily modified to use Numpy arrays and operate on vectors.

I would like to understand more about what you have in mind that would be "easily modified".

NumPy achieves its speed performance by doing all of the matrix operations in pre-compiled C++ code.  To achieve best performance, we would want to read and loop through the NumPy data structure on the C++ side leveraging the C++ DataSketches library directly.  I am not sure what would be involved to actually accomplish that.

But first we need to get your Python + NumPy code working correctly with our library so we can find out what its actual performance is.

Cheers,

Lee.





On Wed, May 6, 2020 at 12:10 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Edo, Lee,

Thanks for the prompt response.  I looked at the datasketches library, and while it seems to have a lot more features, it looks like it'll be a lot more difficult to get it to work for my desired use case.

My problem is that I need quantiles for each element of a vector (length on the order of 1e4 -- 1e5), for some finite stream of vectors (on the order of 1e6 -- 1e8).  I tried using datasketches's KLL with Numpy arrays, but it throws an error, so it doesn't seem like datasketches handles this situation currently.

To use datasketches, I think I would need to instantiate 1 object per vector element, and I suspect this will slow things down considerably due to iterating over the objects when each vector is processed.  By comparison, the streaming-quantiles code could be easily modified to use Numpy arrays and operate on vectors.  I ran a few unit tests on both codes and found equivalent behavior, as expected.

Do you have any recommendation(s) for this situation?  Are there known limitations of the streaming-quantiles code that would cause issues for my use case?  Are the other methods offered in datasketches 'better' than the KLL implemented in streaming-quantiles?  I'm quite out of my area of expertise, so I appreciate any advice you can offer, and I will of course acknowledge it in the publication.

Best,
Michael

________________________________
From: Edo Liberty <ed...@gmail.com>>
Sent: Tuesday, May 5, 2020 8:09 PM
To: Lee Rhodes <lr...@verizonmedia.com>>; Michael Himes <mh...@knights.ucf.edu>>
Cc: edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

+Lee

Hi Michael, Thanks for reaching out.
While you can certainly do that, I recommend using the Python-bound datasketches library. It will be more robust, faster, and bug free than my code :)

On Tue, May 5, 2020 at 14:11 Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Edo,

I'm currently working on a Python package for machine-learning-accelerated exoplanet modeling.  It is free and open source (see here if you're curious https://github.com/exosports/HOMER), and it's meant purely for reproducible academic research.

I'm adding some new features to the software, and one of them requires computing quantiles for a data set that cannot fit into memory.  After searching around for different methods to do this, your KLL method seemed to be a good option in terms of speed and space requirements.

Rather than reinvent the wheel and code my own implementation of the method from scratch, I was wondering if you'd be willing to allow me to use your code?  I don't see a license, so I wanted to make sure you're okay with this.  I could implement it as a submodule within my repo, or I could only include the kll.py file and add some additional comments pointing to your repo and such, whichever you prefer.

Best,
Michael
--
From my cell phone.

Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Posted by Lee Rhodes <lr...@verizonmedia.com.INVALID>.
If you do figure out how to do this, it would be great if you could share
it with us.  We would like to extend  it to other sketches and submit it as
> an added functionality to NumPy.  I have been looking at the NumPy and
SciPy libraries and have not found anything close to what we have.

Lee.


On Thu, May 7, 2020 at 7:08 AM Michael Himes <mh...@knights.ucf.edu> wrote:

> Hi Lee, Jon,
>
> Thanks for the information.  I tried to vectorize things this morning and
> ran into that exact problem -- since the offsets can differ, it leads to
> slices of different lengths, which wouldn't be possible to store as a
> single Numpy array.
>
> Lee, your understanding of my problem is spot on.  n vectors of size m,
> where all m elements of each vector are a float (no NaNs or missing
> values).  I am interested in quantiles at rank r for each of the m
> streams.  Only 1 sketch will operate simultaneously, saving/loading the
> sketch is not required (though it would be a nice feature), and sketches
> would not need to be merged (no serialization/deserialization).
>
> Not surprisingly, it looks like your original suggestion of handling this
> on the C++ side is the way to go.  Once I have time to dive into the code,
> my plan is to write something that implements what you described in the
> earlier email.
>
> Thanks,
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Wednesday, May 6, 2020 10:43 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>
> *Cc:* dev@datasketches.apache.org <de...@datasketches.apache.org>; Edo
> Liberty <ed...@gmail.com>; edo@edoliberty.com <ed...@edoliberty.com>
>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Michael,
>
> One of my colleagues, Jon Malkin, pointed out that the vector-KLL will not
> work for another reason: for each dimension, the choice of whether to
> delete the odd or even values in the compactor must be random and
> independent of the other dimensions.  Otherwise you might get unwanted
> correlation effects between the dimensions.
>
> This is another argument that you should have independent compactors for
> each dimension.  So you might as well stick with individual sketches for
> each dimension.
>
> Lee.
>
> On Wed, May 6, 2020 at 4:39 PM Lee Rhodes <lr...@verizonmedia.com>
> wrote:
>
> Michael,
>
> Allow me to back up for a moment to make sure I understand your problem.
>
> You have a large number of large vectors of the form *V_n = {x_i}:*  *n*
> vectors of size *m, *where *x* is a *number* and *x_i* is the *i*th
> element, or equivalently, the *i*th dimension.
>
> Assumptions:
>
>    - All vectors, *V*, are of the same size *m.*
>    - All elements, *x_i*, are valid numbers of the same type. No missing
>    values, and if you are using *floats*, this means no *NaN*s.
>
> In aggregate, the *n* vectors represent *m* *independent* distributions
> of values.
>
> Your task is to be able to obtain *m* quantiles at rank *r* in a single
> query.
>
> ****
> To do this, using your idea, would require vectorization of the entire
> sketch and not just the compactors.  The inputs are vectors, the result of
> operations such as getQuantile(r), getQuantileUpperBound(r),
> getQuantileLowerBound(r), are also vectors.
>
> This sketch will be a large data structure, which leads to more questions
> ...
>
>    - Do you anticipate having many of these vectorized sketches operating
>    simultaneously?
>    - Is there any requirement to store and later retrieve this sketch?
>    - Or, the nearly equivalent question: Do you require merging of these
>    sketches (across clusters, for example)?  Which also means serialization
>    and deserialization.
>
> I am concerned that this vector-quantiles sketch would be limited in the
> sense that it may not be as widely applicable as it could be.
>
> Our experience with real data is that it is ugly with missing values, NaN,
> nulls, etc.  Which means we would not be able to vectorize the compactor.
> Each dimension *i* would need a separate independent compactor because
> the compaction times will vary depending on missing values or NaNs in the
> data.
>
> Spacewise, I don't think having separate independent sketches for each
> dimension would be much smaller than vectorizing the entire sketch, because
> the internals of the existing sketch are already quite space efficient
> leveraging compact arrays, etc.
>
> As a first step I would favor figuring out how to access the NumPy data
> structure on the C++ side, having individual sketches for each
> dimension, and doing the iterations updating the sketches in C++.   It also
> has the advantage of leveraging code that exists and it would automatically
> be able to leverage any improvements to the sketch code over time.  In
> addition, it could be a prototype of how to integrate other sketches into
> the NumPy ecosystem.
>
> A fully vectorized sketch would be a separate implementation and would not
> be able to take advantage of these points.
>
> Lee.
>
> On Wed, May 6, 2020 at 2:47 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Lee,
>
> I don't think there is a problem with the DataSketches library, just that
> it doesn't support what I am trying to do -- looking in the documentation,
> it only supports streams of ints or floats, and those situations work fine
> for me.  Here's what I did:
> - began with the KLL test .py file:
> https://github.com/apache/incubator-datasketches-cpp/blob/master/python/tests/kll_test.py
> - replaced line 30 with kll.update(np.ones(10) * randn())  to have a Numpy
> array of 10 identical values.
> - ran the code
>
> This leads to the following error, as expected:
> TypeError: update(): incompatible function arguments. The following
> argument types are supported:
>     1. (self: datasketches.kll_floats_sketch, item: float) -> None
>
> Invoked with: <datasketches.kll_floats_sketch object at 0x7f1e128989d0>,
> array([-1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424,
>        -1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424])
>
> It's not coded to support Numpy arrays, therefore it complains.  What I
> would ideally like to have happen in this scenario is it would treat each
> element in the array as a separate stream.  Then, later when getting a
> given quantile, it would give 10 values, one for each stream.  I don't see
> an easy approach to implementing this on the Python side besides a very
> slow iterative approach, and admittedly my C++ is quite rusty so I haven't
> looked into the codebase to see how I might modify things there to support
> this functionality.
>
> Re: the streaming-quantiles code being easily modified, I believe the only
> necessary changes would be changing the Compactor class to be a subclass of
> numpy.ndarray, rather than list, and implementing methods for the
> list-specific methods that are used, like .append().  Then, it isn't
> necessary to loop over the streams since we can make use of Numpy's
> broadcasting, which will handle the looping in its C++ code, as you
> mentioned.  I'll work on this and see if it really is as straight-forward
> as it seems.
>
> If you have any advice on how to use DataSketches for my problem, I'm
> certainly open to that.
>
> Thanks,
> Michael
> ------------------------------
> *From:* Lee Rhodes <lr...@verizonmedia.com>
> *Sent:* Wednesday, May 6, 2020 4:37 PM
> *To:* Michael Himes <mh...@knights.ucf.edu>; dev@datasketches.apache.org
> <de...@datasketches.apache.org>
> *Cc:* Edo Liberty <ed...@gmail.com>; edo@edoliberty.com <
> edo@edoliberty.com>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Michael,
>
> Thank you for considering the DataSketches library.   I am adding this
> thread to our dev@datasketches.apache.org so that our whole team can
> contribute to finding a solution for you.
>
> WRT the error you experienced, please help us help you by sharing with us
> what the exact error was.
>
> We are about to release a major upgrade to the DataSketches C++/Python
> product in the next few weeks.  We have fixed a number of stability issues
> and bugs, which may solve the problem.  Nonetheless, we want to work with
> you to get your problem solved.
>
> Updating 1e5 sketches in a system is not a problem in Java or C++.   We
> have real-time systems today that generate and process over 1e9 sketches
> every day.  Unfortunately our experience tells us that looping in Python
> code will be 10 to 100 times slower than Java or C++.  This is because the
> code would have to switch from Python to C++ for every vector element.
>
> By comparison, the streaming-quantiles code could be easily modified to
> use Numpy arrays and operate on vectors.
>
>
> I would like to understand more about what you have in mind that would be
> "easily modified".
>
> NumPy achieves its speed performance by doing all of the matrix operations
> in pre-compiled C++ code.  To achieve best performance, we would want to
> read and loop through the NumPy data structure on the C++ side leveraging
> the C++ DataSketches library directly.  I am not sure what would be
> involved to actually accomplish that.
>
> But first we need to get your Python + NumPy code working correctly with
> our library so we can find out what its actual performance is.
>
> Cheers,
>
> Lee.
>
>
>
>
>
> On Wed, May 6, 2020 at 12:10 PM Michael Himes <mh...@knights.ucf.edu>
> wrote:
>
> Hi Edo, Lee,
>
> Thanks for the prompt response.  I looked at the datasketches library, and
> while it seems to have a lot more features, it looks like it'll be a lot
> more difficult to get it to work for my desired use case.
>
> My problem is that I need quantiles for each element of a vector (length
> on the order of 1e4 -- 1e5), for some finite stream of vectors (on the
> order of 1e6 -- 1e8).  I tried using datasketches's KLL with Numpy arrays,
> but it throws an error, so it doesn't seem like datasketches handles this
> situation currently.
>
> To use datasketches, I think I would need to instantiate 1 object per
> vector element, and I suspect this will slow things down considerably due
> to iterating over the objects when each vector is processed.  By
> comparison, the streaming-quantiles code could be easily modified to use
> Numpy arrays and operate on vectors.  I ran a few unit tests on both codes
> and found equivalent behavior, as expected.
>
> Do you have any recommendation(s) for this situation?  Are there known
> limitations of the streaming-quantiles code that would cause issues for my
> use case?  Are the other methods offered in datasketches 'better' than the
> KLL implemented in streaming-quantiles?  I'm quite out of my area of
> expertise, so I appreciate any advice you can offer, and I will of course
> acknowledge it in the publication.
>
> Best,
> Michael
>
> ------------------------------
> *From:* Edo Liberty <ed...@gmail.com>
> *Sent:* Tuesday, May 5, 2020 8:09 PM
> *To:* Lee Rhodes <lr...@verizonmedia.com>; Michael Himes <
> mhimes@knights.ucf.edu>
> *Cc:* edo@edoliberty.com <ed...@edoliberty.com>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> +Lee
>
> Hi Michael, Thanks for reaching out.
> While you can certainly do that, I recommend using the Python-bound
> datasketches library. It will be more robust, faster, and bug free than my
> code :)
>
> On Tue, May 5, 2020 at 14:11 Michael Himes <mh...@knights.ucf.edu> wrote:
>
> Hi Edo,
>
> I'm currently working on a Python package for machine-learning-accelerated
> exoplanet modeling.  It is free and open source (see here if you're curious
> https://github.com/exosports/HOMER),
> and it's meant purely for reproducible academic research.
>
> I'm adding some new features to the software, and one of them requires
> computing quantiles for a data set that cannot fit into memory.  After
> searching around for different methods to do this, your KLL method seemed
> to be a good option in terms of speed and space requirements.
>
> Rather than reinvent the wheel and code my own implementation of the
> method from scratch, I was wondering if you'd be willing to allow me to use
> your code?  I don't see a license, so I wanted to make sure you're okay
> with this.  I could implement it as a submodule within my repo, or I could
> only include the kll.py file and add some additional comments pointing to
> your repo and such, whichever you prefer.
>
> Best,
> Michael
>
> --
From my cell phone.

Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Posted by Michael Himes <mh...@knights.ucf.edu>.
Hi Lee, Jon,

Thanks for the information.  I tried to vectorize things this morning and ran into that exact problem -- since the offsets can differ, it leads to slices of different lengths, which wouldn't be possible to store as a single Numpy array.

Lee, your understanding of my problem is spot on.  n vectors of size m, where all m elements of each vector are a float (no NaNs or missing values).  I am interested in quantiles at rank r for each of the m streams.  Only 1 sketch will operate simultaneously, saving/loading the sketch is not required (though it would be a nice feature), and sketches would not need to be merged (no serialization/deserialization).

Not surprisingly, it looks like your original suggestion of handling this on the C++ side is the way to go.  Once I have time to dive into the code, my plan is to write something that implements what you described in the earlier email.

Thanks,
Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>
Sent: Wednesday, May 6, 2020 10:43 PM
To: Michael Himes <mh...@knights.ucf.edu>
Cc: dev@datasketches.apache.org <de...@datasketches.apache.org>; Edo Liberty <ed...@gmail.com>; edo@edoliberty.com <ed...@edoliberty.com>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Michael,

One of my colleagues, Jon Malkin, pointed out that the vector-KLL will not work for another reason: for each dimension, the choice of whether to delete the odd or even values in the compactor must be random and independent of the other dimensions.  Otherwise you might get unwanted correlation effects between the dimensions.

This is another argument that you should have independent compactors for each dimension.  So you might as well stick with individual sketches for each dimension.

Lee.

On Wed, May 6, 2020 at 4:39 PM Lee Rhodes <lr...@verizonmedia.com>> wrote:
Michael,

Allow me to back up for a moment to make sure I understand your problem.

You have a large number of large vectors of the form V_n = {x_i}:  n vectors of size m, where x is a number and x_i is the ith element, or equivalently, the ith dimension.

Assumptions:

  *   All vectors, V, are of the same size m.
  *   All elements, x_i, are valid numbers of the same type. No missing values, and if you are using floats, this means no NaNs.

In aggregate, the n vectors represent m independent distributions of values.

Your task is to be able to obtain m quantiles at rank r in a single query.

****
To do this, using your idea, would require vectorization of the entire sketch and not just the compactors.  The inputs are vectors, the result of operations such as getQuantile(r), getQuantileUpperBound(r), getQuantileLowerBound(r), are also vectors.

This sketch will be a large data structure, which leads to more questions ...

  *   Do you anticipate having many of these vectorized sketches operating simultaneously?
  *   Is there any requirement to store and later retrieve this sketch?
  *   Or, the nearly equivalent question: Do you require merging of these sketches (across clusters, for example)?  Which also means serialization and deserialization.

I am concerned that this vector-quantiles sketch would be limited in the sense that it may not be as widely applicable as it could be.

Our experience with real data is that it is ugly with missing values, NaN, nulls, etc.  Which means we would not be able to vectorize the compactor.  Each dimension i would need a separate independent compactor because the compaction times will vary depending on missing values or NaNs in the data.

Spacewise, I don't think having separate independent sketches for each dimension would be much smaller than vectorizing the entire sketch, because the internals of the existing sketch are already quite space efficient leveraging compact arrays, etc.

As a first step I would favor figuring out how to access the NumPy data structure on the C++ side, having individual sketches for each dimension, and doing the iterations updating the sketches in C++.   It also has the advantage of leveraging code that exists and it would automatically be able to leverage any improvements to the sketch code over time.  In addition, it could be a prototype of how to integrate other sketches into the NumPy ecosystem.

A fully vectorized sketch would be a separate implementation and would not be able to take advantage of these points.

Lee.

On Wed, May 6, 2020 at 2:47 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Lee,

I don't think there is a problem with the DataSketches library, just that it doesn't support what I am trying to do -- looking in the documentation, it only supports streams of ints or floats, and those situations work fine for me.  Here's what I did:
- began with the KLL test .py file: https://github.com/apache/incubator-datasketches-cpp/blob/master/python/tests/kll_test.py
- replaced line 30 with kll.update(np.ones(10) * randn())  to have a Numpy array of 10 identical values.
- ran the code

This leads to the following error, as expected:
TypeError: update(): incompatible function arguments. The following argument types are supported:
    1. (self: datasketches.kll_floats_sketch, item: float) -> None

Invoked with: <datasketches.kll_floats_sketch object at 0x7f1e128989d0>, array([-1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424,
       -1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424])

It's not coded to support Numpy arrays, therefore it complains.  What I would ideally like to have happen in this scenario is it would treat each element in the array as a separate stream.  Then, later when getting a given quantile, it would give 10 values, one for each stream.  I don't see an easy approach to implementing this on the Python side besides a very slow iterative approach, and admittedly my C++ is quite rusty so I haven't looked into the codebase to see how I might modify things there to support this functionality.

Re: the streaming-quantiles code being easily modified, I believe the only necessary changes would be changing the Compactor class to be a subclass of numpy.ndarray, rather than list, and implementing methods for the list-specific methods that are used, like .append().  Then, it isn't necessary to loop over the streams since we can make use of Numpy's broadcasting, which will handle the looping in its C++ code, as you mentioned.  I'll work on this and see if it really is as straight-forward as it seems.

If you have any advice on how to use DataSketches for my problem, I'm certainly open to that.

Thanks,
Michael
________________________________
From: Lee Rhodes <lr...@verizonmedia.com>>
Sent: Wednesday, May 6, 2020 4:37 PM
To: Michael Himes <mh...@knights.ucf.edu>>; dev@datasketches.apache.org<ma...@datasketches.apache.org> <de...@datasketches.apache.org>>
Cc: Edo Liberty <ed...@gmail.com>>; edo@edoliberty.com<ma...@edoliberty.com> <ed...@edoliberty.com>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Michael,

Thank you for considering the DataSketches library.   I am adding this thread to our dev@datasketches.apache.org<ma...@datasketches.apache.org> so that our whole team can contribute to finding a solution for you.

WRT the error you experienced, please help us help you by sharing with us what the exact error was.

We are about to release a major upgrade to the DataSketches C++/Python product in the next few weeks.  We have fixed a number of stability issues and bugs, which may solve the problem.  Nonetheless, we want to work with you to get your problem solved.

Updating 1e5 sketches in a system is not a problem in Java or C++.   We have real-time systems today that generate and process over 1e9 sketches every day.  Unfortunately our experience tells us that looping in Python code will be 10 to 100 times slower than Java or C++.  This is because the code would have to switch from Python to C++ for every vector element.

By comparison, the streaming-quantiles code could be easily modified to use Numpy arrays and operate on vectors.

I would like to understand more about what you have in mind that would be "easily modified".

NumPy achieves its speed performance by doing all of the matrix operations in pre-compiled C++ code.  To achieve best performance, we would want to read and loop through the NumPy data structure on the C++ side leveraging the C++ DataSketches library directly.  I am not sure what would be involved to actually accomplish that.

But first we need to get your Python + NumPy code working correctly with our library so we can find out what its actual performance is.

Cheers,

Lee.





On Wed, May 6, 2020 at 12:10 PM Michael Himes <mh...@knights.ucf.edu>> wrote:
Hi Edo, Lee,

Thanks for the prompt response.  I looked at the datasketches library, and while it seems to have a lot more features, it looks like it'll be a lot more difficult to get it to work for my desired use case.

My problem is that I need quantiles for each element of a vector (length on the order of 1e4 -- 1e5), for some finite stream of vectors (on the order of 1e6 -- 1e8).  I tried using datasketches's KLL with Numpy arrays, but it throws an error, so it doesn't seem like datasketches handles this situation currently.

To use datasketches, I think I would need to instantiate 1 object per vector element, and I suspect this will slow things down considerably due to iterating over the objects when each vector is processed.  By comparison, the streaming-quantiles code could be easily modified to use Numpy arrays and operate on vectors.  I ran a few unit tests on both codes and found equivalent behavior, as expected.

Do you have any recommendation(s) for this situation?  Are there known limitations of the streaming-quantiles code that would cause issues for my use case?  Are the other methods offered in datasketches 'better' than the KLL implemented in streaming-quantiles?  I'm quite out of my area of expertise, so I appreciate any advice you can offer, and I will of course acknowledge it in the publication.

Best,
Michael

________________________________
From: Edo Liberty <ed...@gmail.com>
Sent: Tuesday, May 5, 2020 8:09 PM
To: Lee Rhodes <lr...@verizonmedia.com>; Michael Himes <mh...@knights.ucf.edu>
Cc: edo@edoliberty.com <ed...@edoliberty.com>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software

+Lee

Hi Michael, Thanks for reaching out.
While you can certainly do that, I recommend using the Python-bound datasketches library. It will be more robust, faster, and more bug-free than my code :)

On Tue, May 5, 2020 at 14:11 Michael Himes <mh...@knights.ucf.edu> wrote:
Hi Edo,

I'm currently working on a Python package for machine-learning-accelerated exoplanet modeling.  It is free and open source (see here if you're curious https://github.com/exosports/HOMER), and it's meant purely for reproducible academic research.

I'm adding some new features to the software, and one of them requires computing quantiles for a data set that cannot fit into memory.  After searching around for different methods to do this, your KLL method seemed to be a good option in terms of speed and space requirements.

Rather than reinvent the wheel and code my own implementation of the method from scratch, I was wondering if you'd be willing to allow me to use your code?  I don't see a license, so I wanted to make sure you're okay with this.  I could implement it as a submodule within my repo, or I could only include the kll.py file and add some additional comments pointing to your repo and such, whichever you prefer.

Best,
Michael

Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Posted by Lee Rhodes <lr...@verizonmedia.com.INVALID>.
Michael,

One of my colleagues, Jon Malkin, pointed out that the vector-KLL will not
work for another reason: for each dimension, the choice of whether to
delete the odd or even values in the compactor must be random and
independent of the other dimensions.  Otherwise you might get unwanted
correlation effects between the dimensions.

This is another argument that you should have independent compactors for
each dimension.  So you might as well stick with individual sketches for
each dimension.

Lee.
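
For concreteness, a minimal sketch of that per-dimension approach with the
existing Python binding might look something like the following.  This is
illustrative only: it assumes the kll_floats_sketch constructor accepts a k
parameter and that get_quantile() is available (update(float) is the
signature shown in the error message quoted later in this thread), and the
Python-level loop over dimensions is exactly the per-element overhead
discussed earlier.

    import numpy as np
    from datasketches import kll_floats_sketch

    m = 100    # number of dimensions (vector length)
    k = 200    # assumed accuracy parameter of the sketch

    # One independent sketch per dimension.
    sketches = [kll_floats_sketch(k) for _ in range(m)]

    def update_with_vector(vec):
        # Python-level loop: one Python-to-C++ crossing per element.
        for i, x in enumerate(vec):
            sketches[i].update(float(x))

    def quantiles_at_rank(r):
        # One quantile estimate per dimension, all at rank r.
        return np.array([s.get_quantile(r) for s in sketches])

    for _ in range(1000):
        update_with_vector(np.random.randn(m))

    medians = quantiles_at_rank(0.5)   # approximate per-dimension medians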


Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Posted by Jon Malkin <jo...@gmail.com>.
I think the key here is that for an m-dimensional vector you'd need to
ensure that you have m independent offsets when doing each compaction. You
can't share bits between dimensions for that (beyond the fact that roughly
half of them will match by chance, since there are only 2 options, of course).

  jon
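
As a rough NumPy illustration of that point (not code from either library):

    import numpy as np

    m = 5   # dimensions

    # Independent offsets: a separate random bit per dimension decides
    # whether the odd- or even-indexed items survive that dimension's
    # compaction.
    offsets = np.random.randint(0, 2, size=m)    # e.g. [0 1 1 0 1]

    # A single shared bit would make the same positions survive in every
    # dimension -- the unwanted correlation described above.
    shared = np.full(m, np.random.randint(2))    # e.g. [1 1 1 1 1]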


Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Posted by Lee Rhodes <lr...@verizonmedia.com.INVALID>.
Michael,

Allow me to back up for a moment to make sure I understand your problem.

You have a large number of large vectors of the form *V_n = {x_i}*: *n*
vectors of size *m*, where *x* is a *number* and *x_i* is the *i*th
element, or equivalently, the *i*th dimension.

Assumptions:

   - All vectors, *V*, are of the same size *m.*
   - All elements, *x_i*, are valid numbers of the same type. No missing
   values, and if you are using *floats*, this means no *NaN*s.

In aggregate, the *n* vectors represent *m* *independent* distributions of
values.

Your task is to be able to obtain *m* quantiles at rank *r* in a single
query.
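
In other words, the answer being approximated is what an exact, in-memory
computation would give, along the lines of this small NumPy stand-in:

    import numpy as np

    # All n vectors stacked as an (n, m) array.  In the real problem this
    # cannot fit in memory, which is why a sketch is needed; the sketch's
    # answer should approximate this exact result.
    data = np.random.randn(10_000, 50)
    r = 0.5
    exact = np.quantile(data, r, axis=0)   # m quantiles at rank r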

****
Doing this with your idea would require vectorization of the entire
sketch and not just the compactors.  The inputs are vectors, and the results
of operations such as getQuantile(r), getQuantileUpperBound(r), and
getQuantileLowerBound(r) are also vectors.

This sketch will be a large data structure, which leads to more questions
...

   - Do you anticipate having many of these vectorized sketches operating
   simultaneously?
   - Is there any requirement to store and later retrieve this sketch?
   - Or, the nearly equivalent question: Do you require merging of these
   sketches (across clusters, for example)?  Which also means serialization
   and deserialization.

I am concerned that this vector-quantiles sketch would be limited in the
sense that it may not be as widely applicable as it could be.

Our experience with real data is that it is ugly with missing values, NaN,
nulls, etc., which means we would not be able to vectorize the compactor.
Each dimension *i* would need a separate independent compactor because the
compaction times will vary depending on missing values or NaNs in the data.
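
For example (a small NumPy illustration, not from either codebase), with
NaNs present the number of valid updates differs per dimension, so
per-dimension compactors fill and compact on different schedules:

    import numpy as np

    vec = np.array([0.3, np.nan, 1.7, np.nan, 2.2])
    valid = ~np.isnan(vec)
    # Only dimensions 0, 2, and 4 receive an update from this vector, so
    # their compactors grow (and eventually compact) at a different rate
    # than dimensions 1 and 3.
    print(np.flatnonzero(valid))   # -> [0 2 4]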

Spacewise, I don't think there would be much difference between having
separate independent sketches for each dimension and vectorizing the
entire sketch, because the internals of the existing sketch are already
quite space-efficient, leveraging compact arrays, etc.

As a first step I would favor figuring out how to access the NumPy data
structure on the C++ side, having individual sketches for each dimension,
and doing the iteration that updates the sketches in C++.  This approach
also has the advantage of leveraging code that already exists, and it would
automatically benefit from any improvements to the sketch code over time.
In addition, it could be a prototype of how to integrate other sketches
into the NumPy ecosystem.

A fully vectorized sketch would be a separate implementation and would not
be able to take advantage of these points.

Lee.

On Wed, May 6, 2020 at 2:47 PM Michael Himes <mh...@knights.ucf.edu> wrote:

> Hi Lee,
>
> I don't think there is a problem with the DataSketches library, just that
> it doesn't support what I am trying to do -- looking in the documentation,
> it only supports streams of ints or floats, and those situations work fine
> for me.  Here's what I did:
> - began with the KLL test .py file:
> https://github.com/apache/incubator-datasketches-cpp/blob/master/python/tests/kll_test.py
> - replaced line 30 with kll.update(np.ones(10) * randn())  to have a Numpy
> array of 10 identical values.
> - ran the code
>
> This leads to the following error, as expected:
> TypeError: update(): incompatible function arguments. The following
> argument types are supported:
>     1. (self: datasketches.kll_floats_sketch, item: float) -> None
>
> Invoked with: <datasketches.kll_floats_sketch object at 0x7f1e128989d0>,
> array([-1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424,
>        -1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424])
>
> It's not coded to support Numpy arrays, therefore it complains.  What I
> would ideally like to have happen in this scenario is it would treat each
> element in the array as a separate stream.  Then, later when getting a
> given quantile, it would give 10 values, one for each stream.  I don't see
> an easy approach to implementing this on the Python side besides a very
> slow iterative approach, and admittedly my C++ is quite rusty so I haven't
> looked into the codebase to see how I might modify things there to support
> this functionality.
>
> Re: the streaming-quantiles code being easily modified, I believe the only
> necessary changes would be changing the Compactor class to be a subclass of
> numpy.ndarray, rather than list, and implementing the list-specific
> methods that are used, like .append().  Then, it isn't
> necessary to loop over the streams since we can make use of Numpy's
> broadcasting, which will handle the looping in its C++ code, as you
> mentioned.  I'll work on this and see if it really is as straightforward
> as it seems.
>
> If you have any advice on how to use DataSketches for my problem, I'm
> certainly open to that.
>
> Thanks,
> Michael
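
For reference, here is a rough sketch of the kind of vectorized compaction
step described in the message quoted above.  It is illustrative only: it is
not code from the streaming-quantiles repository, it uses a plain 2-D array
wrapper rather than a literal numpy.ndarray subclass, and it folds in the
per-dimension independent offsets discussed earlier in the thread.

    import numpy as np

    class VectorCompactor:
        """Holds an (n_items, m) buffer so one compaction serves all m streams."""

        def __init__(self, m):
            self.m = m
            self.buf = np.empty((0, m))

        def append(self, vec):
            # Stand-in for the list .append() the original Compactor relies on.
            row = np.asarray(vec, dtype=float).reshape(1, self.m)
            self.buf = np.vstack([self.buf, row])

        def compact(self):
            # Compact an even number of buffered items, as in the usual step.
            n_even = (self.buf.shape[0] // 2) * 2
            s = np.sort(self.buf[:n_even], axis=0)           # sort each dimension
            offsets = np.random.randint(0, 2, size=self.m)   # independent bit per dimension
            promoted = np.stack(
                [s[offsets[j]::2, j] for j in range(self.m)], axis=1)
            self.buf = self.buf[n_even:]
            return promoted   # items to hand to the next compactor level

Whether subclassing numpy.ndarray directly, as proposed, works out cleanly
in practice would still need to be checked against the rest of the kll.py
logic.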