Posted to general@incubator.apache.org by le...@gmail.com, le...@gmail.com on 2019/02/23 06:21:15 UTC

DataSketches Proposal

Thanks for the offer.  I am a neophyte at both this process and this email app!   I could use a lot of help getting this off the ground!  Also, I'm not sure that Mr. Chen and Mr. Onofré have fully accepted taking this on :)

Lee.

On 2019/02/23 06:03:58, Kenneth Knowles <ke...@apache.org> wrote: 
> Nice.
> 
> I would very much like to help mentor this project, though you already have
> a couple good ones.
> 
> I concur with incubator as sponsoring entity.
> 
> Kenn (VP Apache Beam)
> 
> On Fri, Feb 22, 2019 at 9:45 PM leerho <le...@gmail.com> wrote:
> 
> > I didn't realize that this mail list does not accept PDF files, apparently
> > only text.  So let me try one more time ... :)  Please let me know if
> > this works!
> >
> >
> > = Apache DataSketches Proposal[1] =
> >
> > == Abstract ==
> >
> > DataSketches.GitHub.io is an open source, high-performance library of
> > stochastic streaming algorithms commonly called "sketches" in the data
> > sciences. Sketches are small, stateful programs that process massive data
> > as a stream and can provide approximate answers, with mathematical
> > guarantees, to computationally difficult queries orders-of-magnitude faster
> > than traditional, exact methods.
> >
> > This proposal is to move DataSketches to the Apache Software
> > Foundation (ASF), transferring ownership of its copyright intellectual
> > property to the ASF.  Thereafter, DataSketches would be officially known as
> > Apache DataSketches and its evolution and governance would come under the
> > rules and guidance of the ASF.
> >
> > == Introduction ==
> >
> > The DataSketches library contains carefully crafted implementations of
> > sketch algorithms that meet rigorous standards of quality and performance
> > and provide capabilities required for large-scale production systems that
> > must process and analyze massive data. The DataSketches core repository is
> > written in Java with a parallel core repository written in C++ that
> > includes Python wrappers. The DataSketches library also includes special
> > repositories for extending the core library for Apache Hive and Apache Pig.
> > The sketches developed in the different languages share a common binary
> > storage format so that sketches created and stored in Java, for example,
> > can be fully used in C++, and vice versa.  Because the stored sketch
> > "images" are just a "blob" of bytes (similar to picture images), they can
> > be shared across many different systems, languages and platforms.
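
As a rough illustration of the "blob of bytes" idea above, the following Python sketch packs a toy sketch state into a byte string and reads it back. The layout (a `TOYS` magic marker, a version byte, `k`, a count, and 64-bit hash values) and the names `to_bytes` and `from_bytes` are assumptions invented for this example; they are not the DataSketches storage format or API.

```python
import struct

# Hypothetical layout, NOT the DataSketches format:
# magic (4 bytes) | version (uint8) | k (uint16) | count (uint32) | count x uint64
HEADER = struct.Struct("<4sBHI")  # little-endian, no padding
MAGIC = b"TOYS"
VERSION = 1


def to_bytes(k, hashes):
    """Serialize a toy sketch state (k and its retained hashes) to bytes."""
    ordered = sorted(hashes)
    header = HEADER.pack(MAGIC, VERSION, k, len(ordered))
    return header + struct.pack(f"<{len(ordered)}Q", *ordered)


def from_bytes(blob):
    """Rebuild (k, hashes) from an image produced by to_bytes."""
    magic, version, k, count = HEADER.unpack_from(blob, 0)
    if magic != MAGIC or version != VERSION:
        raise ValueError("unrecognized sketch image")
    hashes = struct.unpack_from(f"<{count}Q", blob, HEADER.size)
    return k, list(hashes)


image = to_bytes(16, [5, 3, 2**63])
print(len(image), from_bytes(image))
```

Because the layout is fixed and explicitly little-endian, any language that can read bytes could decode the same image, which is the property the proposal relies on for sharing sketches between Java and C++.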
> >
> > The DataSketches documentation website, https://datasketches.github.io ,
> > includes general tutorials, a comprehensive research section with
> > references to relevant academic papers, extensive examples for using the
> > core library directly as well as examples for accessing the library in
> > Hive, Pig, and Apache Spark.
> >
> > The DataSketches library also includes a characterization repository for
> > long running test programs that are used for studying accuracy and
> > performance of these sketches over wide ranges of input variables. The data
> > produced by these programs is used for generating the many performance
> > plots contained in the documentation website and for academic
> > publications.
> >
> > The code repositories used for production are versioned and published to
> > Maven Central on periodic intervals as the library evolves.
> >
> > The DataSketches library also includes several experimental repositories
> > for use-cases outside the large-scale systems environments, such as
> > sketches for mobile, IoT devices (Android), command-line access of the
> > sketch library, and an experimental repository for vector-based sketches
> > that performs approximate Singular Value Decomposition (SVD) analysis that
> > could potentially be used in Machine Learning (ML) applications.
> >
> > == Background ==
> >
> > The DataSketches library was started in 2012 as an internal Yahoo project to
> > dramatically reduce time and resources required for distinct (unique)
> > counting.  An extensive search on the Internet at the time yielded a number
> > of theoretical papers on stochastic streaming algorithms with pseudocode
> > examples, but we did not find any usable open-source code of the quality we
> > felt we needed for our internal production systems.  So we started a small
> > project (one person) to develop our own sketches working directly from
> > published theoretical papers.
> >
> > The DataSketches library was designed from the start with the objective of
> > making these algorithms, usually only described in theoretical papers,
> > easily accessible to systems developers for use in our internal production
> > systems. By necessity, the code had to be of the highest quality and
> > thoroughly tested. The wide variety of our internal production systems
> > drove the requirement that the sketch implementations had to have an
> > absolute minimum of external, run-time dependencies in order to simplify
> > integration and troubleshooting.
> >
> > Our internal experiments demonstrated dramatic positive impact on the
> > performance of our systems.  As a result, the DataSketches library quickly
> > evolved to include different types of sketches for different types of
> > queries, such as frequent-items (a.k.a. heavy-hitters) algorithms,
> > quantile/histogram algorithms, and weighted and unweighted sampling
> > algorithms.
> >
> > We quickly discovered that developing these sketch algorithms to be truly
> > robust in production environments is quite difficult and requires deep
> > understanding of the underlying mathematics and statistics as well as
> > extensive experience in developing high quality code for 24/7 production
> > systems. This is a difficult combination of skills for any one organization
> > to collect and maintain over time. It became clear that this technology
> > needed a community larger than Yahoo to evolve.  In November, 2015, this
> > factor, along with Yahoo’s strong experience and support of open source,
> > led to the decision to open source this technology under an Apache 2.0
> > license on GitHub. Since that time our community has expanded considerably
> > and the key contributors to this effort include leading research
> > scientists from a number of universities as well as practitioners and
> > researchers from a number of major corporations. The core of this group is
> > very active as we meet weekly to discuss research directions and
> > engineering priorities.
> >
> > It is important to note that our internal systems at Yahoo use the current
> > public GitHub open source DataSketches library and not an internal version
> > of the code.
> >
> > The close collaboration of scientific research and engineering development
> > experience with actual massive-data processing systems has also produced
> > new research publications in the field of stochastic streaming algorithms,
> > for example:
> >
> > * Daniel Anderson, Pryce Bevan, Kevin J. Lang, Edo Liberty, Lee Rhodes, and
> > Justin Thaler. A high-performance algorithm for identifying frequent items
> > in data streams. In ACM IMC 2017.
> >
> > * Anirban Dasgupta, Kevin J. Lang, Lee Rhodes, and Justin Thaler. A
> > framework for estimating stream expression cardinalities. In EDBT/ICDT
> > Proceedings ‘16, pages 6:1–6:17, 2016.
> >
> > * Mina Ghashami, Edo Liberty, Jeff M. Phillips. Efficient Frequent
> > Directions Algorithm for Sparse Matrices. In ACM SIGKDD Proceedings ‘16,
> > pages 845–854, 2016.
> >
> > * Zohar S. Karnin, Kevin J. Lang, and Edo Liberty. Optimal quantile
> > approximation in streams. In IEEE FOCS Proceedings ‘16, pages 71–78, 2016.
> >
> > * Kevin J Lang. Back to the future: an even more nearly optimal cardinality
> > estimation algorithm. arXiv preprint https://arxiv.org/abs/1708.06839,
> > 2017.
> >
> > * Edo Liberty. Simple and deterministic matrix sketching. In ACM KDD
> > Proceedings ‘13, pages 581–588, 2013.
> >
> > * Edo Liberty, Michael Mitzenmacher, Justin Thaler, and Jonathan Ullman.
> > Space lower bounds for itemset frequency sketches. In ACM PODS Proceedings
> > ‘16, pages 441–454, 2016.
> >
> > * Michael Mitzenmacher, Thomas Steinke, and Justin Thaler. Hierarchical
> > heavy hitters with the space saving algorithm. In SIAM ALENEX Proceedings
> > ‘12, pages 160–174, 2012.
> >
> > == The Rationale for Sketches ==
> >
> > In the analysis of big data there are often problem queries that don’t
> > scale because they require huge compute resources and time to generate
> > exact results. Examples include count distinct, quantiles, most frequent
> > items, joins, matrix computations, and graph analysis.
> >
> > If we can loosen the requirement of “exact” results from our queries and be
> > satisfied with approximate results, within some well understood bounds of
> > error, there is an entire branch of mathematics and data science that has
> > evolved around developing algorithms that can produce approximate results
> > with mathematically well-defined error properties.
> >
> > Adding the requirements that these algorithms be small (compared to the
> > size of the input data), sublinear (the size of the sketch must grow at a
> > slower rate than the size of the input stream), streaming (they can only
> > touch each data item once), and mergeable (suitable for distributed
> > processing) defines a class of algorithms that can be described as small,
> > stochastic, streaming, sublinear, mergeable algorithms, commonly called
> > sketches (they also have other names, but we will use the term sketches
> > from here on).
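
To make the four properties above concrete, here is a minimal Python sketch of a K-Minimum-Values (KMV) distinct-count sketch. It is a textbook technique chosen for brevity, not the DataSketches implementation; the class name `KMVSketch`, its methods, and the default `k` are hypothetical.

```python
import hashlib


class KMVSketch:
    """Toy K-Minimum-Values distinct-count sketch (illustrative only)."""

    def __init__(self, k=256):
        self.k = k
        self.mins = set()  # small: at most k hash values, each in [0, 1]

    @staticmethod
    def _hash(item):
        # Map any item to a pseudo-uniform float via SHA-256.
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2.0**64

    def update(self, item):
        # Streaming: each item is touched once and then forgotten.
        h = self._hash(item)
        if h in self.mins:
            return  # duplicate item, state is unchanged
        if len(self.mins) < self.k:
            self.mins.add(h)
            return
        largest = max(self.mins)
        if h < largest:
            # Sublinear: state stays at k values however long the stream is.
            self.mins.remove(largest)
            self.mins.add(h)

    def merge(self, other):
        # Mergeable: the k smallest of the union summarize both streams.
        merged = KMVSketch(self.k)
        merged.mins = set(sorted(self.mins | other.mins)[: self.k])
        return merged

    def estimate(self):
        if len(self.mins) < self.k:
            return float(len(self.mins))  # fewer than k distinct items: exact
        return (self.k - 1) / max(self.mins)


a, b = KMVSketch(), KMVSketch()
for i in range(0, 6000):
    a.update(i)
for i in range(4000, 10000):
    b.update(i)
print(round(a.merge(b).estimate()))  # approximates the 10000 distinct items
```

The two sketches here stand in for partitions processed on different machines; merging them gives the same state as a single sketch fed the whole stream, which is why this style of algorithm suits distributed processing.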
> >
> > To be truly streaming and be able to process data in a single pass,
> > sketches must make absolute minimum assumptions about the input stream.
> > This is critically important, as there is no “second chance” to process the
> > data.
> >
> > For example, sketches should not make assumptions about the order of stream
> > items, the stream length, the dynamic range of values, or the distribution
> > of item occurrence frequencies. Sketches should be tolerant of NaNs, Nulls
> > and empty objects. About the only thing that the sketch needs to know about
> > the stream is how to extract items from it and what type the item is, e.g.,
> > is it a numeric value or a string.
> >
> > As far as the sketch is concerned, the input stream is a sequence of items
> > in some unknown random order with unknown random values.
> >
> > The sketch is essentially a complex state machine that, combined with the
> > random input stream, defines a stochastic process. We then apply
> > probabilistic methods to interpret the states of the stochastic process in
> > order to extract useful information about the input stream itself. The
> > resulting information will be approximate, but we also use additional
> > probabilistic methods to extract an estimate of the likely probability
> > distribution of error.
> >
> > There is a significant scientific contribution here in defining the
> > state machine, understanding the resulting stochastic process, developing
> > the probabilistic methods, and proving mathematically that it all works!
> > This is why the scientific contributors to this project are a critical and
> > strategic component to our success.  The development engineers translate
> > the concepts of the proposed state machine and probabilistic methods into
> > production-quality code. Even more important, they work closely with the
> > scientists, feeding back system and user requirements, which leads not only
> > to superior product design, but to new science as well.  A number of the
> > scientific papers our members have published (see above) are a direct
> > result of this close collaboration.
> >
> > Because sketches are small they can be processed extremely fast, often many
> > orders-of-magnitude faster than traditional exact computations. For
> > interactive queries there may not be other viable alternatives, and in the
> > case of real-time analysis, sketches are the only known solution.
> >
> > For any system that needs to extract useful information from massive data,
> > sketches are essential tools that should be tightly integrated into the
> > system’s analysis capabilities. This technology has helped Yahoo
> > successfully reduce data processing times from days to hours or minutes on
> > a number of its internal platforms and has enabled subsecond queries on
> > real-time platforms that would have been infeasible without sketches.
> >
> > == The Rationale for Apache DataSketches ==
> >
> > Other open source implementations of sketch algorithms can be found on the
> > Internet. However, we have not yet found any open source implementations
> > that are as comprehensive, engineered with the quality required for
> > production systems, and with usable and guaranteed error properties.  Large
> > Internet companies, such as Google and Facebook, have published papers on
> > sketching; however, their implementations of their published algorithms are
> > proprietary and not available as open source.
> >
> > The DataSketches library already provides integrations with a number of
> > major Apache data processing platforms such as Apache Hive, Apache Pig,
> > Apache Spark and Apache Druid, and is also integrated with a number of
> > other open source data processing platforms such as Splice Machine, GCHQ
> > Gaffer and PostgreSQL.
> >
> > We believe that having DataSketches as an Apache project will provide an
> > immediate, worthwhile, and substantial contribution to the open source
> > community, will give us a better opportunity to make a meaningful
> > contribution to both the science and engineering of sketching algorithms,
> > and will help us integrate with other Apache projects.  In addition, this is a
> > significant opportunity for Apache to be the "go-to" destination for users
> > that want to leverage this exciting technology.
> >
> > == Initial Goals ==
> >
> > We are breaking our initial goals into short-term (2-6 months) and
> > intermediate to long-term (6 months to 2 years):
> >
> > Our short-term goals include:
> >
> > * Understanding and adapting to the Apache development process and
> > structures.
> >
> > * Starting to refactor the codebase and move the code of the various
> > DataSketches repositories to Apache Git repositories.
> >
> > * Continuing development of new features, functions, and fixes.
> >
> > * Continuing to develop and expand specific sub-projects (e.g., C++ and
> > Python).
> >
> >
> > The intermediate to long term goals include:
> >
> > * Completing the design and implementation of the C++ sketches to
> > complement what is already available in Java, and the Python wrappers of
> > those C++ sketches.
> >
> > * Expanding the C++ build framework to include Windows and the popular
> > Linux variants.
> >
> > * Continued engagement with the scientific research community on the
> > development of new algorithms for computationally difficult problems that
> > heretofore have not had a sketching solution.
> >
> > == Current Status ==
> >
> > The DataSketches GitHub project has been quite successful.  As of this
> > writing (February 2019), the number of downloads measured by the Nexus
> > Repository Manager at https://oss.sonatype.org has grown by nearly a factor
> > of 10 over the past year to about 55 thousand per month. The
> > DataSketches/sketches-core repository has about 560 stars and 141 forks,
> > which is pretty good for a highly specialized library.
> >
> > === Development Practices ===
> >
> > ==== Source Control ====
> >
> > All of our developers have extensive experience with Git version control
> > and follow accepted practices for use of Pull Requests (PRs), code reviews
> > and commits to master, for example.
> >
> > ==== Testing ====
> >
> > Sketches, by their nature, are probabilistic programs and don’t necessarily
> > behave deterministically.  For some of the sketches we intentionally insert
> > random noise into the code as this gives us the mathematical properties
> > that we need to guarantee accuracy.  This can make the behavior of these
> > algorithms quite unintuitive and provides significant challenges to the
> > developer who wishes to test these algorithms for correctness. As a result,
> > our testing strategy includes two major components: unit tests, and
> > characterization tests.
> >
> > ===== Unit Testing =====
> >
> > Our unit tests are primarily quick tests to make sure that we exercise all
> > critical paths in the code and that key branches are executed correctly. It
> > is important that they execute relatively fast as they are generally run on
> > every code build. The sketches-core repository alone has about 22 thousand
> > statements, over 1300 unit tests and code coverage of about 98.2% as
> > measured by Atlassian/Clover.  Our goal is for all of our code
> > repositories that are used in production to have code coverage
> > greater than 90%.
> >
> > ===== Characterization Testing =====
> >
> > In order to test the probabilistic methods that are used to interpret the
> > stochastic behaviors of our sketches we have a separate characterization
> > repository that is dedicated to this.  Measuring accuracy, for example,
> > requires running thousands of trials at each of many different points along
> > the domain axis. Each trial compares its estimated results against a known
> > exact result, producing an error for that trial.  These error measurements
> > are then fed into our Quantiles sketch to capture the actual distribution
> > of error at that point along the axis. We then select quantile contours
> > across all the distributions at points along the axis.  These contours can
> > then be plotted to reveal the shape of the actual error distribution. These
> > distributions are not at all Gaussian; in fact, they can be quite complex.
> > Nonetheless, these distributions are then checked against the statistical
> > guarantees inherent to the specific sketch algorithm and its parameters.
> > There are many examples of these characterization error distributions on
> > our website. The runtimes of these tests can be very long and can range
> > from many minutes to hours, and some can run for days.  Currently, we have
> > separate characterization repositories for Java and C++ / Python.
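
The procedure above can be outlined in Python as follows. This is a simplified sketch under stated assumptions: it simulates a K-Minimum-Values distinct-count estimator with ideal uniform hashes instead of running a real sketch, and it takes exact quantiles from a sorted list where the text uses the Quantiles sketch. The function names, `k`, the trial count, and the quantile levels are all hypothetical.

```python
import heapq
import random


def kmv_trial(n, k, rng):
    """One trial: relative error of a simulated KMV estimate for n distinct items."""
    if n < k:
        return 0.0  # fewer than k distinct items: the estimate is exact
    # With ideal hashing, n distinct items yield n independent uniform values.
    kth_smallest = heapq.nsmallest(k, (rng.random() for _ in range(n)))[-1]
    estimate = (k - 1) / kth_smallest
    return estimate / n - 1.0  # compare against the known exact answer


def error_contours(n, k, trials, quantiles, rng):
    """Error distribution at one point on the domain axis, read at each quantile."""
    errors = sorted(kmv_trial(n, k, rng) for _ in range(trials))
    return [errors[min(int(q * trials), trials - 1)] for q in quantiles]


def characterize(points, k=64, trials=200, quantiles=(0.05, 0.5, 0.95), seed=1):
    """Map each stream size in points to its list of quantile contours."""
    rng = random.Random(seed)  # fixed seed so a run can be reproduced
    return {n: error_contours(n, k, trials, quantiles, rng) for n in points}


for n, (low, median, high) in characterize([100, 1000, 5000]).items():
    print(f"n={n:>5}  q05={low:+.3f}  q50={median:+.3f}  q95={high:+.3f}")
```

Plotting each quantile column against `n` would give contour curves of the kind the text describes; a real characterization run uses far more trials and points, which is why such runs can take hours.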
> >
> > It is our goal that we perform this characterization analysis for all of
> > our sketches.  By definition, the code that runs these characterization
> > tests is open-source so others can run these tests as well.  We do not have
> > formal releases of this code (because it is not production code) and it is
> > not published to Maven Central.
> >
> > === Meritocracy ===
> >
> > DataSketches was initially developed based on requirements within Yahoo. As
> > a project on GitHub, DataSketches has received contributions from numerous
> > individual developers from around the world, dedicated research work from
> > senior scientists at Amazon and Visa, and academic researchers from
> > Georgetown University, Princeton, and MIT.
> >
> > As a project under incubation, we are committed to expanding our effort to
> > build an environment which supports a meritocracy. We are focused on
> > engaging the community and other related projects for support and
> > contributions. Moreover, we are committed to ensuring that contributors and
> > committers to DataSketches come from a broad mix of organizations through a
> > merit-based decision process during incubation. We believe strongly in the
> > DataSketches premise of a well-engineered and scientifically rigorous
> > library that implements these powerful algorithms, and we are committed to
> > growing an inclusive community of DataSketches contributors and users.
> >
> > === Community ===
> >
> > Yahoo has a long history of active engagement in the Open Source
> > community. Major projects include: Vespa.ai, Bullet, Moloch, Panoptes,
> > Screwdriver.cd, Athenz, HaloDB, Maha, Mendel, TensorFlowOnSpark, gifshot,
> > fluxible, as well as the creation, contribution and incubation of many
> > Apache projects such as Apache Hadoop, Pig, Bookkeeper, Oozie, Zookeeper,
> > Omid, Pulsar, Traffic Server, Storm, Druid, and many more.
> >
> > Every day, DataSketches is actively used by organizations and
> > institutions around the world for batch and stream processing of data. We
> > believe acceptance will allow us to consolidate existing
> > DataSketches-related work, grow the DataSketches community, and deepen
> > connections between DataSketches and other open source projects.
> >
> > === Introduction to the Core Developers & Contributors ===
> >
> > The core developers and contributors for DataSketches are from diverse
> > backgrounds, but primarily are scientists that love engineering and
> > engineers that love science. A large part of the value we bring comes from
> > this synthesis.  These individuals have already contributed substantially
> > to the code, algorithms, and/or mathematical proofs that form the basis of
> > the library.
> >
> > This core group also forms the Initial Committers with write permissions to
> > the repository. Those marked with (*) meet weekly to plan the research and
> > engineering direction of the project.
> >
> > ==== Scientists That Love Engineering ====
> >
> > * Eshcar Hillel: Senior Research Scientist, Yahoo Labs, Israel. Interests:
> > distributed systems, scalable systems and platforms for big data
> > processing, concurrent algorithms and data structures.
> >
> > * Kevin Lang: (*) Distinguished Research Scientist, Yahoo Labs, Sunnyvale,
> > California. Interests: algorithms, theoretical and applied mathematics,
> > encoding and compression theory, theoretical and applied performance
> > optimization.
> >
> > * Edo Liberty: (*) Director of Research, Head of Amazon AI Labs, Palo Alto,
> > California. Manages the algorithms group at Amazon AI. We build scalable
> > machine learning systems and algorithms which are used both internally and
> > externally by customers of SageMaker, AWS's flagship machine learning
> > platform.
> >
> > * Jon Malkin: (*) Senior Scientist, Yahoo Labs, Sunnyvale. Interests:
> > Computational advertising, machine learning, speech recognition,
> > data-driven analysis, large scale experimentation, big data, stream/complex
> > event processing.
> >
> > * Justin Thaler: (*) Assistant Professor, Department of Computer Science,
> > Georgetown University, Washington D.C. Interests: algorithms and
> > computational complexity, complexity theory, quantum algorithms, private
> > data analysis, learning theory, and developing efficient streaming and
> > sketching algorithms.
> >
> > ==== Engineers That Love Science ====
> >
> > * Roman Leventov: Senior Software Engineer,  Metamarkets / Snap. Interests:
> > design and implementation of data storing and data processing (distributed)
> > systems, performance optimization, CPU performance, mechanical sympathy,
> > JVM performance, API design, databases, (concurrent) data structures,
> > memory management, garbage collection algorithms, language design and
> > runtimes (their tradeoffs), distributed systems (cloud) efficiency, Linux,
> > code quality, code transformation, pure functional programming models,
> > Haskell.
> >
> > * Lee Rhodes: (*) Distinguished Architect, lead developer and founder of
> > the DataSketches project, Yahoo, Sunnyvale, California.  Interests:
> > streaming algorithms, mathematics, computer science, high quality and high
> > performance code for the analysis of massive data, bridging the divide
> > between theory and practice.
> >
> > * Alexander Saydakov: (*) Senior Software Engineer, Yahoo, Sunnyvale,
> > California. Interests: applied mathematics, computer science, big data,
> > distributed systems.
> >
> > === Introduction to Additional Interested Contributors ===
> >
> > These folks have been involved and have contributed intermittently, but
> > are strong supporters of this project.
> >
> > * Frank Grimes: GitHub ID: frankgrimes97
> >
> > * Mina Ghashami: [mina.ghashami at gmail dot com] Ph.D. Computer Science,
> > Univ of Utah. Interests: Machine Learning, Data Mining, matrix
> > approximation, streaming algorithms, randomized linear algebra.
> >
> > * Christopher Musco: [christopher.musco at gmail dot com] Ph.D. Computer
> > Science, Research Instructor, Princeton University. Interests: algorithmic
> > foundations of data science and machine learning, efficient methods for
> > processing and understanding large datasets, often working at the
> > intersection of theoretical computer science, numerical linear algebra, and
> > optimization.
> >
> > * Graham Cormode: [g.cormode at warwick.ac dot uk] Ph.D. Computer Science,
> > Professor, Warwick University, Warwick, England. Interests: all aspects of
> > the "data lifecycle", from data collection and cleaning, through mining and
> > analytics. (Professor Cormode is one of the world’s leading scientists in
> > sketching algorithms)
> >
> > === Alignment ===
> >
> > The DataSketches library already provides integrations and example code for
> > Apache Hive, Apache Pig, Apache Spark and is deeply integrated into Apache
> > Druid.
> >
> > == Known Risks ==
> >
> > The following subsections address specific risks that the ASF has
> > identified.
> >
> > === Risk: Orphaned Products ===
> >
> > The DataSketches library is presently used by a number of organizations,
> > from small startups to Fortune 100 companies, to construct production
> > pipelines that must process and analyze massive data. Yahoo has a long-term
> > commitment to continue to advance the DataSketches library; moreover,
> > DataSketches is seeing increasing interest, development, and adoption from
> > many diverse organizations from around the world. Due to its growing
> > adoption, we feel it is quite unlikely that this project would become
> > orphaned.
> >
> > === Risk: Inexperience with Open Source ===
> >
> > Yahoo believes strongly in open source and the exchange of information to
> > advance new ideas and work. Examples of this commitment are active open
> > source projects such as those mentioned above. With DataSketches, we have
> > been increasingly open and forward-looking; we have published a number of
> > papers about breakthrough developments in the science of streaming
> > algorithms (mentioned above) that also reference the DataSketches library.
> > Our submission to the Apache Software Foundation is a logical extension of
> > our commitment to open source software.
> >
> > Key committers at Yahoo with strong open source backgrounds include Aaron
> > Gresch, Alan Carroll, Alessandro Bellina, Anastasia Braginsky, Andrews
> > Sahaya Albert, Arun S A G, Atul Mohan, Brad McMillen, Bryan Call, Daryn
> > Sharp, Dav Glass, David Carlin, Derek Dagit, Eric Payne, Eshcar Hillel,
> > Ethan Li, Fei Deng, Francis Christopher Liu, Francisco Perez-Sorrosal, Gil
> > Yehuda, Govind Menon, Hang Yang, Jacob Estelle, Jai Asher, James Penick,
> > Jason Kenny, Jay Pipes, Jim Rollenhagen, Joe Francis, Jon Eagles, Kihwal
> > Lee, Kishorkumar Patil, Koji Noguchi, Kuhu Shukla, Michael Trelinski,
> > Mithun Radhakrishnan, Nathan Roberts, Ohad Shacham, Olga L. Natkovich,
> > Parth Kamlesh Gandhi, Rajan Dhabalia, Rohini Palaniswamy, Ruby Loo, Ryan
> > Bridges, Sanket Chintapalli, Satish Subhashrao Saley, Shu Kit Chan, Sri
> > Harsha Mekala, Susan Hinrichs, Yonatan Gottesman, and many more.
> >
> > All of our core developers are committed to learning about the Apache process
> > and to give back to the community.
> >
> > === Risk: Homogeneous Developers ===
> >
> > The majority of committers in this proposal belong to Yahoo due to the fact
> > that DataSketches has emerged from an internal Yahoo project. This proposal
> > also includes developers and contributors from other companies, some of whom
> > are actively involved with other Apache projects, such as Druid.  We expect our
> > entry into incubation will allow us to expand the number of individuals and
> > organizations participating in DataSketches development.
> >
> > === Risk: Reliance on Salaried Developers ===
> >
> > Because the DataSketches library originated within Yahoo, it has been
> > developed primarily by salaried Yahoo developers and we expect that to
> > continue to be the case near term. However, since we placed this library
> > into open-source we have had a number of significant contributions from
> > engineers and scientists from outside of Yahoo. We expect our reliance on
> > Yahoo salaried developers will decrease over time. Nonetheless, Yahoo is
> > committed to continue its strong support of this important project.
> >
> > === Risk: Lack of Relationship to other Apache Products ===
> >
> > DataSketches already directly interoperates with or utilizes several
> > existing Apache projects.
> >
> > * Build
> >    * Apache Maven
> >
> > * Integrations and adaptors for the following projects naturally have them
> > as dependencies
> >    * Apache Hive
> >    * Apache Pig
> >    * Apache Druid
> >    * Apache Spark
> >
> > * Additional dependencies for the above integrations and adaptors include
> >    * Apache Hadoop
> >    * Apache Commons (Math)
> >
> > There is no other Apache project that we are aware of that duplicates the
> > functionality of the DataSketches library.
> >
> > === Risk: An Excessive Fascination with the Apache Brand ===
> >
> > With this proposal we are not seeking attention or publicity. Rather, we
> > firmly believe in the DataSketches library and concept and the ability to
> > make the DataSketches library a powerful, yet simple-to-use toolkit for
> > data processing. While the DataSketches library has been open source, we
> > believe putting code on GitHub can only go so far. We see the Apache
> > community, processes, and mission as critical for ensuring the DataSketches
> > library is truly community-driven, positively impactful, and innovative
> > open source software. While Yahoo has taken a number of steps to advance
> > its various open source projects, we believe the DataSketches library
> > project is a great fit for the Apache Software Foundation due to its focus
> > on data processing and its relationships to existing ASF projects.
> >
> > === Risk: Cryptography ===
> >
> > DataSketches does not contain any cryptographic code and is not a
> > cryptographic product.
> >
> > == Documentation ==
> >
> > The following documentation is relevant to this proposal. Relevant portions
> > of the documentation will be contributed to the Apache DataSketches
> > project.
> >
> > * DataSketches website: https://datasketches.github.io.
> >
> > * DataSketches website repository:
> > https://github.com/DataSketches/DataSketches.github.io
> >
> > We will need an Apache website for this documentation similar to
> >
> > * https://datasketches.apache.org
> >
> > == Initial Source ==
> >
> > The initial source for DataSketches which we will submit to the Apache
> > Foundation will include a number of repositories which are currently hosted
> > under the GitHub.com/datasketches organization:
> >
> > All github.com/datasketches repositories including:
> >
> > * Java
> >    * sketches-core: This repository has the core sketching classes, which
> > are leveraged by some of the other repositories. This repository has no
> > external dependencies outside of the DataSketches/memory repository, Java
> > and TestNG for unit tests. This code is versioned and the latest release
> > can be obtained from Maven Central.
> >    * memory: Low level, high-performance memory data-structure management
> > primarily for off-heap.
> >    * sketches-android: This is a new repository dedicated to sketches
> > designed to be run in a mobile client, such as a cell phone. It is still in
> > development and should be considered experimental.
> >    * sketches-hive: This repository contains Hive UDFs and UDAFs for use
> > within Hadoop grid environments. This code has dependencies on
> > sketches-core as well as Hadoop and Hive. Users of this code are advised to
> > use Maven to bring in all the required dependencies. This code is versioned
> > and the latest release can be obtained from Maven Central.
> >    * sketches-pig: This repository contains Pig User Defined Functions
> > (UDF) for use within Hadoop grid environments. This code has dependencies
> > on sketches-core as well as Hadoop and Pig. Users of this code are advised
> > to use Maven to bring in all the required dependencies. This code is
> > versioned and the latest release can be obtained from Maven Central.
> >    * sketches-vector: This is a new repository dedicated to sketches for
> > vector and matrix operations. It is still somewhat experimental.
> >    * characterization: This relatively new repository is for code that we
> > use to characterize the accuracy and speed performance of the sketches in
> > the library and is constantly being updated. Examples of the job command
> > files used for various tests can be found in the src/main/resources
> > directory. Some of these tests can run for hours depending on their
> > configuration.
> >    * experimental: This repository is an experimental staging area for code
> > that will eventually end up in another repository. This code is not
> > versioned and not registered with Maven Central.
> >    * sketches-misc: Demos and other code not related to production
> > deployment
> >
> > * C++ and Python
> >    * sketches-core-cpp: This is the C++/Python companion to the Java
> > sketches-core. These implementations are binary compatible with their
> > counterparts in Java. In other words, a sketch created and stored in C++
> > can be opened and read in Java and vice versa. This site also has our
> > Python adaptors that basically wrap the C++ implementations, making the
> > high performance C++ implementations available from Python.
> >    * sketches-postgres: This site provides the postgres-specific adaptors
> > that wrap the C++ implementations making them available to the Postgres
> > database users.
> >    * characterization-cpp: This is the C++/Python companion to the Java
> > characterization repository.
> >    * experimental-cpp: This repository is an experimental staging area for
> > C++ code that will eventually end up in another repository.
> >
> > * Command-Line Tools
> >    * sketches-cmd
> >    * homebrew-sketches
> >    * homebrew-sketches-cmd
> >
> > These projects have always been Apache 2.0 licensed. We intend to bundle
> > all of these repositories since they are all complementary and should be
> > maintained in one project. Prior to our submission, we will combine all of
> > these projects into a new git repository.
> >
> > == Source and Intellectual Property Submission Plan ==
> >
> > Contributors to the DataSketches project have also signed the Yahoo
> > Individual Contributor License Agreement (https://yahoocla.herokuapp.com/)
> > in order to contribute to the project.
> >
> > With respect to trademark rights, Yahoo does not hold a trademark on the
> > phrase “DataSketches.” Based on feedback and guidance we receive during the
> > incubation process, we are open to renaming the project if necessary for
> > trademark or other concerns, but we would prefer not to have to do that.
> >
> > == External Dependencies ==
> >
> > All external dependencies are licensed under an Apache 2.0 or
> > Apache-compatible license. As we grow the DataSketches community we will
> > configure our build process to require and validate that all contributions
> > and dependencies are licensed under the Apache 2.0 license or under an
> > Apache-compatible license.
> >
> > == Required Resources ==
> >
> > === Mailing Lists ===
> >
> > We currently use a mix of mailing lists. We will migrate our existing
> > mailing lists to the following:
> >
> > * dev@datasketches.incubator.apache.org
> >
> > * user@datasketches.incubator.apache.org
> >
> > * private@datasketches.incubator.apache.org
> >
> > * commits@datasketches.incubator.apache.org
> >
> > === Source Control ===
> >
> > The DataSketches team currently uses Git and would like to continue to do
> > so. We request a Git repository for DataSketches with mirroring to GitHub
> > enabled, similar to the following:
> >
> > * https://github.com/apache/incubator-datasketches.git
> >
> > === Issue Tracking ===
> >
> > We request the creation of an Apache-hosted JIRA. The DataSketches project
> > is currently using the public GitHub issue tracker and the public Google
> > Groups forum/sketches-user for issue tracking and discussions. We will
> > migrate and combine the content of these two sources into the Apache JIRA.
> >
> > Proposed Jira ID: DATASKETCHES
> >
> > == Initial Committers ==
> >
> > The following individuals have been extremely active in our community
> > and should have write (commit) permissions to the repository.
> >
> > * Eshcar Hillel                      [eshcar at verizonmedia dot com]
> >
> > * Kevin Lang                    [langk at verizonmedia dot com]
> >
> > * Roman Leventov              [roman.leventov at c.metamarkets dot com]
> >
> > * Edo Liberty                   [libertye at amazon dot com]
> >
> > * Jon Malkin                    [jmalkin at verizonmedia dot com]
> >
> > * Lee Rhodes                  [lrhodes at verizonmedia dot com] & [leerho
> > at gmail dot com]
> >
> > * Alexander Saydakov         [saydakov at verizonmedia dot com]
> >
> > * Justin Thaler                 [justin.thaler at georgetown dot edu]
> >
> > == Affiliations ==
> >
> > The initial committers are from four organizations: Yahoo, Amazon,
> > Georgetown University, and Metamarkets/Snap.
> >
> > === Champion ===
> > (Recommended to me: )
> >
> > Liang Chen, Vice President of Apache CarbonData, [chenliang613 at apache
> > dot org]
> > Jean-Baptiste Onofré, [jb at nanthrax dot net]
> >
> > === Nominated Mentors ===
> > (Recommended to me: )
> >
> > Liang Chen, Vice President of Apache CarbonData, [chenliang613 at apache
> > dot org]
> > Jean-Baptiste Onofré, jb at nanthrax dot net
> > Gil Yehuda, gyehuda at verizonmedia dot com
> >
> > === Sponsoring Entity ===
> >
> > * The Apache Incubator    **** This is our 1st choice ****
> >
> > * Apache Druid. The incubating Apache Druid project might also be a logical
> > sponsor. However, DataSketches has applications in many areas of computing
> > outside of Druid so our preference and recommendation is that DataSketches
> > would ultimately be a top-level Apache project.
> >
> > ________________
> > [1] In 2017 Verizon acquired Yahoo and merged it with previously acquired
> > AOL. The merged entity was originally called Oath, Inc., but has recently
> > been renamed Verizon Media, Inc., a wholly-owned subsidiary of Verizon,
> > Inc.  Since Yahoo is the more recognized name, references in this document
> > to Yahoo are also references to Verizon Media, Inc.
> >
> > On Fri, Feb 22, 2019 at 9:35 PM Kenneth Knowles <ke...@apache.org> wrote:
> >
> > > The subject line has me interested already. Follow examples like this
> > > maybe?
> > >
> > > 1.
> > >
> > >
> > https://lists.apache.org/thread.html/a5db74cc9e5ae89b3bfa5f4b07bfcc18dae84b7098232fb897cd47b7@%3Cgeneral.incubator.apache.org%3E
> > > 2.
> > >
> > >
> > https://lists.apache.org/thread.html/5a7f6a218b11a1cac61fbd53f4c995fd7716f8ad3751cf9f171ebd57@%3Cgeneral.incubator.apache.org%3E
> > >
> > > Kenn
> > >
> > > On Fri, Feb 22, 2019 at 8:05 PM leerho <le...@gmail.com> wrote:
> > >
> > > > I'll try again ... :)
> > > >
> > > > On Fri, Feb 22, 2019 at 8:00 PM Ted Dunning <te...@gmail.com>
> > > wrote:
> > > >
> > > >> It didn't make it again
> > > >>
> > > >> On Fri, Feb 22, 2019, 8:35 PM leerho <le...@gmail.com> wrote:
> > > >>
> > > >> > I'm not sure the attached document made it through.
> > > >> >
> > > >> > On Fri, Feb 22, 2019 at 7:28 PM leerho <le...@gmail.com> wrote:
> > > >> >
> > > >> > >
> > > >> > >
> > > >> >
> > > >>
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> > > > For additional commands, e-mail: general-help@incubator.apache.org
> > >
> >
> 


Re: DataSketches Proposal - Google Docs Link

Posted by Kenneth Knowles <ke...@apache.org>.

It worked. I've updated the shortlink to point to your doc.

Kenn

On Tue, Feb 26, 2019 at 4:02 PM Liang Chen <ch...@gmail.com> wrote:

> Hi Kenneth
>
> Please try this link :
>
> https://docs.google.com/document/d/1_cnesVLtKqPeUYxJvsd_2MTFwgeC1wUqI6cDPCbBRSM/edit#heading=h.97rxea60t2yw
>
> Regards
> Liang
>
>
> Kenneth Knowles wrote
> > I could not access that document. I suggest you need to turn on link
> > sharing.
> >
> > Kenn
> >
> > On Mon, Feb 25, 2019 at 12:00 PM leerho@ <leerho@> wrote:
> >
> >> Try this link:
> >>
> >> https://docs.google.com/document/d/19JKevzFQNcaLA51LFLUlP1hzdFDW7oDJrJO8N6weDv8/edit?usp=sharing
> >>
> >>
> >> On 2019/02/25 05:55:50, leerho <leerho@> wrote:
> >> > Yes I will try that tomorrow.
> >> >
> >> > On Sun, Feb 24, 2019 at 7:34 PM Kenneth Knowles <kenn@> wrote:
> >> >
> >> > > Can you share the Google doc with the proposal? Per Ted's advice, we
> >> > > can iterate quickly there and move it to the wiki when it becomes a
> >> > > bit more stable.
> >> > >
> >> > > Kenn
> >> > >
> >> > > On Fri, Feb 22, 2019 at 10:21 PM leerho@ <leerho@> wrote:
> >> > >
> >> > > > Thanks for the offer.  I am a neophyte at this process and email
> >> > > > app!   I could use a lot of help getting this off the ground!
> >> > > > Also, I'm not sure that Mr. Chen and Mr. Onofré have fully accepted
> >> > > > taking this on :)
> >> > > >
> >> > > > Lee.
> >> > > >
> >> > > > On 2019/02/23 06:03:58, Kenneth Knowles <kenn@> wrote:
> >> > > > > Nice.
> >> > > > >
> >> > > > > I would very much like to help mentor this project, though you
> >> > > > > already have a couple good ones.
> >> > > > >
> >> > > > > I concur with incubator as sponsoring entity.
> >> > > > >
> >> > > > > Kenn (VP Apache Beam)
> >> > > > >
> >> > > > > On Fri, Feb 22, 2019 at 9:45 PM leerho <leerho@> wrote:
> >> > > > >
> >> > > > > > I didn't realize that this mail list does not accept PDF files,
> >> > > > > > apparently only text.  So let me try one more time ... :)  Please
> >> > > > > > let me know if this works!
> >> > > > > >
> >> > > > > >
> >> > > > > > = Apache DataSketches Proposal[1] =
> >> > > > > >
> >> > > > > > == Abstract ==
> >> > > > > >
> >> > > > > > DataSketches.GitHub.io is an open source, high-performance library
> >> > > > > > of stochastic streaming algorithms commonly called "sketches" in
> >> > > > > > the data sciences. Sketches are small, stateful programs that
> >> > > > > > process massive data as a stream and can provide approximate
> >> > > > > > answers, with mathematical guarantees, to computationally difficult
> >> > > > > > queries orders-of-magnitude faster than traditional, exact methods.
> >> > > > > >
> >> > > > > > This proposal is to move DataSketches to the Apache Software
> >> > > > > > Foundation (ASF), transferring ownership of its copyright
> >> > > > > > intellectual property to the ASF.  Thereafter, DataSketches would
> >> > > > > > be officially known as Apache DataSketches and its evolution and
> >> > > > > > governance would come under the rules and guidance of the ASF.
> >> > > > > >
> >> > > > > > == Introduction ==
> >> > > > > >
> >> > > > > > The DataSketches library contains carefully crafted implementations
> >> > > > > > of sketch algorithms that meet rigorous standards of quality and
> >> > > > > > performance and provide capabilities required for large-scale
> >> > > > > > production systems that must process and analyze massive data. The
> >> > > > > > DataSketches core repository is written in Java with a parallel
> >> > > > > > core repository written in C++ that includes Python wrappers. The
> >> > > > > > DataSketches library also includes special repositories for
> >> > > > > > extending the core library for Apache Hive and Apache Pig. The
> >> > > > > > sketches developed in the different languages share a common binary
> >> > > > > > storage format so that sketches created and stored in Java, for
> >> > > > > > example, can be fully used in C++, and vice versa.  Because the
> >> > > > > > stored sketch "images" are just a "blob" of bytes (similar to
> >> > > > > > picture images), they can be shared across many different systems,
> >> > > > > > languages and platforms.
> >> > > > > >
> >> > > > > > The DataSketches documentation website,
> >> > > > > > https://datasketches.github.io, includes general tutorials, a
> >> > > > > > comprehensive research section with references to relevant academic
> >> > > > > > papers, extensive examples for using the core library directly as
> >> > > > > > well as examples for accessing the library in Hive, Pig, and Apache
> >> > > > > > Spark.
> >> > > > > >
> >> > > > > > The DataSketches library also includes a characterization
> >> > > > > > repository for long running test programs that are used for
> >> > > > > > studying accuracy and performance of these sketches over wide
> >> > > > > > ranges of input variables. The data produced by these programs is
> >> > > > > > used for generating the many performance plots contained in the
> >> > > > > > documentation website and for academic publications.
> >> > > > > >
> >> > > > > > The code repositories used for production are versioned and
> >> > > > > > published to Maven Central on periodic intervals as the library
> >> > > > > > evolves.
> >> > > > > >
> >> > > > > > The DataSketches library also includes several experimental
> >> > > > > > repositories for use-cases outside the large-scale systems
> >> > > > > > environments, such as sketches for mobile, IoT devices (Android),
> >> > > > > > command-line access of the sketch library, and an experimental
> >> > > > > > repository for vector-based sketches that performs approximate
> >> > > > > > Singular Value Decomposition (SVD) analysis that could potentially
> >> > > > > > be used in Machine Learning (ML) applications.
> >> > > > > >
> >> > > > > > == Background ==
> >> > > > > >
> >> > > > > > The DataSketches library was started in 2012 as an internal Yahoo
> >> > > > > > project to dramatically reduce time and resources required for
> >> > > > > > distinct (unique) counting.  An extensive search on the Internet at
> >> > > > > > the time yielded a number of theoretical papers on stochastic
> >> > > > > > streaming algorithms with pseudocode examples, but we did not find
> >> > > > > > any usable open-source code of the quality we felt we needed for
> >> > > > > > our internal production systems.  So we started a small project
> >> > > > > > (one person) to develop our own sketches working directly from
> >> > > > > > published theoretical papers.
> >> > > > > >
> >> > > > > > The DataSketches library was designed from the start with the
> >> > > > > > objective of making these algorithms, usually only described in
> >> > > > > > theoretical papers, easily accessible to systems developers for use
> >> > > > > > in our internal production systems. By necessity, the code had to
> >> > > > > > be of the highest quality and thoroughly tested. The wide variety
> >> > > > > > of our internal production systems drove the requirement that the
> >> > > > > > sketch implementations had to have an absolute minimum of external,
> >> > > > > > run-time dependencies in order to simplify integration and
> >> > > > > > troubleshooting.
> >> > > > > >
> >> > > > > > Our internal experiments demonstrated dramatic positive impact on
> >> > > > > > the performance of our systems.  As a result, the DataSketches
> >> > > > > > library quickly evolved to include different types of sketches for
> >> > > > > > different types of queries, such as frequent-items (a.k.a.
> >> > > > > > heavy-hitters) algorithms, quantile/histogram algorithms, and
> >> > > > > > weighted and unweighted sampling algorithms.
> >> > > > > >
> >> > > > > > We quickly discovered that developing these sketch algorithms to be
> >> > > > > > truly robust in production environments is quite difficult and
> >> > > > > > requires deep understanding of the underlying mathematics and
> >> > > > > > statistics as well as extensive experience in developing high
> >> > > > > > quality code for 24/7 production systems. This is a difficult
> >> > > > > > combination of skills for any one organization to collect and
> >> > > > > > maintain over time. It became clear that this technology needed a
> >> > > > > > community larger than Yahoo to evolve.  In November 2015, this
> >> > > > > > factor, along with Yahoo’s strong experience and support of open
> >> > > > > > source, led to the decision to open source this technology under an
> >> > > > > > Apache 2.0 license on GitHub. Since that time our community has
> >> > > > > > expanded considerably and the key contributors to this effort
> >> > > > > > include leading research scientists from a number of universities
> >> > > > > > as well as practitioners and researchers from a number of major
> >> > > > > > corporations. The core of this group is very active as we meet
> >> > > > > > weekly to discuss research directions and engineering priorities.
> >> > > > > >
> >> > > > > > It is important to note that our internal systems at Yahoo use the
> >> > > > > > current public GitHub open source DataSketches library and not an
> >> > > > > > internal version of the code.
> >> > > > > >
> >> > > > > > The close collaboration of scientific research and engineering
> >> > > > > > development experience with actual massive-data processing systems
> >> > > > > > has also produced new research publications in the field of
> >> > > > > > stochastic streaming algorithms, for example:
> >> > > > > >
> >> > > > > > * Daniel Anderson, Pryce Bevan, Kevin J. Lang, Edo Liberty, Lee
> >> > > > > > Rhodes, and Justin Thaler. A high-performance algorithm for
> >> > > > > > identifying frequent items in data streams. In ACM IMC 2017.
> >> > > > > >
> >> > > > > > * Anirban Dasgupta, Kevin J. Lang, Lee Rhodes, and Justin Thaler. A
> >> > > > > > framework for estimating stream expression cardinalities. In
> >> > > > > > EDBT/ICDT Proceedings ‘16, pages 6:1–6:17, 2016.
> >> > > > > >
> >> > > > > > * Mina Ghashami, Edo Liberty, Jeff M. Phillips. Efficient Frequent
> >> > > > > > Directions Algorithm for Sparse Matrices. In ACM SIGKDD Proceedings
> >> > > > > > ‘16, pages 845–854, 2016.
> >> > > > > >
> >> > > > > > * Zohar S. Karnin, Kevin J. Lang, and Edo Liberty. Optimal quantile
> >> > > > > > approximation in streams. In IEEE FOCS Proceedings ‘16, pages
> >> > > > > > 71–78, 2016.
> >> > > > > >
> >> > > > > > * Kevin J. Lang. Back to the future: an even more nearly optimal
> >> > > > > > cardinality estimation algorithm. arXiv preprint
> >> > > > > > https://arxiv.org/abs/1708.06839, 2017.
> >> > > > > >
> >> > > > > > * Edo Liberty. Simple and deterministic matrix sketching. In ACM
> >> > > > > > KDD Proceedings ‘13, pages 581–588, 2013.
> >> > > > > >
> >> > > > > > * Edo Liberty, Michael Mitzenmacher, Justin Thaler, and Jonathan
> >> > > > > > Ullman. Space lower bounds for itemset frequency sketches. In ACM
> >> > > > > > PODS Proceedings ‘16, pages 441–454, 2016.
> >> > > > > >
> >> > > > > > * Michael Mitzenmacher, Thomas Steinke, and Justin Thaler.
> >> > > > > > Hierarchical heavy hitters with the space saving algorithm. In SIAM
> >> > > > > > ALENEX Proceedings ‘12, pages 160–174, 2012.
> >> > > > > >
> >> > > > > > == The Rationale for Sketches ==
> >> > > > > >
> >> > > > > > In the analysis of big data there are often problem queries that
> >> > > > > > don’t scale because they require huge compute resources and time to
> >> > > > > > generate exact results. Examples include count distinct, quantiles,
> >> > > > > > most frequent items, joins, matrix computations, and graph
> >> > > > > > analysis.
> >> > > > > >
> >> > > > > > If we can loosen the requirement of “exact” results from our
> >> > > > > > queries and be satisfied with approximate results, within some well
> >> > > > > > understood bounds of error, there is an entire branch of
> >> > > > > > mathematics and data science that has evolved around developing
> >> > > > > > algorithms that can produce approximate results with mathematically
> >> > > > > > well-defined error properties.
> >> > > > > >
> >> > > > > > Adding the requirements that these algorithms must be small
> >> > > > > > (compared to the size of the input data), sublinear (the size of
> >> > > > > > the sketch must grow at a slower rate than the size of the input
> >> > > > > > stream), streaming (they can only touch each data item once), and
> >> > > > > > mergeable (suitable for distributed processing) defines a class of
> >> > > > > > algorithms that can be described as small, stochastic, streaming,
> >> > > > > > sublinear, mergeable algorithms, commonly called sketches (they
> >> > > > > > also have other names, but we will use the term sketches from here
> >> > > > > > on).
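To make the four properties above concrete, here is a minimal sketch in Python of one classic distinct-counting technique, K-Minimum-Values (KMV). It is an illustration only and is not the DataSketches implementation or API: the class name KmvSketch, its methods, and the choice of SHA-256 as the hash function are all assumptions made for this example.

```python
import hashlib
import heapq


class KmvSketch:
    """K-Minimum-Values distinct-count sketch (illustrative, hypothetical API)."""

    def __init__(self, k=1024):
        self.k = k
        self._heap = []        # negated hashes: a max-heap of the k smallest values
        self._members = set()  # the same hash values, for duplicate detection

    @staticmethod
    def _hash(item):
        # Map an item to a value in [0, 1) with a stable, well-mixed hash.
        digest = hashlib.sha256(repr(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def update(self, item):
        # Streaming: each item is examined exactly once.
        self._offer(self._hash(item))

    def _offer(self, h):
        if h in self._members:
            return  # duplicate item (same hash): state is unchanged
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # Small: memory never exceeds k retained hash values.
            evicted = -heapq.heapreplace(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def merge(self, other):
        # Mergeable: the k smallest hashes of the union are always among
        # the values retained by the two sketches being merged.
        if other.k != self.k:
            raise ValueError("this example only merges sketches with equal k")
        for h in list(other._members):
            self._offer(h)
        return self

    def estimate(self):
        if len(self._heap) < self.k:
            return float(len(self._heap))   # fewer than k distinct items: exact
        return (self.k - 1) / (-self._heap[0])  # (k - 1) / k-th smallest hash


if __name__ == "__main__":
    left, right = KmvSketch(), KmvSketch()
    for i in range(60_000):
        left.update(i)
    for i in range(40_000, 100_000):
        right.update(i)
    print("estimated distinct items in the union:", left.merge(right).estimate())
```

Two sketches built independently (for example, on different machines) can be combined with merge(); the merged estimate approximates the distinct count of the union of the two streams, with a relative standard error of roughly 1/sqrt(k).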
> >> > > > > >
> >> > > > > > To be truly streaming and be able to process data in a single pass,
> >> > > > > > sketches must make absolute minimum assumptions about the input
> >> > > > > > stream. This is critically important, as there is no “second
> >> > > > > > chance” to process the data.
> >> > > > > >
> >> > > > > > For example, sketches should not make assumptions about the order
> >> > > > > > of stream items, the stream length, the dynamic range of values, or
> >> > > > > > the distribution of item occurrence frequencies. Sketches should be
> >> > > > > > tolerant of NaNs, Nulls and empty objects. About the only thing
> >> > > > > > that the sketch needs to know about the stream is how to extract
> >> > > > > > items from it and what type the item is, e.g., is it a numeric
> >> > > > > > value or a string.
> >> > > > > >
> >> > > > > > As far as the sketch is concerned, the input stream is a sequence
> >> > > > > > of items in some unknown random order with unknown random values.
> >> > > > > >
> >> > > > > > The sketch is essentially a complex state machine and, combined
> >> > > > > > with the random input stream, defines a stochastic process. We then
> >> > > > > > apply probabilistic methods to interpret the states of the
> >> > > > > > stochastic process in order to extract useful information about the
> >> > > > > > input stream itself. The resulting information will be approximate,
> >> > > > > > but we also use additional probabilistic methods to extract an
> >> > > > > > estimate of the likely probability distribution of error.
> >> > > > > >
> >> > > > > > There is a significant scientific contribution here: defining the
> >> > > > > > state machine, understanding the resulting stochastic process,
> >> > > > > > developing the probabilistic methods, and proving mathematically
> >> > > > > > that it all works! This is why the scientific contributors to this
> >> > > > > > project are a critical and strategic component to our success.  The
> >> > > > > > development engineers translate the concepts of the proposed state
> >> > > > > > machine and probabilistic methods into production-quality code.
> >> > > > > > Even more important, they work closely with the scientists, feeding
> >> > > > > > back system and user requirements, which leads not only to superior
> >> > > > > > product design, but to new science as well.  A number of scientific
> >> > > > > > papers our members have published (see above) are a direct result
> >> > > > > > of this close collaboration.
> >> > > > > >
> >> > > > > > Because sketches are small they can be processed extremely fast,
> >> > > > > > often many orders-of-magnitude faster than traditional exact
> >> > > > > > computations. For interactive queries there may not be other viable
> >> > > > > > alternatives, and in the case of real-time analysis, sketches are
> >> > > > > > the only known solution.
> >> > > > > >
> >> > > > > > For any system that needs to extract useful information from
> >> > > > > > massive data, sketches are essential tools that should be tightly
> >> > > > > > integrated into the system’s analysis capabilities. This technology
> >> > > > > > has helped Yahoo successfully reduce data processing times from
> >> > > > > > days to hours or minutes on a number of its internal platforms and
> >> > > > > > has enabled subsecond queries on real-time platforms that would
> >> > > > > > have been infeasible without sketches.
> >> > > > > >
> >> > > > > > == The Rationale for Apache DataSketches ==
> >> > > > > >
> >> > > > > > Other open source implementations of sketch algorithms can be found
> >> > > > > > on the Internet. However, we have not yet found any open source
> >> > > > > > implementations that are as comprehensive, engineered with the
> >> > > > > > quality required for production systems, and with usable and
> >> > > > > > guaranteed error properties. Large Internet companies, such as
> >> > > > > > Google and Facebook, have published papers on sketching; however,
> >> > > > > > their implementations of their published algorithms are proprietary
> >> > > > > > and not available as open source.
> >> > > > > >
> >> > > > > > The DataSketches library already provides integrations with a
> >> > > > > > number of major Apache data processing platforms such as Apache
> >> > > > > > Hive, Apache Pig, Apache Spark and Apache Druid, and is also
> >> > > > > > integrated with a number of other open source data processing
> >> > > > > > platforms such as Splice Machine, GCHQ Gaffer and PostgreSQL.
> >> > > > > >
> >> > > > > > We believe that having DataSketches as an Apache project will
> >> > > > > > provide an immediate, worthwhile, and substantial contribution to
> >> > > > > > the open source community, will give us a better opportunity to
> >> > > > > > provide a meaningful contribution to both the science and
> >> > > > > > engineering of sketching algorithms, and will help us integrate
> >> > > > > > with other Apache projects.  In addition, this is a significant
> >> > > > > > opportunity for Apache to be the "go-to" destination for users that
> >> > > > > > want to leverage this exciting technology.
> >> > > > > >
> >> > > > > > == Initial Goals ==
> >> > > > > >
> >> > > > > > We are breaking our initial goals into short-term (2-6 months) and
> >> > > > > > intermediate to long-term (6 months to 2 years):
> >> > > > > >
> >> > > > > > Our short-term goals include:
> >> > > > > >
> >> > > > > > * Understanding and adapting to the Apache development process and
> >> > > > > > structures.
> >> > > > > >
> >> > > > > > * Start refactoring the codebase and move the code of the various
> >> > > > > > DataSketches repositories to the Apache Git repository.
> >> > > > > >
> >> > > > > > * Continue development of new features, functions, and fixes.
> >> > > > > >
> >> > > > > > * Specific sub-projects (e.g., C++ and Python) will continue to be
> >> > > > > > developed and expanded.
> >> > > > > >
> >> > > > > >
> >> > > > > > The intermediate to long term goals include:
> >> > > > > >
> >> > > > > > * Completing the design and implementation of the C++ sketches to
> >> > > > > > complement what is already available in Java, and the Python
> >> > > > > > wrappers of those C++ sketches.
> >> > > > > >
> >> > > > > > * Expanding the C++ build framework to include Windows and the
> >> > > > > > popular Linux variants.
> >> > > > > >
> >> > > > > > * Continued engagement with the scientific research community on
> >> > > > > > the development of new algorithms for computationally difficult
> >> > > > > > problems that heretofore have not had a sketching solution.
> >> > > > > >
> >> > > > > > == Current Status ==
> >> > > > > >
> >> > > > > > The DataSketches GitHub project has been quite successful.  As of
> >> > > > > > this writing (Feb, 2019) the number of downloads measured by the
> >> > > > > > Nexus Repository Manager at https://oss.sonatype.org has grown by
> >> > > > > > nearly a factor of 10 over the past year to about 55 thousand per
> >> > > > > > month. The DataSketches/sketches-core repository has about 560
> >> > > > > > stars and 141 forks, which is pretty good for a highly specialized
> >> > > > > > library.
> >> > > > > >
> >> > > > > > === Development Practices ===
> >> > > > > >
> >> > > > > > ==== Source Control ====
> >> > > > > >
> >> > > > > > All of our developers have extensive experience with Git version
> >> > > > > > control and follow accepted practices for use of Pull Requests
> >> > > > > > (PRs), code reviews and commits to master, for example.
> >> > > > > >
> >> > > > > > ==== Testing ====
> >> > > > > >
> >> > > > > > Sketches, by their nature, are probabilistic programs and don’t
> >> > > > > > necessarily behave deterministically.  For some of the sketches we
> >> > > > > > intentionally insert random noise into the code as this gives us
> >> > > > > > the mathematical properties that we need to guarantee accuracy.
> >> > > > > > This can make the behavior of these algorithms quite unintuitive
> >> > > > > > and provides significant challenges to the developer who wishes to
> >> > > > > > test these algorithms for correctness. As a result, our testing
> >> > > > > > strategy includes two major components: unit tests and
> >> > > > > > characterization tests.
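The idea behind characterization tests can be illustrated with a small, hypothetical Python harness: run repeated trials of an estimator against a known exact answer, record the relative error of each trial, and read off quantiles of the resulting error distribution. The estimator (kmv_estimate), the helper names, and the parameters below are assumptions made for this example and are not taken from the project's characterization repository. For brevity the harness sorts the errors exactly and the stand-in estimator is not itself streaming.

```python
import hashlib
import heapq


def kmv_estimate(items, k):
    """Estimate the distinct count from the k smallest hash values (stand-in estimator)."""
    hashes = {
        int.from_bytes(hashlib.sha256(repr(x).encode("utf-8")).digest()[:8], "big") / 2**64
        for x in items
    }
    if len(hashes) < k:
        return float(len(hashes))          # fewer than k distinct items: exact count
    kth_smallest = heapq.nsmallest(k, hashes)[-1]
    return (k - 1) / kth_smallest


def relative_errors(true_count, k, trials):
    """Run independent trials on disjoint inputs and return the sorted relative errors."""
    errors = []
    for t in range(trials):
        # Each trial sees true_count distinct items that no other trial sees.
        stream = ((t, i) for i in range(true_count))
        errors.append(kmv_estimate(stream, k) / true_count - 1.0)
    return sorted(errors)


def quantile(sorted_values, q):
    """Nearest-rank quantile of an already sorted list, for 0 <= q <= 1."""
    index = min(len(sorted_values) - 1, int(q * len(sorted_values)))
    return sorted_values[index]


if __name__ == "__main__":
    errors = relative_errors(true_count=5000, k=256, trials=40)
    low, median, high = (quantile(errors, q) for q in (0.05, 0.5, 0.95))
    print(f"5%: {low:+.3f}  median: {median:+.3f}  95%: {high:+.3f}")
```

Repeating this at several values of true_count and plotting the quantiles against true_count gives error contours that can be compared with the bound an algorithm is expected to satisfy.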
> >> > > > > >
> >> > > > > > ===== Unit Testing =====
> >> > > > > >
> >> > > > > > Our unit tests are primarily quick tests to make sure that we
> >> > > > > > exercise all critical paths in the code and that key branches are
> >> > > > > > executed correctly. It is important that they execute relatively
> >> > > > > > fast as they are generally run on every code build. The
> >> > > > > > sketches-core repository alone has about 22 thousand statements,
> >> > > > > > over 1300 unit tests and code coverage of about 98.2% as measured
> >> > > > > > by Atlassian/Clover.  It is our goal for all of our code
> >> > > > > > repositories that are used in production that they have code
> >> > > > > > coverage greater than 90%.
> >> > > > > >
> >> > > > > > ===== Characterization Testing =====
> >> > > > > >
> >> > > > > > To test the probabilistic methods used to interpret the stochastic
> >> > > > > > behavior of our sketches, we maintain a separate characterization
> >> > > > > > repository dedicated to this purpose. Measuring accuracy, for
> >> > > > > > example, requires running thousands of trials at each of many
> >> > > > > > different points along the domain axis. Each trial compares its
> >> > > > > > estimated result against a known exact result, producing an error
> >> > > > > > for that trial. These error measurements are then fed into our
> >> > > > > > Quantiles sketch to capture the actual distribution of error at
> >> > > > > > that point along the axis. We then select quantile contours across
> >> > > > > > all the distributions at points along the axis. These contours can
> >> > > > > > then be plotted to reveal the shape of the actual error
> >> > > > > > distribution. These distributions are not at all Gaussian; in
> >> > > > > > fact, they can be quite complex. Nonetheless, they are checked
> >> > > > > > against the statistical guarantees inherent to the specific sketch
> >> > > > > > algorithm and its parameters. There are many examples of these
> >> > > > > > characterization error distributions on our website. These tests
> >> > > > > > can take a long time to run: from many minutes to hours, and in
> >> > > > > > some cases days. Currently, we have separate characterization
> >> > > > > > repositories for Java and C++ / Python.
> >> > > > > >
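The characterization procedure described above can be illustrated with a
minimal sketch. It is not taken from the DataSketches characterization
repositories, and every name in it is hypothetical. It assumes a toy
K-Minimum-Values distinct-count estimator as a stand-in for a real sketch,
and it computes quantiles by exact sorting where the proposal describes
feeding the errors into a Quantiles sketch.

```python
import heapq
import random


def kmv_estimate(n, k, rng):
    """Toy K-Minimum-Values estimate of n distinct items (illustrative only).

    Each distinct item is modeled as one uniform random hash in [0, 1).
    """
    if n <= k:
        return float(n)  # fewer than k items seen: the count is exact
    # Keep only the k smallest hash values; the last one is the k-th minimum.
    kth_min = heapq.nsmallest(k, (rng.random() for _ in range(n)))[-1]
    return (k - 1) / kth_min


def quantile(sorted_vals, q):
    """Nearest-rank quantile of an already sorted list (exact, not a sketch)."""
    idx = min(len(sorted_vals) - 1, int(q * len(sorted_vals)))
    return sorted_vals[idx]


def characterize(points, trials, k, quantiles, seed=1):
    """Run many trials at each point on the axis and return error contours.

    Returns {n: [relative error at each requested quantile]}.
    """
    rng = random.Random(seed)
    contours = {}
    for n in points:
        # Relative error of each trial against the known exact answer n.
        errors = sorted(kmv_estimate(n, k, rng) / n - 1.0 for _ in range(trials))
        contours[n] = [quantile(errors, q) for q in quantiles]
    return contours


if __name__ == "__main__":
    qs = [0.05, 0.5, 0.95]
    for n, row in characterize([100, 1000, 4000], 200, 64, qs).items():
        print(n, ["%+.3f" % e for e in row])
```

Each printed row is one point on the axis, with the relative error at the
5%, 50% and 95% quantiles. Plotting one column across all points gives one
contour, which can then be compared against the error bound claimed for
the chosen k.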
> >> > > > > > It is our goal that we perform this characterization analysis
> >> for all
> >> > > > of
> >> > > > > > our sketches.  By definition, the code that runs these
> >> > > characterization
> >> > > > > > tests is open-source so others can run these tests as well.
> We
> >> do
> >> > > not
> >> > > > have
> >> > > > > > formal releases of this code (because it is not production
> >> code)
> >> and
> >> > > > it is
> >> > > > > > not published to Maven Central.
> >> > > > > >
> >> > > > > > === Meritocracy ===
> >> > > > > >
> >> > > > > > DataSketches was initially developed based on requirements
> >> within
> >> > > > Yahoo. As
> >> > > > > > a project on GitHub, DataSketches has received contributions
> >> from
> >> > > > numerous
> >> > > > > > individual developers from around the world, dedicated
> research
> >> work
> >> > > > from
> >> > > > > > senior scientists at Amazon and Visa, and academic researchers
> >> from
> >> > > > > > Georgetown University, Princeton, and MIT.
> >> > > > > >
> >> > > > > > As a project under incubation, we are committed to expanding our
> >> > > > > > effort to build an environment that supports a meritocracy. We are
> >> > > > > > focused on engaging the community and other related projects for
> >> > > > > > support and contributions. Moreover, we are committed to ensuring
> >> > > > > > that contributors and committers to DataSketches come from a broad
> >> > > > > > mix of organizations through a merit-based decision process during
> >> > > > > > incubation. We believe strongly in the DataSketches premise of a
> >> > > > > > well-engineered and scientifically rigorous library that
> >> > > > > > implements these powerful algorithms, and we are committed to
> >> > > > > > growing an inclusive community of DataSketches contributors and
> >> > > > > > users.
> >> > > > > >
> >> > > > > > === Community ===
> >> > > > > >
> >> > > > > > Yahoo has a long history and active engagement in the Open
> >> Source
> >> > > > > > community. Major projects include: Vespa.ai, Bullet, Moloch,
> >> > > Panoptes,
> >> > > > > > Screwdriver.cd, Athenz, HaloDB, Maha, Mendel,
> >> TensorFlowOnSpark,
> >> > > > gifshot,
> >> > > > > > fluxible, as well as the creation, contribution and incubation
> >> of
> >> > > many
> >> > > > > > Apache projects such as Apache Hadoop, Pig, Bookkeeper, Oozie,
> >> > > > Zookeeper,
> >> > > > > > Omid, Pulsar, Traffic Server, Storm, Druid, and many more.
> >> > > > > >
> >> > > > > > Every day, DataSketches is actively used by organizations and
> >> > > > > > institutions around the world for batch and stream processing of
> >> > > > > > data. We believe acceptance will allow us to consolidate existing
> >> > > > > > DataSketches-related work, grow the DataSketches community, and
> >> > > > > > deepen connections between DataSketches and other open source
> >> > > > > > projects.
> >> > > > > >
> >> > > > > > === Introduction to the Core Developers & Contributors ===
> >> > > > > >
> >> > > > > > The core developers and contributors for DataSketches are from
> >> > > diverse
> >> > > > > > backgrounds, but primarily are scientists that love
> engineering
> >> and
> >> > > > > > engineers that love science. A large part of the value we
> bring
> >> comes
> >> > > > from
> >> > > > > > this synthesis.  These individuals have already contributed
> >> > > > substantially
> >> > > > > > to the code, algorithms, and/or mathematical proofs that form
> >> the
> >> > > > basis of
> >> > > > > > the library.
> >> > > > > >
> >> > > > > > This core group also forms the Initial Committers with write
> >> > > > > > permissions to the repository. Those marked with (*) meet weekly
> >> > > > > > to plan the research and engineering direction of the project.
> >> > > > > >
> >> > > > > > ==== Scientists That Love Engineering ====
> >> > > > > >
> >> > > > > > * Eshcar Hillel: Senior Research Scientist, Yahoo Labs, Israel.
> >> > > > > > Interests: distributed systems, scalable systems and platforms for
> >> > > > > > big data processing, concurrent algorithms and data structures.
> >> > > > > >
> >> > > > > > * Kevin Lang: (*) Distinguished Research Scientist, Yahoo
> Labs,
> >> > > > Sunnyvale,
> >> > > > > > California. Interests: algorithms, theoretical and applied
> >> > > mathematics,
> >> > > > > > encoding and compression theory, theoretical and applied
> >> performance
> >> > > > > > optimization.
> >> > > > > >
> >> > > > > > * Edo Liberty: (*) Director of Research, Head of Amazon AI
> >> Labs,
> >> Palo
> >> > > > Alto,
> >> > > > > > California. Manages the algorithms group at Amazon AI. We
> build
> >> > > > scalable
> >> > > > > > machine learning systems and algorithms which are used both
> >> > > internally
> >> > > > and
> >> > > > > > externally by customers of SageMaker, AWS's flagship machine
> >> learning
> >> > > > > > platform.
> >> > > > > >
> >> > > > > > * Jon Malkin: (*) Senior Scientist, Yahoo Labs, Sunnyvale.
> >> Interests:
> >> > > > > > Computational advertising, machine learning, speech
> >> recognition,
> >> > > > > > data-driven analysis, large scale experimentation, big data,
> >> > > > stream/complex
> >> > > > > > event processing
> >> > > > > >
> >> > > > > > * Justin Thaler: (*) Assistant Professor, Department of
> >> Computer
> >> > > > Science,
> >> > > > > > Georgetown University, Washington D.C. Interests: algorithms
> >> and
> >> > > > > > computational complexity, complexity theory, quantum
> >> algorithms,
> >> > > > private
> >> > > > > > data analysis, and learning theory, developing efficient
> >> streaming
> >> > > and
> >> > > > > > sketching algorithms
> >> > > > > >
> >> > > > > > ==== Engineers That Love Science ====
> >> > > > > >
> >> > > > > > * Roman Leventov: Senior Software Engineer,  Metamarkets /
> >> Snap.
> >> > > > Interests:
> >> > > > > > design and implementation of data storing and data processing
> >> > > > (distributed)
> >> > > > > > systems, performance optimization, CPU performance, mechanical
> >> > > > sympathy,
> >> > > > > > JVM performance, API design, databases, (concurrent) data
> >> structures,
> >> > > > > > memory management, garbage collection algorithms, language
> >> design and
> >> > > > > > runtimes (their tradeoffs), distributed systems (cloud)
> >> efficiency,
> >> > > > Linux,
> >> > > > > > code quality, code transformation, pure functional programming
> >> > > models,
> >> > > > > > Haskell.
> >> > > > > >
> >> > > > > > * Lee Rhodes: (*) Distinguished Architect, lead developer and
> >> founder
> >> > > > of
> >> > > > > > the DataSketches project, Yahoo, Sunnyvale, California.
> >> Interests:
> >> > > > > > streaming algorithms, mathematics, computer science, high
> >> quality and
> >> > > > high
> >> > > > > > performance code for the analysis of massive data, bridging
> the
> >> > > divide
> >> > > > > > between theory and practice.
> >> > > > > >
> >> > > > > > * Alexander Saydakov: (*) Senior Software Engineer, Yahoo,
> >> Sunnyvale,
> >> > > > > > California. Interests: applied mathematics, computer science,
> >> big
> >> > > data,
> >> > > > > > distributed systems.
> >> > > > > >
> >> > > > > > === Introduction to Additional Interested Contributors ===
> >> > > > > >
> >> > > > > > These folks have been intermittently involved and contributed,
> >> but
> >> > > are
> >> > > > > > strong supporters of this project.
> >> > > > > >
> >> > > > > > * Frank Grimes: GitHub ID: frankgrimes97
> >> > > > > >
> >> > > > > > * Mina Ghashami: [mina.ghashami at gmail dot com] Ph.D.
> >> Computer
> >> > > > Science,
> >> > > > > > Univ of Utah. Interests: Machine Learning, Data Mining, matrix
> >> > > > > > approximation, streaming algorithms, randomized linear
> algebra.
> >> > > > > >
> >> > > > > > * Christopher Musco: [christopher.musco at gmail dot com]
> Ph.D.
> >> > > > Computer
> >> > > > > > Science, Research Instructor, Princeton University. Interests:
> >> > > > algorithmic
> >> > > > > > foundations of data science and machine learning, efficient
> >> methods
> >> > > for
> >> > > > > > processing and understanding large datasets, often working at
> >> the
> >> > > > > > intersection of theoretical computer science, numerical linear
> >> > > > algebra, and
> >> > > > > > optimization.
> >> > > > > >
> >> > > > > > * Graham Cormode: [g.cormode at warwick.ac dot uk] Ph.D.
> >> Computer
> >> > > > Science,
> >> > > > > > Professor, Warwick University, Warwick, England. Interests:
> all
> >> > > > aspects of
> >> > > > > > the "data lifecycle", from data collection and cleaning,
> >> through
> >> > > > mining and
> >> > > > > > analytics. (Professor Cormode is one of the world’s leading
> >> > > scientists
> >> > > > in
> >> > > > > > sketching algorithms)
> >> > > > > >
> >> > > > > > === Alignment ===
> >> > > > > >
> >> > > > > > The DataSketches library already provides integrations and
> >> example
> >> > > > code for
> >> > > > > > Apache Hive, Apache Pig, Apache Spark and is deeply integrated
> >> into
> >> > > > Apache
> >> > > > > > Druid.
> >> > > > > >
> >> > > > > > == Known Risks ==
> >> > > > > >
> >> > > > > > The following subsections are specific risks that have been
> >> > > identified
> >> > > > by
> >> > > > > > the ASF that need to be addressed.
> >> > > > > >
> >> > > > > > === Risk: Orphaned Products ===
> >> > > > > >
> >> > > > > > The DataSketches library is presently used by a number of
> >> > > > organizations,
> >> > > > > > from small startups to Fortune 100 companies, to construct
> >> production
> >> > > > > > pipelines that must process and analyze massive data. Yahoo
> has
> >> a
> >> > > > long-term
> >> > > > > > commitment to continue to advance the DataSketches library;
> >> moreover,
> >> > > > > > DataSketches is seeing increasing interest, development, and
> >> adoption
> >> > > > from
> >> > > > > > many diverse organizations from around the world. Due to its
> >> growing
> >> > > > > > adoption, we feel it is quite unlikely that this project would
> >> become
> >> > > > > > orphaned.
> >> > > > > >
> >> > > > > > === Risk: Inexperience with Open Source ===
> >> > > > > >
> >> > > > > > Yahoo believes strongly in open source and the exchange of
> >> > > information
> >> > > > to
> >> > > > > > advance new ideas and work. Examples of this commitment are
> >> active
> >> > > open
> >> > > > > > source projects such as those mentioned above. With
> >> DataSketches, we
> >> > > > have
> >> > > > > > been increasingly open and forward-looking; we have published
> a
> >> > > number
> >> > > > of
> >> > > > > > papers about breakthrough developments in the science of
> >> streaming
> >> > > > > > algorithms (mentioned above) that also reference the
> >> DataSketches
> >> > > > library.
> >> > > > > > Our submission to the Apache Software Foundation is a logical
> >> > > > extension of
> >> > > > > > our commitment to open source software.
> >> > > > > >
> >> > > > > > Key committers at Yahoo with strong open source backgrounds
> >> include
> >> > > > Aaron
> >> > > > > > Gresch, Alan Carroll, Alessandro Bellina, Anastasia Braginsky,
> >> > > Andrews
> >> > > > > > Sahaya Albert, Arun S A G, Atul Mohan, Brad McMillen, Bryan
> >> Call,
> >> > > Daryn
> >> > > > > > Sharp, Dav Glass, David Carlin, Derek Dagit, Eric Payne,
> Eshcar
> >> > > Hillel,
> >> > > > > > Ethan Li, Fei Deng, Francis Christopher Liu, Francisco
> >> > > Perez-Sorrosal,
> >> > > > Gil
> >> > > > > > Yehuda, Govind Menon, Hang Yang, Jacob Estelle, Jai Asher,
> >> James
> >> > > > Penick,
> >> > > > > > Jason Kenny, Jay Pipes, Jim Rollenhagen, Joe Francis, Jon
> >> Eagles,
> >> > > > Kihwal
> >> > > > > > Lee, Kishorkumar Patil, Koji Noguchi, Kuhu Shukla, Michael
> >> Trelinski,
> >> > > > > > Mithun Radhakrishnan, Nathan Roberts, Ohad Shacham, Olga L.
> >> > > Natkovich,
> >> > > > > > Parth Kamlesh Gandhi, Rajan Dhabalia, Rohini Palaniswamy, Ruby
> >> Loo,
> >> > > > Ryan
> >> > > > > > Bridges, Sanket Chintapalli, Satish Subhashrao Saley, Shu Kit
> >> Chan,
> >> > > Sri
> >> > > > > > Harsha Mekala, Susan Hinrichs, Yonatan Gottesman, and many
> >> more.
> >> > > > > >
> >> > > > > > All of our core developers are committed to learning about the
> >> > > > > > Apache process and to giving back to the community.
> >> > > > > >
> >> > > > > > === Risk: Homogeneous Developers ===
> >> > > > > >
> >> > > > > > The majority of committers in this proposal belong to Yahoo
> >> > > > > > because DataSketches emerged from an internal Yahoo project. This
> >> > > > > > proposal also includes developers and contributors from other
> >> > > > > > companies, as well as people who are actively involved with other
> >> > > > > > Apache projects, such as Druid. We expect our entry into
> >> > > > > > incubation will allow us to expand the number of individuals and
> >> > > > > > organizations participating in DataSketches development.
> >> > > > > >
> >> > > > > > === Risk: Reliance on Salaried Developers ===
> >> > > > > >
> >> > > > > > Because the DataSketches library originated within Yahoo, it
> >> has
> >> been
> >> > > > > > developed primarily by salaried Yahoo developers and we expect
> >> that
> >> > > to
> >> > > > > > continue to be the case near term. However, since we placed
> >> this
> >> > > > library
> >> > > > > > into open-source we have had a number of significant
> >> contributions
> >> > > from
> >> > > > > > engineers and scientists from outside of Yahoo. We expect our
> >> > > reliance
> >> > > > on
> >> > > > > > Yahoo salaried developers will decrease over time.
> Nonetheless,
> >> Yahoo
> >> > > > is
> >> > > > > > committed to continue its strong support of this important
> >> project.
> >> > > > > >
> >> > > > > > === Risk: Lack of Relationship to other Apache Products ===
> >> > > > > >
> >> > > > > > DataSketches already directly interoperates with or utilizes
> >> several
> >> > > > > > existing Apache projects.
> >> > > > > >
> >> > > > > > * Build
> >> > > > > >    * Apache Maven
> >> > > > > >
> >> > > > > > * Integrations and adaptors for the following projects naturally
> >> > > > > > have them as dependencies
> >> > > > > >    * Apache Hive
> >> > > > > >    * Apache Pig
> >> > > > > >    * Apache Druid
> >> > > > > >    * Apache Spark
> >> > > > > >
> >> > > > > > * Additional dependencies for the above integrations and adaptors
> >> > > > > > include
> >> > > > > >    * Apache Hadoop
> >> > > > > >    * Apache Commons (Math)
> >> > > > > >
> >> > > > > > There is no other Apache project that we are aware of that
> >> duplicates
> >> > > > the
> >> > > > > > functionality of the DataSketches library.
> >> > > > > >
> >> > > > > > === Risk: An Excessive Fascination with the Apache Brand ===
> >> > > > > >
> >> > > > > > With this proposal we are not seeking attention or publicity.
> >> Rather,
> >> > > > we
> >> > > > > > firmly believe in the DataSketches library and concept and the
> >> > > ability
> >> > > > to
> >> > > > > > make the DataSketches library a powerful, yet simple-to-use
> >> toolkit
> >> > > for
> >> > > > > > data processing. While the DataSketches library has been open
> >> source,
> >> > > > we
> >> > > > > > believe putting code on GitHub can only go so far. We see the
> >> Apache
> >> > > > > > community, processes, and mission as critical for ensuring the
> >> > > > DataSketches
> >> > > > > > library is truly community-driven, positively impactful, and
> >> > > innovative
> >> > > > > > open source software. While Yahoo has taken a number of steps
> >> to
> >> > > > advance
> >> > > > > > its various open source projects, we believe the DataSketches
> >> library
> >> > > > > > project is a great fit for the Apache Software Foundation due
> >> to
> >> its
> >> > > > focus
> >> > > > > > on data processing and its relationships to existing ASF
> >> projects.
> >> > > > > >
> >> > > > > > === Risk: Cryptography ===
> >> > > > > >
> >> > > > > > DataSketches does not contain any cryptographic code and is
> not
> >> a
> >> > > > > > cryptographic product.
> >> > > > > >
> >> > > > > > == Documentation ==
> >> > > > > >
> >> > > > > > The following documentation is relevant to this proposal.
> >> Relevant
> >> > > > portions
> >> > > > > > of the documentation will be contributed to the Apache
> >> DataSketches
> >> > > > > > project.
> >> > > > > >
> >> > > > > > * DataSketches website: https://datasketches.github.io.
> >> > > > > >
> >> > > > > > * DataSketches website repository:
> >> > > > > > https://github.com/DataSketches/DataSketches.github.io
> >> > > > > >
> >> > > > > > We will need an Apache website for this documentation, similar to:
> >> > > > > >
> >> > > > > > * https://datasketches.apache.org
> >> > > > > >
> >> > > > > > == Initial Source ==
> >> > > > > >
> >> > > > > > The initial source for DataSketches which we will submit to
> the
> >> > > Apache
> >> > > > > > Foundation will include a number of repositories which are
> >> currently
> >> > > > hosted
> >> > > > > > under the GitHub.com/datasketches organization:
> >> > > > > >
> >> > > > > > All github.com/datasketches repositories including:
> >> > > > > >
> >> > > > > > * Java
> >> > > > > >    * sketches-core: This repository has the core sketching
> >> classes,
> >> > > > which
> >> > > > > > are leveraged by some of the other repositories. This
> >> repository
> >> has
> >> > > no
> >> > > > > > external dependencies outside of the DataSketches/memory
> >> repository,
> >> > > > Java
> >> > > > > > and TestNG for unit tests. This code is versioned and the
> >> latest
> >> > > > release
> >> > > > > > can be obtained from Maven Central.
> >> > > > > >    * memory: Low level, high-performance memory data-structure
> >> > > > management
> >> > > > > > primarily for off-heap.
> >> > > > > >    * sketches-android: This is a new repository dedicated to
> >> sketches
> >> > > > > > designed to be run in a mobile client, such as a cell phone.
> It
> >> is
> >> > > > still in
> >> > > > > > development and should be considered experimental.
> >> > > > > >    * sketches-hive: This repository contains Hive UDFs and
> >> UDAFs
> >> for
> >> > > > use
> >> > > > > > within Hadoop grid environments. This code has dependencies on
> >> > > > > > sketches-core as well as Hadoop and Hive. Users of this code
> >> are
> >> > > > advised to
> >> > > > > > use Maven to bring in all the required dependencies. This code
> >> is
> >> > > > versioned
> >> > > > > > and the latest release can be obtained from Maven Central.
> >> > > > > >    * sketches-pig: This repository contains Pig User Defined
> >> > > Functions
> >> > > > > > (UDF) for use within Hadoop grid environments. This code has
> >> > > > dependencies
> >> > > > > > on sketches-core as well as Hadoop and Pig. Users of this code
> >> are
> >> > > > advised
> >> > > > > > to use Maven to bring in all the required dependencies. This
> >> code is
> >> > > > > > versioned and the latest release can be obtained from Maven
> >> Central.
> >> > > > > >    * sketches-vector: This is a new repository dedicated to
> >> sketches
> >> > > > for
> >> > > > > > vector and matrix operations. It is still somewhat
> >> experimental.
> >> > > > > >    * characterization: This relatively new repository is for
> >> code
> >> > > that
> >> > > > we
> >> > > > > > use to characterize the accuracy and speed performance of the
> >> > > sketches
> >> > > > in
> >> > > > > > the library and is constantly being updated. Examples of the
> >> job
> >> > > > command
> >> > > > > > files used for various tests can be found in the
> >> src/main/resources
> >> > > > > > directory. Some of these tests can run for hours depending on their
> >> > > > > > configuration.
> >> > > > > >    * experimental: This repository is an experimental staging
> >> area
> >> > > for
> >> > > > code
> >> > > > > > that will eventually end up in another repository. This code
> is
> >> not
> >> > > > > > versioned and not registered with Maven Central.
> >> > > > > >    * sketches-misc: Demos and other code not related to
> >> production
> >> > > > > > deployment
> >> > > > > >
> >> > > > > > * C++ and Python
> >> > > > > >    * sketches-core-cpp: This is the C++/Python companion to the Java
> >> > > > > > sketches-core. These implementations are binary compatible with their
> >> > > > > > counterparts in Java. In other words, a sketch created and stored in
> >> > > > > > C++ can be opened and read in Java and vice versa. This repository
> >> > > > > > also has our Python adaptors, which wrap the C++ implementations and
> >> > > > > > make the high-performance C++ implementations available from Python.
> >> > > > > >    * sketches-postgres: This site provides the
> >> postgres-specific
> >> > > > adaptors
> >> > > > > > that wrap the C++ implementations making them available to the
> >> > > Postgres
> >> > > > > > database users.
> >> > > > > >    * characterization-cpp: This is the C++/Python companion to
> >> the
> >> > > Java
> >> > > > > > characterization repository.
> >> > > > > >    * experimental-cpp: This repository is an experimental
> >> staging
> >> > > area
> >> > > > for
> >> > > > > > C++ code that will eventually end up in another repository.
> >> > > > > >
> >> > > > > > * Command-Line Tools
> >> > > > > >    * sketches-cmd
> >> > > > > >    * homebrew-sketches
> >> > > > > >    * homebrew-sketches-cmd
> >> > > > > >
> >> > > > > > These projects have always been Apache 2.0 licensed. We intend
> >> to
> >> > > > bundle
> >> > > > > > all of these repositories since they are all complementary and
> >> should
> >> > > > be
> >> > > > > > maintained in one project. Prior to our submission, we will
> >> combine
> >> > > > all of
> >> > > > > > these projects into a new git repository.
> >> > > > > >
> >> > > > > > == Source and Intellectual Property Submission Plan ==
> >> > > > > >
> >> > > > > > Contributors to the DataSketches project have signed the Yahoo
> >> > > > > > Individual Contributor License Agreement
> >> > > > > > (https://yahoocla.herokuapp.com/) in order to contribute to the
> >> > > > > > project.
> >> > > > > >
> >> > > > > > With respect to trademark rights, Yahoo does not hold a
> >> trademark on
> >> > > > the
> >> > > > > > phrase “DataSketches.” Based on feedback and guidance we
> >> receive
> >> > > > during the
> >> > > > > > incubation process, we are open to renaming the project if
> >> necessary
> >> > > > for
> >> > > > > > trademark or other concerns, but we would prefer not to have
> to
> >> do
> >> > > > that.
> >> > > > > >
> >> > > > > > == External Dependencies ==
> >> > > > > >
> >> > > > > > All external dependencies are licensed under an Apache 2.0 or
> >> > > > > > Apache-compatible license. As we grow the DataSketches community
> >> > > > > > we will configure our build process to require and validate that
> >> > > > > > all contributions and dependencies are licensed under the Apache
> >> > > > > > 2.0 license or an Apache-compatible license.
> >> > > > > >
> >> > > > > > == Required Resources ==
> >> > > > > >
> >> > > > > > === Mailing Lists ===
> >> > > > > >
> >> > > > > > We currently use a mix of mailing lists. We will migrate our
> >> existing
> >> > > > > > mailing lists to the following:
> >> > > > > >
> >> > > > > > * dev@.apache
> >> > > > > >
> >> > > > > > * user@.apache
> >> > > > > >
> >> > > > > > * private@.apache
> >> > > > > >
> >> > > > > > * commits@.apache
> >> > > > > >
> >> > > > > > === Source Control ===
> >> > > > > >
> >> > > > > > The DataSketches team currently uses Git and would like to
> >> > > > > > continue to do so. We request a Git repository for DataSketches
> >> > > > > > with mirroring to GitHub enabled, similar to the following:
> >> > > > > >
> >> > > > > > * https://github.com/apache/incubator-datasketches.git
> >> > > > > >
> >> > > > > > === Issue Tracking ===
> >> > > > > >
> >> > > > > > We request the creation of an Apache-hosted JIRA. The DataSketches
> >> > > > > > project is currently using the public GitHub issue tracker and the
> >> > > > > > public Google Groups forum/sketches-user for issue tracking and
> >> > > > > > discussions. We will migrate and combine the issues from these two
> >> > > > > > sources into the Apache JIRA.
> >> > > > > >
> >> > > > > > Proposed Jira ID: DATASKETCHES
> >> > > > > >
> >> > > > > > == Initial Committers ==
> >> > > > > >
> >> > > > > > The following individuals have been extremely active in our
> >> > > > > > community and should have write (commit) permissions to the
> >> > > > > > repository.
> >> > > > > >
> >> > > > > > * Eshcar Hillel [eshcar at verizonmedia dot com]
> >> > > > > >
> >> > > > > > * Kevin Lang [langk at verizonmedia dot com]
> >> > > > > >
> >> > > > > > * Roman Leventov [roman.leventov at c.metamarkets dot com]
> >> > > > > >
> >> > > > > > * Edo Liberty [libertye at amazon dot com]
> >> > > > > >
> >> > > > > > * Jon Malkin [jmalkin at verizonmedia dot com]
> >> > > > > >
> >> > > > > > * Lee Rhodes [lrhodes at verizonmedia dot com] &
> >> > > > > > [leerho at gmail dot com]
> >> > > > > >
> >> > > > > > * Alexander Saydakov [saydakov at verizonmedia dot com]
> >> > > > > >
> >> > > > > > * Justin Thaler [justin.thaler at georgetown dot edu]
> >> > > > > >
> >> > > > > > == Affiliations ==
> >> > > > > >
> >> > > > > > The initial committers are from four organizations: Yahoo,
> >> Amazon,
> >> > > > > > Georgetown University, and Metamarkets/Snap.
> >> > > > > >
> >> > > > > > === Champion ===
> >> > > > > > (Recommended to me: )
> >> > > > > >
> >> > > > > > Liang Chen, Vice President of Apache CarbonData,
> >> > > > > > [chenliang613 at apache dot org]
> >> > > > > > Jean-Baptiste Onofré, [jb at nanthrax dot net]
> >> > > > > >
> >> > > > > > === Nominated Mentors ===
> >> > > > > > (Recommended to me: )
> >> > > > > >
> >> > > > > > Liang Chen, Vice President of Apache CarbonData,
> >> > > > > > [chenliang613 at apache dot org]
> >> > > > > > Jean-Baptiste Onofré, jb at nanthrax dot net
> >> > > > > > Gil Yehuda, gyehuda at verizonmedia dot com
> >> > > > > >
> >> > > > > > === Sponsoring Entity ===
> >> > > > > >
> >> > > > > > * The Apache Incubator    **** This is our 1st choice ****
> >> > > > > >
> >> > > > > > * Apache Druid. The incubating Apache Druid project might also
> >> be a
> >> > > > logical
> >> > > > > > sponsor. However, DataSketches has applications in many areas
> >> of
> >> > > > computing
> >> > > > > > outside of Druid so our preference and recommendation is that
> >> > > > DataSketches
> >> > > > > > would ultimately be a top-level Apache project.
> >> > > > > >
> >> > > > > > ________________
> >> > > > > > [1] In 2017 Verizon acquired Yahoo and merged it with previously
> >> > > > > > acquired AOL. The merged entity was originally called Oath, Inc.,
> >> > > > > > but has recently been renamed Verizon Media, Inc., a wholly-owned
> >> > > > > > subsidiary of Verizon, Inc. Since Yahoo is the more recognized
> >> > > > > > name, references in this document to Yahoo are also references to
> >> > > > > > Verizon Media, Inc.
> >> > > > > >
> >> > > > > > On Fri, Feb 22, 2019 at 9:35 PM Kenneth Knowles <kenn@> wrote:
> >> > > > > >
> >> > > > > > > The subject line has me interested already. Follow examples
> >> like
> >> > > this
> >> > > > > > > maybe?
> >> > > > > > >
> >> > > > > > > 1. https://lists.apache.org/thread.html/a5db74cc9e5ae89b3bfa5f4b07bfcc18dae84b7098232fb897cd47b7@%3Cgeneral.incubator.apache.org%3E
> >> > > > > > > 2. https://lists.apache.org/thread.html/5a7f6a218b11a1cac61fbd53f4c995fd7716f8ad3751cf9f171ebd57@%3Cgeneral.incubator.apache.org%3E
> >> > > > > > >
> >> > > > > > > Kenn
> >> > > > > > >
> >> > > > > > > On Fri, Feb 22, 2019 at 8:05 PM leerho <leerho@> wrote:
> >> > > > > > >
> >> > > > > > > > I'll try again ... :)
> >> > > > > > > >
> >> > > > > > > > On Fri, Feb 22, 2019 at 8:00 PM Ted Dunning <ted.dunning@> wrote:
> >> > > > > > > >
> >> > > > > > > >> It didn't make it again
> >> > > > > > > >>
> >> > > > > > > >> On Fri, Feb 22, 2019, 8:35 PM leerho <leerho@> wrote:
> >> > > > > > > >>
> >> > > > > > > >> > I'm not sure the attached document made it through.
> >> > > > > > > >> >
> >> > > > > > > >> > On Fri, Feb 22, 2019 at 7:28 PM leerho <leerho@> wrote:
> >> > > > > > > >> >
> >> > > > > > > >> > >
> >> > > > > > > >> > >
> >> > > > > > > >> >
> >> > > > > > > >>
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > ---------------------------------------------------------------------
> >> > > > > > > > To unsubscribe, e-mail: general-unsubscribe@.apache
> >> > > > > > > > For additional commands, e-mail: general-help@.apache
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > > >
> >> > > >
> >> > > >
> >> > >
> >> > --
> >> > From my cell phone.
> >> >
> >>
> >>
> >>
>
>
>
>
>
> --
> Sent from: http://apache-incubator-general.996316.n3.nabble.com/
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> For additional commands, e-mail: general-help@incubator.apache.org
>
>

Re: DataSketches Proposal - Google Docs Link

Posted by Liang Chen <ch...@gmail.com>.

Hi Kenneth

Please try this link :
https://docs.google.com/document/d/1_cnesVLtKqPeUYxJvsd_2MTFwgeC1wUqI6cDPCbBRSM/edit#heading=h.97rxea60t2yw

Regards
Liang


Kenneth Knowles wrote
> I could not access that document. I suggest you need to turn on link
> sharing.
> 
> Kenn
> 
> On Mon, Feb 25, 2019 at 12:00 PM leerho@ <leerho@> wrote:
> 
>> Try this link:
>> https://docs.google.com/document/d/19JKevzFQNcaLA51LFLUlP1hzdFDW7oDJrJO8N6weDv8/edit?usp=sharing
>>
>>
>> On 2019/02/25 05:55:50, leerho <leerho@> wrote:
>> > Yes I will try that tomorrow.
>> >
>> > On Sun, Feb 24, 2019 at 7:34 PM Kenneth Knowles <kenn@> wrote:
>> >
>> > > Can you share the Google doc with the proposal? Per Ted's advice, we
>> can
>> > > iterate quickly there and move it to the wiki when it becomes a bit
>> more
>> > > stable.
>> > >
>> > > Kenn
>> > >
>> > > On Fri, Feb 22, 2019 at 10:21 PM leerho@ <leerho@> wrote:
>> > >
>> > > > Thanks for the offer.  I am a neophyte at this process and email
>> app!   I
>> > > > could use a lot of help getting this off the ground!  Also, I'm not
>> sure
>> > > > that Mr. Chen and Mr. Onofré have fully accepted taking this on :)
>> > > >
>> > > > Lee.
>> > > >
>> > > > On 2019/02/23 06:03:58, Kenneth Knowles <kenn@> wrote:
>> > > > > Nice.
>> > > > >
>> > > > > I would very much like to help mentor this project, though you
>> already
>> > > > have
>> > > > > a couple good ones.
>> > > > >
>> > > > > I concur with incubator as sponsoring entity.
>> > > > >
>> > > > > Kenn (VP Apache Beam)
>> > > > >
>> > > > > On Fri, Feb 22, 2019 at 9:45 PM leerho <leerho@> wrote:
>> > > > >
>> > > > > > I didn't realize that this mail list does not accept PDF files,
>> > > > apparently
>> > > > > > only text.  So let me try one more time ... :)  Please let me
>> know if
>> > > > > > this works!
>> > > > > >
>> > > > > >
>> > > > > > = Apache DataSketches Proposal[1] =
>> > > > > >
>> > > > > > == Abstract ==
>> > > > > >
>> > > > > > DataSketches.GitHub.io is an open source, high-performance
>> library
>> > > of
>> > > > > > stochastic streaming algorithms commonly called "sketches" in
>> the
>> > > data
>> > > > > > sciences. Sketches are small, stateful programs that process
>> massive
>> > > > data
>> > > > > > as a stream and can provide approximate answers, with
>> mathematical
>> > > > > > guarantees, to computationally difficult queries
>> orders-of-magnitude
>> > > > faster
>> > > > > > than traditional, exact methods.
>> > > > > >
>> > > > > > This proposal is to move DataSketches to the Apache Software
>> > > > > > Foundation(ASF) transferring ownership of its copyright
>> intellectual
>> > > > > > property to the ASF.  Thereafter, DataSketches would be
>> officially
>> > > > known as
>> > > > > > Apache DataSketches and its evolution and governance would come
>> under
>> > > > the
>> > > > > > rules and guidance of the ASF.
>> > > > > >
>> > > > > > == Introduction ==
>> > > > > >
>> > > > > > The DataSketches library contains carefully crafted
>> implementations
>> > > of
>> > > > > > sketch algorithms that meet rigorous standards of quality and
>> > > > performance
>> > > > > > and provide capabilities required for large-scale production
>> systems
>> > > > that
>> > > > > > must process and analyze massive data. The DataSketches core
>> > > > repository is
>> > > > > > written in Java with a parallel core repository written in C++
>> that
>> > > > > > includes Python wrappers. The DataSketches library also
>> includes
>> > > > special
>> > > > > > repositories for extending the core library for Apache Hive and
>> > > Apache
>> > > > Pig.
>> > > > > > The sketches developed in the different languages share a
>> common
>> > > binary
>> > > > > > storage format so that sketches created and stored in Java, for
>> > > > example,
>> > > > > > can be fully used in C++, and vice versa.  Because the stored
>> sketch
>> > > > > > "images" are just a "blob" of bytes (similar to picture
>> images),
>> they
>> > > > can
>> > > > > > be shared across many different systems, languages and
>> platforms.
>> > > > > >
>> > > > > > The DataSketches documentation website,
>> > > https://datasketches.github.io
>> > > > ,
>> > > > > > includes general tutorials, a comprehensive research section
>> with
>> > > > > > references to relevant academic papers, extensive examples for
>> using
>> > > > the
>> > > > > > core library directly as well as examples for accessing the
>> library
>> > > in
>> > > > > > Hive, Pig, and Apache Spark.
>> > > > > >
>> > > > > > The DataSketches library also includes a characterization
>> repository
>> > > > for
>> > > > > > long running test programs that are used for studying accuracy
>> and
>> > > > > > performance of these sketches over wide ranges of input
>> variables.
>> > > The
>> > > > data
>> > > > > > produced by these programs is used for generating the many
>> > > performance
>> > > > > > plots contained in the documentation website and for academic
>> > > > > > publications.
>> > > > > >
>> > > > > > The code repositories used for production are versioned and
>> published
>> > > > to
>> > > > > > Maven Central on periodic intervals as the library evolves.
>> > > > > >
>> > > > > > The DataSketches library also includes several experimental
>> > > > repositories
>> > > > > > for use-cases outside the large-scale systems environments,
>> such
>> as
>> > > > > > sketches for mobile, IoT devices (Android), command-line access
>> of
>> > > the
>> > > > > > sketch library, and an experimental repository for vector-based
>> > > > sketches
>> > > > > > that performs approximate Singular Value Decomposition (SVD)
>> analysis
>> > > > that
>> > > > > > could potentially be used in Machine Learning (ML)
>> applications.
>> > > > > >
>> > > > > > == Background ==
>> > > > > >
>> > > > > > The DataSketches library was started in 2012 as an internal Yahoo
>> > > project
>> > > > to
>> > > > > > dramatically reduce time and resources required for distinct
>> (unique)
>> > > > > > counting.  An extensive search on the Internet at the time
>> yielded a
>> > > > number
>> > > > > > of theoretical papers on stochastic streaming algorithms with
>> > > > pseudocode
>> > > > > > examples, but we did not find any usable open-source code of
>> the
>> > > > quality we
>> > > > > > felt we needed for our internal production systems.  So we
>> started a
>> > > > small
>> > > > > > project (one person) to develop our own sketches working
>> directly
>> > > from
>> > > > > > published theoretical papers.
>> > > > > >
>> > > > > > The DataSketches library was designed from the start with the
>> > > > objective of
>> > > > > > making these algorithms, usually only described in theoretical
>> > > papers,
>> > > > > > easily accessible to systems developers for use in our internal
>> > > > production
>> > > > > > systems. By necessity, the code had to be of the highest
>> quality
>> and
>> > > > > > thoroughly tested. The wide variety of our internal production
>> > > systems
>> > > > > > drove the requirement that the sketch implementations had to
>> have an
>> > > > > > absolute minimum of external, run-time dependencies in order to
>> > > > simplify
>> > > > > > integration and troubleshooting.
>> > > > > >
>> > > > > > Our internal experiments demonstrated dramatic positive impact
>> on the
>> > > > > > performance of our systems.  As a result, the DataSketches
>> library
>> > > > quickly
>> > > > > > evolved to include different types of sketches for different
>> types of
>> > > > > > queries, such as frequent-items (a.k.a, heavy-hitters)
>> algorithms,
>> > > > > > quantile/histogram algorithms, and weighted and unweighted
>> sampling
>> > > > > > algorithms.
>> > > > > >
>> > > > > > We quickly discovered that developing these sketch algorithms
>> to
>> be
>> > > > truly
>> > > > > > robust in production environments is quite difficult and
>> requires
>> > > deep
>> > > > > > understanding of the underlying mathematics and statistics as
>> well as
>> > > > > > extensive experience in developing high quality code for 24/7
>> > > > production
>> > > > > > systems. This is a difficult combination of skills for any one
>> > > > organization
>> > > > > > to collect and maintain over time. It became clear that this
>> > > technology
>> > > > > > needed a community larger than Yahoo to evolve.  In November,
>> 2015,
>> > > > this
>> > > > > > factor, along with Yahoo’s strong experience and support of
>> open
>> > > > source,
>> > > > > > led to the decision to open source this technology under an
>> Apache
>> > > 2.0
>> > > > > > license on GitHub. Since that time our community has expanded
>> > > > considerably
>> > > > > > and the key contributors to this effort include leading
>> research
>> > > > > > scientists from a number of universities as well as
>> practitioners and
>> > > > > > researchers from a number of major corporations. The core of
>> this
>> > > > group is
>> > > > > > very active as we meet weekly to discuss research directions
>> and
>> > > > > > engineering priorities.
>> > > > > >
>> > > > > > It is important to note that our internal systems at Yahoo use
>> the
>> > > > current
>> > > > > > public GitHub open source DataSketches library and not an
>> internal
>> > > > version
>> > > > > > of the code.
>> > > > > >
>> > > > > > The close collaboration of scientific research and engineering
>> > > > development
>> > > > > > experience with actual massive-data processing systems has also
>> > > > produced
>> > > > > > new research publications in the field of stochastic streaming
>> > > > algorithms,
>> > > > > > for example:
>> > > > > >
>> > > > > > * Daniel Anderson, Pryce Bevan, Kevin J. Lang, Edo Liberty, Lee
>> > > > Rhodes, and
>> > > > > > Justin Thaler. A high-performance algorithm for identifying
>> frequent
>> > > > items
>> > > > > > in data streams. In ACM IMC 2017.
>> > > > > >
>> > > > > > * Anirban Dasgupta, Kevin J. Lang, Lee Rhodes, and Justin
>> Thaler. A
>> > > > > > framework for estimating stream expression cardinalities. In
>> > > *EDBT/ICDT
>> > > > > > Proceedings ‘16 *, pages 6:1–6:17, 2016.
>> > > > > >
>> > > > > > * Mina Ghashami, Edo Liberty, Jeff M. Phillips. Efficient
>> Frequent
>> > > > > > Directions Algorithm for Sparse Matrices. In ACM SIGKDD
>> Proceedings
>> > > > ‘16,
>> > > > > > pages 845-854, 2016.
>> > > > > >
>> > > > > > * Zohar S. Karnin, Kevin J. Lang, and Edo Liberty. Optimal
>> quantile
>> > > > > > approximation in streams. In IEEE FOCS Proceedings ‘16, pages
>> 71–78,
>> > > > 2016.
>> > > > > >
>> > > > > > * Kevin J Lang. Back to the future: an even more nearly optimal
>> > > > cardinality
>> > > > > > estimation algorithm. arXiv preprint
>> > > https://arxiv.org/abs/1708.06839,
>> > > > > > 2017.
>> > > > > >
>> > > > > > * Edo Liberty. Simple and deterministic matrix sketching. In
>> ACM
>> KDD
>> > > > > > Proceedings ‘13, pages 581– 588, 2013.
>> > > > > >
>> > > > > > * Edo Liberty, Michael Mitzenmacher, Justin Thaler, and
>> Jonathan
>> > > > Ullman.
>> > > > > > Space lower bounds for itemset frequency sketches. In ACM PODS
>> > > > Proceedings
>> > > > > > ‘16, pages 441–454, 2016.
>> > > > > >
>> > > > > > * Michael Mitzenmacher, Thomas Steinke, and Justin Thaler.
>> > > Hierarchical
>> > > > > > heavy hitters with the space saving algorithm. In SIAM ALENEX
>> > > > Proceedings
>> > > > > > ‘12, pages 160–174, 2012.
>> > > > > >
>> > > > > > == The Rationale for Sketches ==
>> > > > > >
>> > > > > > In the analysis of big data there are often problem queries
>> that
>> > > don’t
>> > > > > > scale because they require huge compute resources and time to
>> > > generate
>> > > > > > exact results. Examples include count distinct, quantiles, most
>> > > > frequent
>> > > > > > items, joins, matrix computations, and graph analysis.
>> > > > > >
>> > > > > > If we can loosen the requirement of “exact” results from our
>> queries
>> > > > and be
>> > > > > > satisfied with approximate results, within some well understood
>> > > bounds
>> > > > of
>> > > > > > error, there is an entire branch of mathematics and data
>> science
>> that
>> > > > has
>> > > > > > evolved around developing algorithms that can produce
>> approximate
>> > > > results
>> > > > > > with mathematically well-defined error properties.
>> > > > > >
>> > > > > > The additional requirements that these algorithms must be small
>> > > > > > (compared to the size of the input data), sublinear (the size of the
>> > > > > > sketch must grow at a slower rate than the size of the input stream),
>> > > > > > streaming (they can only touch each data item once), and mergeable
>> > > > > > (suitable for distributed processing) define a class of algorithms
>> > > > > > that can be described as small, stochastic, streaming, sublinear,
>> > > > > > mergeable algorithms, commonly called sketches (they also have other
>> > > > > > names, but we will use the term sketches from here on).
>> > > > > >
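
To make the four properties above concrete, here is a minimal Python sketch of a K-Minimum-Values (KMV) distinct counter. It is an illustration only, not the DataSketches implementation or its API: the class name `KMVSketch`, its methods, the default `k`, and the choice of SHA-256 as the hash are all assumptions made for readability. It keeps at most `k` hash values (small, sublinear), looks at each item once (streaming), and two sketches can be combined without revisiting the data (mergeable).

```python
import hashlib


class KMVSketch:
    """Toy K-Minimum-Values distinct-count sketch (hypothetical, for illustration)."""

    def __init__(self, k=1024):
        self.k = k            # maximum number of hash values retained
        self.mins = set()     # the smallest hash values seen so far, in [0, 1]
        self.theta = 1.0      # largest retained hash value (the current threshold)

    @staticmethod
    def _hash(item):
        # Assumption: SHA-256 stands in for a fast, well-mixed hash function.
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2.0 ** 64

    def update(self, item):
        """Process one stream item; each item is touched exactly once."""
        h = self._hash(item)
        if h in self.mins:
            return                      # duplicate item, state unchanged
        if len(self.mins) < self.k:
            self.mins.add(h)            # still filling up
        elif h < self.theta:
            self.mins.remove(self.theta)
            self.mins.add(h)            # replace the largest retained value
        else:
            return                      # too large to matter
        self.theta = max(self.mins)

    def merge(self, other):
        """Combine two sketches built with the same k into a new sketch."""
        merged = KMVSketch(self.k)
        merged.mins = set(sorted(self.mins | other.mins)[: self.k])
        merged.theta = max(merged.mins) if merged.mins else 1.0
        return merged

    def estimate(self):
        """Approximate number of distinct items seen."""
        if len(self.mins) < self.k:
            return float(len(self.mins))    # fewer than k distinct items: exact
        return (self.k - 1) / self.theta


if __name__ == "__main__":
    a, b = KMVSketch(), KMVSketch()
    for i in range(0, 30000):
        a.update(i)
    for i in range(20000, 50000):
        b.update(i)
    print("estimated distinct items in the union:", round(a.merge(b).estimate()))
```

The merged sketch holds the same state it would have reached had it seen both streams directly, which is what makes this kind of structure suitable for distributed processing.
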
>> > > > > > To be truly streaming and be able to process data in a single
>> pass,
>> > > > > > sketches must make absolute minimum assumptions about the input
>> > > stream.
>> > > > > > This is critically important, as there is no “second chance” to
>> > > > process the
>> > > > > > data.
>> > > > > >
>> > > > > > For example, sketches should not make assumptions about the
>> order of
>> > > > stream
>> > > > > > items, the stream length, the dynamic range of values, or the
>> > > > distribution
>> > > > > > of item occurrence frequencies. Sketches should be tolerant of
>> NaNs,
>> > > > Nulls
>> > > > > > and empty objects. About the only thing that the sketch needs
>> to
>> know
>> > > > about
>> > > > > > the stream is how to extract items from it and what type the
>> item is,
>> > > > e.g.,
>> > > > > > is it a numeric value or a string.
>> > > > > >
>> > > > > > As far as the sketch is concerned, the input stream is a
>> sequence of
>> > > > items
>> > > > > > in some unknown random order with unknown random values.
>> > > > > >
>> > > > > > The sketch is essentially a complex state machine and, combined with the
>> > > > > > random input stream, defines a stochastic process. We then apply
>> > > > > > probabilistic methods to interpret the states of the stochastic
>> > > > process in
>> > > > > > order to extract useful information about the input stream
>> itself.
>> > > The
>> > > > > > resulting information will be approximate, but we also use
>> additional
>> > > > > > probabilistic methods to extract an estimate of the likely
>> > > probability
>> > > > > > distribution of error.
>> > > > > >
>> > > > > > There is a significant scientific contribution here in defining the
>> > > > > > state machine, understanding the resulting stochastic process,
>> > > > > > developing the probabilistic methods, and proving mathematically that
>> > > > > > it all works!
>> > > > > > This is why the scientific contributors to this project are a
>> > > critical
>> > > > and
>> > > > > > strategic component to our success.  The development engineers
>> > > > translate
>> > > > > > the concepts of the proposed state machine and probabilistic
>> methods
>> > > > into
>> > > > > > production-quality code. Even more important, they work closely
>> with
>> > > > the
>> > > > > > scientists, feeding back system and user requirements, which
>> leads
>> > > not
>> > > > only
>> > > > > > to superior product design, but to new science as well.  A number of
>> > > > > > the scientific papers our members have published (see above) are a
>> > > > > > direct result of this close collaboration.
>> > > > > >
>> > > > > > Because sketches are small they can be processed extremely
>> fast,
>> > > often
>> > > > many
>> > > > > > orders-of-magnitude faster than traditional exact computations.
>> For
>> > > > > > interactive queries there may not be other viable alternatives,
>> and
>> > > in
>> > > > the
>> > > > > > case of real-time analysis, sketches are the only known
>> solution.
>> > > > > >
>> > > > > > For any system that needs to extract useful information from massive
>> > > > > > data, sketches are essential tools that should be tightly integrated
>> into
>> > > the
>> > > > > > system’s analysis capabilities. This technology has helped
>> Yahoo
>> > > > > > successfully reduce data processing times from days to hours or
>> > > > minutes on
>> > > > > > a number of its internal platforms and has enabled subsecond
>> queries
>> > > on
>> > > > > > real-time platforms that would have been infeasible without
>> sketches.
>> > > > > >
>> > > > > > == The Rationale for Apache DataSketches ==
>> > > > > >
>> > > > > > Other open source implementations of sketch algorithms can be
>> found
>> > > on
>> > > > the
>> > > > > > Internet. However, we have not yet found any open source
>> > > > implementations
>> > > > > > that are as comprehensive, engineered with the quality required
>> for
>> > > > > > production systems, and with usable and guaranteed error
>> properties.
>> > > > Large
>> > > > > > Internet companies, such as Google and Facebook, have published
>> > > papers
>> > > > on
>> > > > > > sketching; however, their implementations of their published
>> > > > algorithms are
>> > > > > > proprietary and not available as open source.
>> > > > > >
>> > > > > > The DataSketches library already provides integrations with a
>> number
>> > > of
>> > > > > > major Apache data processing platforms such as Apache Hive,
>> Apache
>> > > Pig,
>> > > > > > Apache Spark and Apache Druid, and is also integrated with a
>> number
>> > > of
>> > > > > > other open source data processing platforms such as Splice
>> Machine,
>> > > > GCHQ
>> > > > > > Gaffer and PostgreSQL.
>> > > > > >
>> > > > > > We believe that having DataSketches as an Apache project will
>> provide
>> > > > an
>> > > > > > immediate, worthwhile, and substantial contribution to the open
>> > > source
>> > > > > > community, will have a better opportunity to provide a
>> meaningful
>> > > > > > contribution to both the science and engineering of sketching
>> > > > algorithms,
>> > > > > > and integrate with other Apache projects.  In addition, this is
>> a
>> > > > > > significant opportunity for Apache to be the "go-to"
>> destination
>> for
>> > > > users
>> > > > > > that want to leverage this exciting technology.
>> > > > > >
>> > > > > > == Initial Goals ==
>> > > > > >
>> > > > > > We are breaking our initial goals into short-term (2-6 months)
>> and
>> > > > > > intermediate to long-term (6 months to 2 years):
>> > > > > >
>> > > > > > Our short-term goals include:
>> > > > > >
>> > > > > > * Understanding and adapting to the Apache development process
>> and
>> > > > > > structures.
>> > > > > >
>> > > > > > * Start refactoring the codebase and move the code of the various
>> > > > > > DataSketches repositories to an Apache Git repository.
>> > > > > >
>> > > > > > * Continue development of new features, functions, and fixes.
>> > > > > >
>> > > > > > * Specific sub-projects (e.g., C++ and Python) will continue to
>> be
>> > > > > > developed and expanded.
>> > > > > >
>> > > > > >
>> > > > > > The intermediate to long term goals include:
>> > > > > >
>> > > > > > * Completing the design and implementation of the C++ sketches
>> to
>> > > > > > complement what is already available in Java, and the Python
>> wrappers
>> > > > of
>> > > > > > those C++ sketches.
>> > > > > >
>> > > > > > * Expanding the C++ build framework to include Windows and the
>> > > popular
>> > > > > > Linux variants.
>> > > > > >
>> > > > > > * Continued engagement with the scientific research community
>> on
>> the
>> > > > > > development of new algorithms for computationally difficult
>> problems
>> > > > that
>> > > > > > heretofore have not had a sketching solution.
>> > > > > >
>> > > > > > == Current Status ==
>> > > > > >
>> > > > > > The DataSketches GitHub project has been quite successful.  As
>> of
>> > > this
>> > > > > > writing (Feb, 2019) the number of downloads measured by the
>> Nexus
>> > > > > > Repository Manager at https://oss.sonatype.org has grown by
>> nearly a
>> > > > > > factor
>> > > > > > of 10 over the past year to about 55 thousand per month. The
>> > > > > > DataSketches/sketches-core repository has about 560 stars and
>> 141
>> > > > forks,
>> > > > > > which is pretty good for a highly specialized library.
>> > > > > >
>> > > > > > === Development Practices ===
>> > > > > >
>> > > > > > ==== Source Control ====
>> > > > > >
>> > > > > > All of our developers have extensive experience with Git
>> version
>> > > > control
>> > > > > > and follow accepted practices for use of Pull Requests (PRs),
>> code
>> > > > reviews
>> > > > > > and commits to master, for example.
>> > > > > >
>> > > > > > ==== Testing ====
>> > > > > >
>> > > > > > Sketches, by their nature, are probabilistic programs and don’t
>> > > > necessarily
>> > > > > > behave deterministically.  For some of the sketches we
>> intentionally
>> > > > insert
>> > > > > > random noise into the code as this gives us the mathematical
>> > > properties
>> > > > > > that we need to guarantee accuracy.  This can make the behavior
>> of
>> > > > these
>> > > > > > algorithms quite unintuitive and provides significant
>> challenges
>> to
>> > > the
>> > > > > > developer who wishes to test these algorithms for correctness.
>> As a
>> > > > result,
>> > > > > > our testing strategy includes two major components: unit tests,
>> and
>> > > > > > characterization tests.
>> > > > > >
>> > > > > > ===== Unit Testing =====
>> > > > > >
>> > > > > > Our unit tests are primarily quick tests to make sure that we
>> > > exercise
>> > > > all
>> > > > > > critical paths in the code and that key branches are executed
>> > > > correctly. It
>> > > > > > is important that they execute relatively fast as they are
>> generally
>> > > > run on
>> > > > > > every code build. The sketches-core repository alone has about
>> 22
>> > > > thousand
>> > > > > > statements, over 1300 unit tests and code coverage of about
>> 98.2% as
>> > > > > > measured by Atlassian/Clover.  It is our goal for all of our
>> code
>> > > > > > repositories that are used in production that they have code
>> coverage
>> > > > > > greater than 90%.
>> > > > > >
>> > > > > > ===== Characterization Testing =====
>> > > > > >
>> > > > > > In order to test the probabilistic methods that are used to
>> interpret
>> > > > the
>> > > > > > stochastic behaviors of our sketches we have a separate
>> > > > characterization
>> > > > > > repository that is dedicated to this.  To measure accuracy, for
>> > > > example,
>> > > > > > requires running thousands of trials at each of many different
>> points
>> > > > along
>> > > > > > the domain axis. Each trial compares its estimated results
>> against a
>> > > > known
>> > > > > > exact result producing an error for that trial.  These error
>> > > > measurements
>> > > > > > are then fed into our Quantiles sketch to capture the actual
>> > > > distribution
>> > > > > > of error at that point along the axis. We then select quantile
>> > > contours
>> > > > > > across all the distributions at points along the axis.  These
>> > > contours
>> > > > can
>> > > > > > then be plotted to reveal the shape of the actual error
>> distribution.
>> > > > These
>> > > > > > distributions are not at all Gaussian, in fact they can be
>> quite
>> > > > complex.
>> > > > > > Nonetheless, these distributions are then checked against our
>> > > > statistical
>> > > > > > guarantees inherent to the specific sketch algorithm and its
>> > > > parameters.
>> > > > > > There are many examples of these characterization error
>> distributions
>> > > > on
>> > > > > > our website. The runtimes of these tests can be very long and
>> can
>> > > range
>> > > > > > from many minutes to hours, and some can run for days.
>> Currently, we
>> > > > have
>> > > > > > separate characterization repositories for Java and C++ /
>> Python.
>> > > > > >
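
The characterization procedure described above (many trials at each point, one error per trial, quantile contours of the error distribution) can be sketched in a few lines of Python. This is a hedged illustration, not the project's characterization code: the function names are hypothetical, the estimator is a simulated K-Minimum-Values distinct counter standing in for a real sketch, and the error quantiles are read from a sorted list rather than from a Quantiles sketch as the proposal describes.

```python
import heapq
import random


def kmv_relative_error(n, k, rng):
    """One trial: relative error of a simulated KMV estimate for n distinct items.

    Assumption: hashing n distinct items behaves like n independent uniform
    draws in [0, 1), so the draws replace a real hash function. Requires n >= k.
    """
    kth_smallest = heapq.nsmallest(k, (rng.random() for _ in range(n)))[-1]
    estimate = (k - 1) / kth_smallest
    return estimate / n - 1.0          # error relative to the known exact answer


def characterize(points, k, trials, quantile_ranks, seed=1):
    """Return {n: [error at each quantile rank]} for every n in points."""
    rng = random.Random(seed)          # fixed seed so a run can be repeated
    contours = {}
    for n in points:
        errors = sorted(kmv_relative_error(n, k, rng) for _ in range(trials))
        contours[n] = [
            errors[min(trials - 1, int(q * trials))] for q in quantile_ranks
        ]
    return contours


if __name__ == "__main__":
    ranks = [0.05, 0.5, 0.95]
    result = characterize([500, 2000, 8000], k=64, trials=200, quantile_ranks=ranks)
    for n, values in result.items():
        print(n, ["%+.3f" % v for v in values])
```

Plotting each column of the result against `n` gives one contour per quantile rank, which is the kind of curve that would then be compared against a sketch's stated error bounds.
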
>> > > > > > It is our goal that we perform this characterization analysis
>> for all
>> > > > of
>> > > > > > our sketches.  By definition, the code that runs these
>> > > characterization
>> > > > > > tests is open-source so others can run these tests as well.  We
>> do
>> > > not
>> > > > have
>> > > > > > formal releases of this code (because it is not production
>> code)
>> and
>> > > > it is
>> > > > > > not published to Maven Central.
>> > > > > >
>> > > > > > === Meritocracy ===
>> > > > > >
>> > > > > > DataSketches was initially developed based on requirements
>> within
>> > > > Yahoo. As
>> > > > > > a project on GitHub, DataSketches has received contributions
>> from
>> > > > numerous
>> > > > > > individual developers from around the world, dedicated research
>> work
>> > > > from
>> > > > > > senior scientists at Amazon and Visa, and academic researchers
>> from
>> > > > > > Georgetown University, Princeton, and MIT.
>> > > > > >
>> > > > > > As a project under incubation, we are committed to expanding
>> our
>> > > > effort to
>> > > > > > build an environment which supports a meritocracy. We are
>> focused on
>> > > > > > engaging the community and other related projects for support
>> and
>> > > > > > contributions. Moreover, we are committed to ensuring
>> contributors
>> and
>> > > > > > committers to DataSketches come from a broad mix of
>> organizations
>> > > > through a
>> > > > > > merit-based decision process during incubation. We believe
>> strongly
>> > > in
>> > > > the
>> > > > > > DataSketches premise that fulfills the concept of a well
>> engineered
>> > > and
>> > > > > > scientifically rigorous library that implements these powerful
>> > > > algorithms
>> > > > > > and are committed to growing an inclusive community of
>> DataSketches
>> > > > > > contributors and users.
>> > > > > >
>> > > > > > === Community ===
>> > > > > >
>> > > > > > Yahoo has a long history and active engagement in the Open
>> Source
>> > > > > > community. Major projects include: Vespa.ai, Bullet, Moloch,
>> > > Panoptes,
>> > > > > > Screwdriver.cd, Athenz, HaloDB, Maha, Mendel,
>> TensorFlowOnSpark,
>> > > > gifshot,
>> > > > > > fluxible, as well as the creation, contribution and incubation
>> of
>> > > many
>> > > > > > Apache projects such as Apache Hadoop, Pig, Bookkeeper, Oozie,
>> > > > Zookeeper,
>> > > > > > Omid, Pulsar, Traffic Server, Storm, Druid, and many more.
>> > > > > >
>> > > > > > Every day, DataSketches is actively used by organizations and
>> > > > > > institutions around the world for batch and stream processing
>> of
>> > > data.
>> > > > We
>> > > > > > believe acceptance will allow us to consolidate existing
>> > > > > > DataSketches-related work, grow the DataSketches community, and
>> > > deepen
>> > > > > > connections between DataSketches and other open source
>> projects.
>> > > > > >
>> > > > > > === Introduction to the Core Developers & Contributors ===
>> > > > > >
>> > > > > > The core developers and contributors for DataSketches are from
>> > > diverse
>> > > > > > backgrounds, but primarily are scientists that love engineering
>> and
>> > > > > > engineers that love science. A large part of the value we bring
>> comes
>> > > > from
>> > > > > > this synthesis.  These individuals have already contributed
>> > > > substantially
>> > > > > > to the code, algorithms, and/or mathematical proofs that form
>> the
>> > > > basis of
>> > > > > > the library.
>> > > > > >
>> > > > > > This core group also forms the Initial Committers with write
>> > > > > > permissions to the repository. Those marked with (*) meet weekly to
>> > > > > > plan the research and engineering direction of the project.
>> > > > > >
>> > > > > > ==== Scientists That Love Engineering ====
>> > > > > >
>> > > > > > * Eshcar Hillel: Senior Research Scientist, Yahoo Labs, Israel.
>> > > > Interests:
>> > > > > > distributed systems, scalable systems and platforms for big
>> data
>> > > > > > processing, concurrent algorithms and data structures.
>> > > > > >
>> > > > > > * Kevin Lang: (*) Distinguished Research Scientist, Yahoo Labs,
>> > > > Sunnyvale,
>> > > > > > California. Interests: algorithms, theoretical and applied
>> > > mathematics,
>> > > > > > encoding and compression theory, theoretical and applied
>> performance
>> > > > > > optimization.
>> > > > > >
>> > > > > > * Edo Liberty: (*) Director of Research, Head of Amazon AI
>> Labs,
>> Palo
>> > > > Alto,
>> > > > > > California. Manages the algorithms group at Amazon AI. We build
>> > > > scalable
>> > > > > > machine learning systems and algorithms which are used both
>> > > internally
>> > > > and
>> > > > > > externally by customers of SageMaker, AWS's flagship machine
>> learning
>> > > > > > platform.
>> > > > > >
>> > > > > > * Jon Malkin: (*) Senior Scientist, Yahoo Labs, Sunnyvale.
>> Interests:
>> > > > > > Computational advertising, machine learning, speech
>> recognition,
>> > > > > > data-driven analysis, large scale experimentation, big data,
>> > > > stream/complex
>> > > > > > event processing.
>> > > > > >
>> > > > > > * Justin Thaler: (*) Assistant Professor, Department of
>> Computer
>> > > > Science,
>> > > > > > Georgetown University, Washington D.C. Interests: algorithms
>> and
>> > > > > > computational complexity, complexity theory, quantum
>> algorithms,
>> > > > private
>> > > > > > data analysis, and learning theory, developing efficient
>> streaming
>> > > and
>> > > > > > sketching algorithms.
>> > > > > >
>> > > > > > ==== Engineers That Love Science ====
>> > > > > >
>> > > > > > * Roman Leventov: Senior Software Engineer,  Metamarkets /
>> Snap.
>> > > > Interests:
>> > > > > > design and implementation of data storing and data processing
>> > > > (distributed)
>> > > > > > systems, performance optimization, CPU performance, mechanical
>> > > > sympathy,
>> > > > > > JVM performance, API design, databases, (concurrent) data
>> structures,
>> > > > > > memory management, garbage collection algorithms, language
>> design and
>> > > > > > runtimes (their tradeoffs), distributed systems (cloud)
>> efficiency,
>> > > > Linux,
>> > > > > > code quality, code transformation, pure functional programming
>> > > models,
>> > > > > > Haskell.
>> > > > > >
>> > > > > > * Lee Rhodes: (*) Distinguished Architect, lead developer and
>> founder
>> > > > of
>> > > > > > the DataSketches project, Yahoo, Sunnyvale, California.
>> Interests:
>> > > > > > streaming algorithms, mathematics, computer science, high
>> quality and
>> > > > high
>> > > > > > performance code for the analysis of massive data, bridging the
>> > > divide
>> > > > > > between theory and practice.
>> > > > > >
>> > > > > > * Alexander Saydakov: (*) Senior Software Engineer, Yahoo,
>> Sunnyvale,
>> > > > > > California. Interests: applied mathematics, computer science,
>> big
>> > > data,
>> > > > > > distributed systems.
>> > > > > >
>> > > > > > === Introduction to Additional Interested Contributors ===
>> > > > > >
>> > > > > > These folks have been intermittently involved and contributed,
>> but
>> > > are
>> > > > > > strong supporters of this project.
>> > > > > >
>> > > > > > * Frank Grimes: GitHub ID: frankgrimes97
>> > > > > >
>> > > > > > * Mina Ghashami: [mina.ghashami at gmail dot com] Ph.D.
>> Computer
>> > > > Science,
>> > > > > > Univ of Utah. Interests: Machine Learning, Data Mining, matrix
>> > > > > > approximation, streaming algorithms, randomized linear algebra.
>> > > > > >
>> > > > > > * Christopher Musco: [christopher.musco at gmail dot com] Ph.D.
>> > > > Computer
>> > > > > > Science, Research Instructor, Princeton University. Interests:
>> > > > algorithmic
>> > > > > > foundations of data science and machine learning, efficient
>> methods
>> > > for
>> > > > > > processing and understanding large datasets, often working at
>> the
>> > > > > > intersection of theoretical computer science, numerical linear
>> > > > algebra, and
>> > > > > > optimization.
>> > > > > >
>> > > > > > * Graham Cormode: [g.cormode at warwick.ac dot uk] Ph.D.
>> Computer
>> > > > Science,
>> > > > > > Professor, Warwick University, Warwick, England. Interests: all
>> > > > aspects of
>> > > > > > the "data lifecycle", from data collection and cleaning,
>> through
>> > > > mining and
>> > > > > > analytics. (Professor Cormode is one of the world’s leading
>> > > scientists
>> > > > in
>> > > > > > sketching algorithms)
>> > > > > >
>> > > > > > === Alignment ===
>> > > > > >
>> > > > > > The DataSketches library already provides integrations and
>> example
>> > > > code for
>> > > > > > Apache Hive, Apache Pig, Apache Spark and is deeply integrated
>> into
>> > > > Apache
>> > > > > > Druid.
>> > > > > >
>> > > > > > == Known Risks ==
>> > > > > >
>> > > > > > The following subsections are specific risks that have been
>> > > identified
>> > > > by
>> > > > > > the ASF that need to be addressed.
>> > > > > >
>> > > > > > === Risk: Orphaned Products ===
>> > > > > >
>> > > > > > The DataSketches library is presently used by a number of
>> > > > organizations,
>> > > > > > from small startups to Fortune 100 companies, to construct
>> production
>> > > > > > pipelines that must process and analyze massive data. Yahoo has
>> a
>> > > > long-term
>> > > > > > commitment to continue to advance the DataSketches library;
>> moreover,
>> > > > > > DataSketches is seeing increasing interest, development, and
>> adoption
>> > > > from
>> > > > > > many diverse organizations from around the world. Due to its
>> growing
>> > > > > > adoption, we feel it is quite unlikely that this project would
>> become
>> > > > > > orphaned.
>> > > > > >
>> > > > > > === Risk: Inexperience with Open Source ===
>> > > > > >
>> > > > > > Yahoo believes strongly in open source and the exchange of
>> > > information
>> > > > to
>> > > > > > advance new ideas and work. Examples of this commitment are
>> active
>> > > open
>> > > > > > source projects such as those mentioned above. With
>> DataSketches, we
>> > > > have
>> > > > > > been increasingly open and forward-looking; we have published a
>> > > number
>> > > > of
>> > > > > > papers about breakthrough developments in the science of
>> streaming
>> > > > > > algorithms (mentioned above) that also reference the
>> DataSketches
>> > > > library.
>> > > > > > Our submission to the Apache Software Foundation is a logical
>> > > > extension of
>> > > > > > our commitment to open source software.
>> > > > > >
>> > > > > > Key committers at Yahoo with strong open source backgrounds
>> include
>> > > > Aaron
>> > > > > > Gresch, Alan Carroll, Alessandro Bellina, Anastasia Braginsky,
>> > > Andrews
>> > > > > > Sahaya Albert, Arun S A G, Atul Mohan, Brad McMillen, Bryan
>> Call,
>> > > Daryn
>> > > > > > Sharp, Dav Glass, David Carlin, Derek Dagit, Eric Payne, Eshcar
>> > > Hillel,
>> > > > > > Ethan Li, Fei Deng, Francis Christopher Liu, Francisco
>> > > Perez-Sorrosal,
>> > > > Gil
>> > > > > > Yehuda, Govind Menon, Hang Yang, Jacob Estelle, Jai Asher,
>> James
>> > > > Penick,
>> > > > > > Jason Kenny, Jay Pipes, Jim Rollenhagen, Joe Francis, Jon
>> Eagles,
>> > > > Kihwal
>> > > > > > Lee, Kishorkumar Patil, Koji Noguchi, Kuhu Shukla, Michael
>> Trelinski,
>> > > > > > Mithun Radhakrishnan, Nathan Roberts, Ohad Shacham, Olga L.
>> > > Natkovich,
>> > > > > > Parth Kamlesh Gandhi, Rajan Dhabalia, Rohini Palaniswamy, Ruby
>> Loo,
>> > > > Ryan
>> > > > > > Bridges, Sanket Chintapalli, Satish Subhashrao Saley, Shu Kit
>> Chan,
>> > > Sri
>> > > > > > Harsha Mekala, Susan Hinrichs, Yonatan Gottesman, and many
>> more.
>> > > > > >
>> > > > > > All of our core developers are committed to learning about the
>> Apache
>> > > > process
>> > > > > > and to give back to the community.
>> > > > > >
>> > > > > > === Risk: Homogeneous Developers ===
>> > > > > >
>> > > > > > The majority of committers in this proposal belong to Yahoo due
>> to
>> > > the
>> > > > fact
>> > > > > > that DataSketches has emerged from an internal Yahoo project.
>> This
>> > > > proposal
>> > > > > > also includes developers and contributors from other companies,
>> and
>> > > > who are
>> > > > > > actively involved with other Apache projects, such as Druid. 
>> We
>> > > > expect our
>> > > > > > entry into incubation will allow us to expand the number of
>> > > > individuals and
>> > > > > > organizations participating in DataSketches development.
>> > > > > >
>> > > > > > === Risk: Reliance on Salaried Developers ===
>> > > > > >
>> > > > > > Because the DataSketches library originated within Yahoo, it
>> has
>> been
>> > > > > > developed primarily by salaried Yahoo developers and we expect
>> that
>> > > to
>> > > > > > continue to be the case near term. However, since we placed
>> this
>> > > > library
>> > > > > > into open-source we have had a number of significant
>> contributions
>> > > from
>> > > > > > engineers and scientists from outside of Yahoo. We expect our
>> > > reliance
>> > > > on
>> > > > > > Yahoo salaried developers will decrease over time. Nonetheless,
>> Yahoo
>> > > > is
>> > > > > > committed to continue its strong support of this important
>> project.
>> > > > > >
>> > > > > > === Risk: Lack of Relationship to other Apache Products ===
>> > > > > >
>> > > > > > DataSketches already directly interoperates with or utilizes
>> several
>> > > > > > existing Apache projects.
>> > > > > >
>> > > > > > * Build
>> > > > > >    * Apache Maven
>> > > > > >
>> > > > > > * Integrations and adaptors for the following projects
>> naturally
>> have
>> > > > them
>> > > > > > as dependencies
>> > > > > >    * Apache Hive
>> > > > > >    * Apache Pig
>> > > > > >    * Apache Druid
>> > > > > >    * Apache Spark
>> > > > > >
>> > > > > > * Additional dependencies for the above integrations and
>> adaptors
>> > > > include
>> > > > > >    * Apache Hadoop
>> > > > > >    * Apache Commons (Math)
>> > > > > >
>> > > > > > There is no other Apache project that we are aware of that
>> duplicates
>> > > > the
>> > > > > > functionality of the DataSketches library.
>> > > > > >
>> > > > > > === Risk: An Excessive Fascination with the Apache Brand ===
>> > > > > >
>> > > > > > With this proposal we are not seeking attention or publicity.
>> Rather,
>> > > > we
>> > > > > > firmly believe in the DataSketches library and concept and the
>> > > ability
>> > > > to
>> > > > > > make the DataSketches library a powerful, yet simple-to-use
>> toolkit
>> > > for
>> > > > > > data processing. While the DataSketches library has been open
>> source,
>> > > > we
>> > > > > > believe putting code on GitHub can only go so far. We see the
>> Apache
>> > > > > > community, processes, and mission as critical for ensuring the
>> > > > DataSketches
>> > > > > > library is truly community-driven, positively impactful, and
>> > > innovative
>> > > > > > open source software. While Yahoo has taken a number of steps
>> to
>> > > > advance
>> > > > > > its various open source projects, we believe the DataSketches
>> library
>> > > > > > project is a great fit for the Apache Software Foundation due
>> to
>> its
>> > > > focus
>> > > > > > on data processing and its relationships to existing ASF
>> projects.
>> > > > > >
>> > > > > > === Risk: Cryptography ===
>> > > > > >
>> > > > > > DataSketches does not contain any cryptographic code and is not
>> a
>> > > > > > cryptographic product.
>> > > > > >
>> > > > > > == Documentation ==
>> > > > > >
>> > > > > > The following documentation is relevant to this proposal.
>> Relevant
>> > > > portions
>> > > > > > of the documentation will be contributed to the Apache
>> DataSketches
>> > > > > > project.
>> > > > > >
>> > > > > > * DataSketches website: https://datasketches.github.io.
>> > > > > >
>> > > > > > * DataSketches website repository:
>> > > > > > https://github.com/DataSketches/DataSketches.github.io
>> > > > > >
>> > > > > > We will need an Apache website for this documentation similar
>> to
>> > > > > >
>> > > > > > * https://datasketches.apache.org
>> > > > > >
>> > > > > > == Initial Source ==
>> > > > > >
>> > > > > > The initial source for DataSketches which we will submit to the
>> > > Apache
>> > > > > > Foundation will include a number of repositories which are
>> currently
>> > > > hosted
>> > > > > > under the GitHub.com/datasketches organization:
>> > > > > >
>> > > > > > All github.com/datasketches repositories including:
>> > > > > >
>> > > > > > * Java
>> > > > > >    * sketches-core: This repository has the core sketching
>> classes,
>> > > > which
>> > > > > > are leveraged by some of the other repositories. This
>> repository
>> has
>> > > no
>> > > > > > external dependencies outside of the DataSketches/memory
>> repository,
>> > > > Java
>> > > > > > and TestNG for unit tests. This code is versioned and the
>> latest
>> > > > release
>> > > > > > can be obtained from Maven Central.
>> > > > > >    * memory: Low level, high-performance memory data-structure
>> > > > management
>> > > > > > primarily for off-heap.
>> > > > > >    * sketches-android: This is a new repository dedicated to
>> sketches
>> > > > > > designed to be run in a mobile client, such as a cell phone. It
>> is
>> > > > still in
>> > > > > > development and should be considered experimental.
>> > > > > >    * sketches-hive: This repository contains Hive UDFs and
>> UDAFs
>> for
>> > > > use
>> > > > > > within Hadoop grid environments. This code has dependencies on
>> > > > > > sketches-core as well as Hadoop and Hive. Users of this code
>> are
>> > > > advised to
>> > > > > > use Maven to bring in all the required dependencies. This code
>> is
>> > > > versioned
>> > > > > > and the latest release can be obtained from Maven Central.
>> > > > > >    * sketches-pig: This repository contains Pig User Defined
>> > > Functions
>> > > > > > (UDF) for use within Hadoop grid environments. This code has
>> > > > dependencies
>> > > > > > on sketches-core as well as Hadoop and Pig. Users of this code
>> are
>> > > > advised
>> > > > > > to use Maven to bring in all the required dependencies. This
>> code is
>> > > > > > versioned and the latest release can be obtained from Maven
>> Central.
>> > > > > >    * sketches-vector: This is a new repository dedicated to
>> sketches
>> > > > for
>> > > > > > vector and matrix operations. It is still somewhat
>> experimental.
>> > > > > >    * characterization: This relatively new repository is for
>> code
>> > > that
>> > > > we
>> > > > > > use to characterize the accuracy and speed performance of the
>> > > sketches
>> > > > in
>> > > > > > the library and is constantly being updated. Examples of the
>> job
>> > > > command
>> > > > > > files used for various tests can be found in the
>> src/main/resources
>> > > > > > directory. Some of these tests can run for hours depending on
>> their
>> > > > > > configuration.
>> > > > > >    * experimental: This repository is an experimental staging
>> area
>> > > for
>> > > > code
>> > > > > > that will eventually end up in another repository. This code is
>> not
>> > > > > > versioned and not registered with Maven Central.
>> > > > > >    * sketches-misc: Demos and other code not related to
>> production
>> > > > > > deployment
>> > > > > >
>> > > > > > * C++ and Python
>> > > > > >    * sketches-core-cpp: This is the C++/Python companion to the
>> Java
>> > > > > > sketches-core. These implementations are binary compatible with
>> their
>> > > > > > counterparts in Java. In other words, a sketch created and
>> stored in
>> > > > C++
>> > > > > > can be opened and read in Java and vice versa. This site also
>> has our
>> > > > > > Python adaptors that basically wrap the C++ implementations,
>> making
>> > > the
>> > > > > > high performance C++ implementations available from Python.
>> > > > > >    * sketches-postgres: This site provides the
>> postgres-specific
>> > > > adaptors
>> > > > > > that wrap the C++ implementations making them available to the
>> > > Postgres
>> > > > > > database users.
>> > > > > >    * characterization-cpp: This is the C++/Python companion to
>> the
>> > > Java
>> > > > > > characterization repository.
>> > > > > >    * experimental-cpp: This repository is an experimental
>> staging
>> > > area
>> > > > for
>> > > > > > C++ code that will eventually end up in another repository.
>> > > > > >
>> > > > > > * Command-Line Tools
>> > > > > >    * sketches-cmd
>> > > > > >    * homebrew-sketches
>> > > > > >    * homebrew-sketches-cmd
>> > > > > >
>> > > > > > These projects have always been Apache 2.0 licensed. We intend
>> to
>> > > > bundle
>> > > > > > all of these repositories since they are all complementary and
>> should
>> > > > be
>> > > > > > maintained in one project. Prior to our submission, we will
>> combine
>> > > > all of
>> > > > > > these projects into a new git repository.
>> > > > > >
>> > > > > > == Source and Intellectual Property Submission Plan ==
>> > > > > >
>> > > > > > Contributors to the DataSketches project have also signed the
>> Yahoo
>> > > > > > Individual Contributor License Agreement (
>> > > > https://yahoocla.herokuapp.com/)
>> > > > > > in order to contribute to the project.
>> > > > > >
>> > > > > > With respect to trademark rights, Yahoo does not hold a
>> trademark on
>> > > > the
>> > > > > > phrase “DataSketches.” Based on feedback and guidance we
>> receive
>> > > > during the
>> > > > > > incubation process, we are open to renaming the project if
>> necessary
>> > > > for
>> > > > > > trademark or other concerns, but we would prefer not to have to
>> do
>> > > > that.
>> > > > > >
>> > > > > > == External Dependencies ==
>> > > > > >
>> > > > > > All external dependencies are licensed under an Apache 2.0 or
>> > > > > > Apache-compatible license. As we grow the DataSketches
>> community
>> we
>> > > > will
>> > > > > > configure our build process to require and validate all
>> contributions
>> > > > and
>> > > > > > dependencies are licensed under the Apache 2.0 license or are
>> under
>> > > an
>> > > > > > Apache-compatible license.
>> > > > > >
>> > > > > > == Required Resources ==
>> > > > > >
>> > > > > > === Mailing Lists ===
>> > > > > >
>> > > > > > We currently use a mix of mailing lists. We will migrate our
>> existing
>> > > > > > mailing lists to the following:
>> > > > > >
>> > > > > > * dev@.apache
>> > > > > >
>> > > > > > * user@.apache
>> > > > > >
>> > > > > > * private@.apache
>> > > > > >
>> > > > > > * commits@.apache
>> > > > > >
>> > > > > > === Source Control ===
>> > > > > >
>> > > > > > The DataSketches team currently uses Git and would like to
>> continue
>> > > to
>> > > > do
>> > > > > > so. We request a Git repository for DataSketches with mirroring
>> to
>> > > > GitHub
>> > > > > > enabled similar to the following:
>> > > > > >
>> > > > > > * https://github.com/apache/incubator-datasketches.git
>> > > > > >
>> > > > > > === Issue Tracking ===
>> > > > > >
>> > > > > > We request the creation of an Apache-hosted JIRA. The
>> DataSketches
>> > > > project
>> > > > > > is currently using the public GitHub issue tracker and the
>> public
>> > > > Google
>> > > > > > Groups forum/sketches-user for issue tracking and discussions.
>> We
>> > > will
>> > > > > > migrate and combine the content of these two sources into the Apache JIRA.
>> > > > > >
>> > > > > > Proposed Jira ID: DATASKETCHES
>> > > > > >
>> > > > > > == Initial Committers ==
>> > > > > >
>> > > > > > The following list of individuals have been extremely active in
>> our
>> > > > > > community and should have write (commit) permissions to the
>> > > repository.
>> > > > > >
>> > > > > > * Eshcar Hillel                      [eshcar at verizonmedia
>> dot
>> com]
>> > > > > >
>> > > > > > * Kevin Lang                    [langk at verizonmedia dot com]
>> > > > > >
>> > > > > > * Roman Leventov              [roman.leventov at c.metamarkets
>> dot
>> > > com]
>> > > > > >
>> > > > > > * Edo Liberty                   [libertye at amazon dot com]
>> > > > > >
>> > > > > > * Jon Malkin                    [jmalkin at verizonmedia dot
>> com]
>> > > > > >
>> > > > > > * Lee Rhodes                  [lrhodes at verizonmedia dot com]
>> &
>> > > > [leerho
>> > > > > > at gmail dot com]
>> > > > > >
>> > > > > > * Alexander Saydakov         [saydakov at verizonmedia dot com]
>> > > > > >
>> > > > > > * Justin Thaler                 [justin.thaler at georgetown
>> dot
>> edu]
>> > > > > >
>> > > > > > == Affiliations ==
>> > > > > >
>> > > > > > The initial committers are from four organizations: Yahoo,
>> Amazon,
>> > > > > > Georgetown University, and Metamarkets/Snap.
>> > > > > >
>> > > > > > === Champion ===
>> > > > > > (Recommended to me: )
>> > > > > >
>> > > > > > Liang Chen, Vice President of Apache CarbonData, [chenliang613
>> at
>> > > > apache
>> > > > > > dot org]
>> > > > > > Jean-Baptiste Onofré, [jb at nanthrax dot net]
>> > > > > >
>> > > > > > === Nominated Mentors ===
>> > > > > > (Recommended to me: )
>> > > > > >
>> > > > > > Liang Chen, Vice President of Apache CarbonData, [chenliang613
>> at
>> > > > apache
>> > > > > > dot org]
>> > > > > > Jean-Baptiste Onofré, jb at nanthrax dot net
>> > > > > > Gil Yehuda, gyehuda at verizonmedia dot com
>> > > > > >
>> > > > > > === Sponsoring Entity ===
>> > > > > >
>> > > > > > * The Apache Incubator    **** This is our 1st choice ****
>> > > > > >
>> > > > > > * Apache Druid. The incubating Apache Druid project might also
>> be a
>> > > > logical
>> > > > > > sponsor. However, DataSketches has applications in many areas
>> of
>> > > > computing
>> > > > > > outside of Druid so our preference and recommendation is that
>> > > > DataSketches
>> > > > > > would ultimately be a top-level Apache project.
>> > > > > >
>> > > > > > ________________
>> > > > > > [1] In 2017 Verizon acquired Yahoo and merged it with
>> previously
>> > > > acquired
>> > > > > > AOL. The merged entity was originally called Oath, Inc., but
>> has
>> > > > recently
>> > > > > > been renamed Verizon Media, Inc., a wholly-owned subsidiary of
>> > > Verizon,
>> > > > > > Inc.  Since Yahoo is the more recognized name, references in
>> this
>> > > > document
>> > > > > > to Yahoo, are also a reference to Verizon Media, Inc.
>> > > > > >
>> > > > > > On Fri, Feb 22, 2019 at 9:35 PM Kenneth Knowles <kenn@> wrote:
>> > > > > >
>> > > > > > > The subject line has me interested already. Follow examples
>> like
>> > > this
>> > > > > > > maybe?
>> > > > > > >
>> > > > > > > 1.
>> > > > > > >
>> > > > > > >
>> > > > > >
>> > > >
>> > >
>> https://lists.apache.org/thread.html/a5db74cc9e5ae89b3bfa5f4b07bfcc18dae84b7098232fb897cd47b7@%3Cgeneral.incubator.apache.org%3E
>> > > > > > > 2.
>> > > > > > >
>> > > > > > >
>> > > > > >
>> > > >
>> > >
>> https://lists.apache.org/thread.html/5a7f6a218b11a1cac61fbd53f4c995fd7716f8ad3751cf9f171ebd57@%3Cgeneral.incubator.apache.org%3E
>> > > > > > >
>> > > > > > > Kenn
>> > > > > > >
>> > > > > > > On Fri, Feb 22, 2019 at 8:05 PM leerho <leerho@> wrote:
>> > > > > > >
>> > > > > > > > I'll try again ... :)
>> > > > > > > >
>> > > > > > > > On Fri, Feb 22, 2019 at 8:00 PM Ted Dunning <ted.dunning@> wrote:
>> > > > > > > >
>> > > > > > > >> It didn't make it again
>> > > > > > > >>
>> > > > > > > >> On Fri, Feb 22, 2019, 8:35 PM leerho <leerho@> wrote:
>> > > > > > > >>
>> > > > > > > >> > I'm not sure the attached document made it through.
>> > > > > > > >> >
>> > > > > > > >> > On Fri, Feb 22, 2019 at 7:28 PM leerho <leerho@> wrote:
>> > > > > > > >> >
>> > > > > > > >> > >
>> > > > > > > >> > >
>> > > > > > > >> >
>> > > > > > > >>
>> > > > > > > >
>> > > > > > > >
>> > > >
>> > > > > > > > ---------------------------------------------------------------------
>> > > > > > > > To unsubscribe, e-mail: general-unsubscribe@.apache
>> > > > > > > > For additional commands, e-mail: general-help@.apache
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > > >
>> > > >
>> > > >
>> > >
>> > --
>> > From my cell phone.
>> >
>>
>>
>>





--
Sent from: http://apache-incubator-general.996316.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
For additional commands, e-mail: general-help@incubator.apache.org

Re: DataSketches Proposal - Google Docs Link

Posted by leerho <le...@gmail.com>.

Adding individuals seems to be working.  I was able to add kenn@apache.org
successfully, and he successfully added a comment.  Casual readers can get
the gist from the text version I inserted into this thread.  Those who
wish to make comments can make a request via the link and I will add them.

Will this work for now?

On Mon, Feb 25, 2019 at 3:30 PM Luciano Resende <lu...@gmail.com>
wrote:

> Should we move the proposal to the incubator wiki then?
>
> On Mon, Feb 25, 2019 at 15:26 leerho <le...@gmail.com> wrote:
>
> > Kenn,
> > Yahoo does not allow me to create a shared link outside our company,
> except
> > to individual email addresses.  So attempting to share it to the email
> > general@incubator.apache.org may not work.  Nonetheless, several
> > individuals were able to request access using their individual email
> > accounts and I was able to add them.  I will try to add you using
> > kenn@apache.org, but if that doesn't work, I may need a gmail or
> > equivalent
> > account for you.
> >
> > Lee.
> >
> >
> > On Mon, Feb 25, 2019 at 2:59 PM Kenneth Knowles <ke...@apache.org> wrote:
> >
> > > I could not access that document. I suggest you need to turn on link
> > > sharing.
> > >
> > > Kenn
> > >
> > > On Mon, Feb 25, 2019 at 12:00 PM leerho@gmail.com <le...@gmail.com>
> > > wrote:
> > >
> > > > Try this link:
> > > >
> > >
> >
> https://docs.google.com/document/d/19JKevzFQNcaLA51LFLUlP1hzdFDW7oDJrJO8N6weDv8/edit?usp=sharing
> > > >
> > > >
> > > > On 2019/02/25 05:55:50, leerho <le...@gmail.com> wrote:
> > > > > Yes I will try that tomorrow.
> > > > >
> > > > > On Sun, Feb 24, 2019 at 7:34 PM Kenneth Knowles <ke...@apache.org>
> > > wrote:
> > > > >
> > > > > > Can you share the Google doc with the proposal? Per Ted's advice,
> > we
> > > > can
> > > > > > iterate quickly there and move it to the wiki when it becomes a
> bit
> > > > more
> > > > > > stable.
> > > > > >
> > > > > > Kenn
> > > > > >
> > > > > > On Fri, Feb 22, 2019 at 10:21 PM leerho@gmail.com <
> > leerho@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > > Thanks for the offer.  I am a neophyte at this process and
> email
> > > > app!   I
> > > > > > > could use a lot of help getting this off the ground!  Also, I'm
> > not
> > > > sure
> > > > > > > that Mr. Chen and Mr. Onofré have fully accepted taking this on
> > :)
> > > > > > >
> > > > > > > Lee.
> > > > > > >
> > > > > > > On 2019/02/23 06:03:58, Kenneth Knowles <ke...@apache.org>
> wrote:
> > > > > > > > Nice.
> > > > > > > >
> > > > > > > > I would very much like to help mentor this project, though
> you
> > > > already
> > > > > > > have
> > > > > > > > a couple good ones.
> > > > > > > >
> > > > > > > > I concur with incubator as sponsoring entity.
> > > > > > > >
> > > > > > > > Kenn (VP Apache Beam)
> > > > > > > >
> > > > > > > > On Fri, Feb 22, 2019 at 9:45 PM leerho <le...@gmail.com>
> > wrote:
> > > > > > > >
> > > > > > > > > I didn't realize that this mail list does not accept PDF
> > files,
> > > > > > > apparently
> > > > > > > > > only text.  So let me try one more time ... :)  Please let
> me
> > > > know if
> > > > > > > > > this works!
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > = Apache DataSketches Proposal[1] =
> > > > > > > > >
> > > > > > > > > == Abstract ==
> > > > > > > > >
> > > > > > > > > DataSketches.GitHub.io is an open source, high-performance
> > > > library
> > > > > > of
> > > > > > > > > stochastic streaming algorithms commonly called "sketches"
> in
> > > the
> > > > > > data
> > > > > > > > > sciences. Sketches are small, stateful programs that
> process
> > > > massive
> > > > > > > data
> > > > > > > > > as a stream and can provide approximate answers, with
> > > > mathematical
> > > > > > > > > guarantees, to computationally difficult queries
> > > > orders-of-magnitude
> > > > > > > faster
> > > > > > > > > than traditional, exact methods.
> > > > > > > > >
> > > > > > > > > This proposal is to move DataSketches to the Apache
> Software
> > > > > > > > > > Foundation (ASF), transferring ownership of its copyright
> > > > intellectual
> > > > > > > > > property to the ASF.  Thereafter, DataSketches would be
> > > > officially
> > > > > > > known as
> > > > > > > > > Apache DataSketches and its evolution and governance would
> > come
> > > > under
> > > > > > > the
> > > > > > > > > rules and guidance of the ASF.
> > > > > > > > >
> > > > > > > > > == Introduction ==
> > > > > > > > >
> > > > > > > > > The DataSketches library contains carefully crafted
> > > > implementations
> > > > > > of
> > > > > > > > > sketch algorithms that meet rigorous standards of quality
> and
> > > > > > > performance
> > > > > > > > > and provide capabilities required for large-scale
> production
> > > > systems
> > > > > > > that
> > > > > > > > > must process and analyze massive data. The DataSketches
> core
> > > > > > > repository is
> > > > > > > > > written in Java with a parallel core repository written in
> > C++
> > > > that
> > > > > > > > > includes Python wrappers. The DataSketches library also
> > > includes
> > > > > > > special
> > > > > > > > > repositories for extending the core library for Apache Hive
> > and
> > > > > > Apache
> > > > > > > Pig.
> > > > > > > > > The sketches developed in the different languages share a
> > > common
> > > > > > binary
> > > > > > > > > storage format so that sketches created and stored in Java,
> > for
> > > > > > > example,
> > > > > > > > > > can be fully used in C++, and vice versa.  Because the
> stored
> > > > sketch
> > > > > > > > > "images" are just a "blob" of bytes (similar to picture
> > > images),
> > > > they
> > > > > > > can
> > > > > > > > > be shared across many different systems, languages and
> > > platforms.
> > > > > > > > >
> > > > > > > > > The DataSketches documentation website,
> > > > > > https://datasketches.github.io
> > > > > > > ,
> > > > > > > > > includes general tutorials, a comprehensive research
> section
> > > with
> > > > > > > > > references to relevant academic papers, extensive examples
> > for
> > > > using
> > > > > > > the
> > > > > > > > > core library directly as well as examples for accessing the
> > > > library
> > > > > > in
> > > > > > > > > Hive, Pig, and Apache Spark.
> > > > > > > > >
> > > > > > > > > The DataSketches library also includes a characterization
> > > > repository
> > > > > > > for
> > > > > > > > > long running test programs that are used for studying
> > accuracy
> > > > and
> > > > > > > > > performance of these sketches over wide ranges of input
> > > > variables.
> > > > > > The
> > > > > > > data
> > > > > > > > > produced by these programs is used for generating the many
> > > > > > performance
> > > > > > > > > plots contained in the documentation website and for
> academic
> > > > > > > > > publications.
> > > > > > > > >
> > > > > > > > > The code repositories used for production are versioned and
> > > > published
> > > > > > > to
> > > > > > > > > Maven Central on periodic intervals as the library evolves.
> > > > > > > > >
> > > > > > > > > The DataSketches library also includes several experimental
> > > > > > > repositories
> > > > > > > > > for use-cases outside the large-scale systems environments,
> > > such
> > > > as
> > > > > > > > > sketches for mobile, IoT devices (Android), command-line
> > access
> > > > of
> > > > > > the
> > > > > > > > > sketch library, and an experimental repository for
> > vector-based
> > > > > > > sketches
> > > > > > > > > that performs approximate Singular Value Decomposition
> (SVD)
> > > > analysis
> > > > > > > that
> > > > > > > > > could potentially be used in Machine Learning (ML)
> > > applications.
> > > > > > > > >
> > > > > > > > > == Background ==
> > > > > > > > >
> > > > > > > > > > The DataSketches library was started in 2012 as an internal
> > Yahoo
> > > > > > project
> > > > > > > to
> > > > > > > > > dramatically reduce time and resources required for
> distinct
> > > > (unique)
> > > > > > > > > counting.  An extensive search on the Internet at the time
> > > > yielded a
> > > > > > > number
> > > > > > > > > of theoretical papers on stochastic streaming algorithms
> with
> > > > > > > pseudocode
> > > > > > > > > examples, but we did not find any usable open-source code
> of
> > > the
> > > > > > > quality we
> > > > > > > > > felt we needed for our internal production systems.  So we
> > > > started a
> > > > > > > small
> > > > > > > > > project (one person) to develop our own sketches working
> > > directly
> > > > > > from
> > > > > > > > > published theoretical papers.
> > > > > > > > >
> > > > > > > > > The DataSketches library was designed from the start with
> the
> > > > > > > objective of
> > > > > > > > > making these algorithms, usually only described in
> > theoretical
> > > > > > papers,
> > > > > > > > > easily accessible to systems developers for use in our
> > internal
> > > > > > > production
> > > > > > > > > systems. By necessity, the code had to be of the highest
> > > quality
> > > > and
> > > > > > > > > thoroughly tested. The wide variety of our internal
> > production
> > > > > > systems
> > > > > > > > > drove the requirement that the sketch implementations had
> to
> > > > have an
> > > > > > > > > absolute minimum of external, run-time dependencies in
> order
> > to
> > > > > > > simplify
> > > > > > > > > integration and troubleshooting.
> > > > > > > > >
> > > > > > > > > Our internal experiments demonstrated dramatic positive
> > impact
> > > > on the
> > > > > > > > > performance of our systems.  As a result, the DataSketches
> > > > library
> > > > > > > quickly
> > > > > > > > > evolved to include different types of sketches for
> different
> > > > types of
> > > > > > > > > > queries, such as frequent-items (a.k.a. heavy-hitters)
> > > > algorithms,
> > > > > > > > > quantile/histogram algorithms, and weighted and unweighted
> > > > sampling
> > > > > > > > > algorithms.
> > > > > > > > >
> > > > > > > > > We quickly discovered that developing these sketch
> algorithms
> > > to
> > > > be
> > > > > > > truly
> > > > > > > > > robust in production environments is quite difficult and
> > > requires
> > > > > > deep
> > > > > > > > > understanding of the underlying mathematics and statistics
> as
> > > > well as
> > > > > > > > > extensive experience in developing high quality code for
> 24/7
> > > > > > > production
> > > > > > > > > systems. This is a difficult combination of skills for any
> > one
> > > > > > > organization
> > > > > > > > > to collect and maintain over time. It became clear that
> this
> > > > > > technology
> > > > > > > > > > needed a community larger than Yahoo to evolve. In November
> > > > > > > > > > 2015, this factor, along with Yahoo’s strong experience with and
> > > > > > > > > > support of open source, led to the decision to open source this
> > > > > > > > > > technology under an Apache 2.0 license on GitHub. Since that
> > > > > > > > > > time our community has expanded considerably, and the key
> > > > > > > > > > contributors to this effort include leading research scientists
> > > > > > > > > > from a number of universities as well as practitioners and
> > > > > > > > > > researchers from a number of major corporations. The core of
> > > > > > > > > > this group is very active, as we meet weekly to discuss research
> > > > > > > > > > directions and engineering priorities.
> > > > > > > > >
> > > > > > > > > It is important to note that our internal systems at Yahoo
> > use
> > > > the
> > > > > > > current
> > > > > > > > > public GitHub open source DataSketches library and not an
> > > > internal
> > > > > > > version
> > > > > > > > > of the code.
> > > > > > > > >
> > > > > > > > > The close collaboration of scientific research and
> > engineering
> > > > > > > development
> > > > > > > > > experience with actual massive-data processing systems has
> > also
> > > > > > > produced
> > > > > > > > > new research publications in the field of stochastic
> > streaming
> > > > > > > algorithms,
> > > > > > > > > for example:
> > > > > > > > >
> > > > > > > > > > * Daniel Anderson, Pryce Bevan, Kevin J. Lang, Edo Liberty, Lee
> > > > > > > > > > Rhodes, and Justin Thaler. A high-performance algorithm for
> > > > > > > > > > identifying frequent items in data streams. In ACM IMC 2017.
> > > > > > > > > >
> > > > > > > > > > * Anirban Dasgupta, Kevin J. Lang, Lee Rhodes, and Justin
> > > > > > > > > > Thaler. A framework for estimating stream expression
> > > > > > > > > > cardinalities. In EDBT/ICDT Proceedings ’16, pages 6:1–6:17,
> > > > > > > > > > 2016.
> > > > > > > > > >
> > > > > > > > > > * Mina Ghashami, Edo Liberty, and Jeff M. Phillips. Efficient
> > > > > > > > > > Frequent Directions Algorithm for Sparse Matrices. In ACM SIGKDD
> > > > > > > > > > Proceedings ’16, pages 845–854, 2016.
> > > > > > > > > >
> > > > > > > > > > * Zohar S. Karnin, Kevin J. Lang, and Edo Liberty. Optimal
> > > > > > > > > > quantile approximation in streams. In IEEE FOCS Proceedings ’16,
> > > > > > > > > > pages 71–78, 2016.
> > > > > > > > > >
> > > > > > > > > > * Kevin J. Lang. Back to the future: an even more nearly optimal
> > > > > > > > > > cardinality estimation algorithm. arXiv preprint
> > > > > > > > > > https://arxiv.org/abs/1708.06839, 2017.
> > > > > > > > > >
> > > > > > > > > > * Edo Liberty. Simple and deterministic matrix sketching. In ACM
> > > > > > > > > > KDD Proceedings ’13, pages 581–588, 2013.
> > > > > > > > > >
> > > > > > > > > > * Edo Liberty, Michael Mitzenmacher, Justin Thaler, and Jonathan
> > > > > > > > > > Ullman. Space lower bounds for itemset frequency sketches. In
> > > > > > > > > > ACM PODS Proceedings ’16, pages 441–454, 2016.
> > > > > > > > > >
> > > > > > > > > > * Michael Mitzenmacher, Thomas Steinke, and Justin Thaler.
> > > > > > > > > > Hierarchical heavy hitters with the space saving algorithm. In
> > > > > > > > > > SIAM ALENEX Proceedings ’12, pages 160–174, 2012.
> > > > > > > > >
> > > > > > > > > == The Rationale for Sketches ==
> > > > > > > > >
> > > > > > > > > In the analysis of big data there are often problem queries
> > > that
> > > > > > don’t
> > > > > > > > > scale because they require huge compute resources and time
> to
> > > > > > generate
> > > > > > > > > exact results. Examples include count distinct, quantiles,
> > most
> > > > > > > frequent
> > > > > > > > > items, joins, matrix computations, and graph analysis.
> > > > > > > > >
> > > > > > > > > If we can loosen the requirement of “exact” results from
> our
> > > > queries
> > > > > > > and be
> > > > > > > > > satisfied with approximate results, within some well
> > understood
> > > > > > bounds
> > > > > > > of
> > > > > > > > > error, there is an entire branch of mathematics and data
> > > science
> > > > that
> > > > > > > has
> > > > > > > > > evolved around developing algorithms that can produce
> > > approximate
> > > > > > > results
> > > > > > > > > with mathematically well-defined error properties.
> > > > > > > > >
> > > > > > > > > > Adding the requirements that these algorithms be small (compared
> > > > > > > > > > to the size of the input data), sublinear (the size of the
> > > > > > > > > > sketch must grow at a slower rate than the size of the input
> > > > > > > > > > stream), streaming (they can only touch each data item once),
> > > > > > > > > > and mergeable (suitable for distributed processing) defines a
> > > > > > > > > > class of algorithms that can be described as small, stochastic,
> > > > > > > > > > streaming, sublinear, mergeable algorithms, commonly called
> > > > > > > > > > sketches (they also have other names, but we will use the
> > > > > > > > > > term sketches from here on).
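
To make the four properties above concrete, here is a minimal sketch in Python of a K-Minimum-Values (KMV) distinct counter. It is an illustration only, not the DataSketches implementation: the class name `KmvSketch`, its methods, and the choice of SHA-256 as the hash are all assumptions made for this example. The sketch keeps at most k hash values (small, sublinear), looks at each item once (streaming), and can be combined with another sketch (mergeable).

```python
import hashlib
import heapq


class KmvSketch:
    """Toy K-Minimum-Values sketch for approximate distinct counting.

    Hypothetical illustration; not the DataSketches API.
    """

    def __init__(self, k=256):
        self.k = k
        self._heap = []        # negated hashes, so _heap[0] is the largest kept
        self._members = set()  # the hashes currently retained (at most k)

    @staticmethod
    def _hash(item):
        # Map an item to a pseudo-uniform float in [0, 1).
        digest = hashlib.sha256(repr(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2.0**64

    def _offer(self, h):
        if h in self._members:
            return  # duplicate items never change the state
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # Replace the largest retained hash with the smaller new one.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def update(self, item):
        """Streaming: each item is touched exactly once."""
        self._offer(self._hash(item))

    def merge(self, other):
        """Mergeable: fold another sketch's retained hashes into this one."""
        for h in other._members:
            self._offer(h)

    def estimate(self):
        if len(self._heap) < self.k:
            return float(len(self._heap))  # fewer than k distinct items: exact
        return (self.k - 1) / -self._heap[0]
```

Two such sketches built on different partitions of a stream and then merged hold exactly the same state as one sketch fed the whole stream, which is what makes the approach suitable for distributed processing. With k = 256 the relative standard error of the estimate is roughly 1/sqrt(k - 2), or about 6%.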
> > > > > > > > >
> > > > > > > > > To be truly streaming and be able to process data in a
> single
> > > > pass,
> > > > > > > > > sketches must make absolute minimum assumptions about the
> > input
> > > > > > stream.
> > > > > > > > > This is critically important, as there is no “second
> chance”
> > to
> > > > > > > process the
> > > > > > > > > data.
> > > > > > > > >
> > > > > > > > > For example, sketches should not make assumptions about the
> > > > order of
> > > > > > > stream
> > > > > > > > > items, the stream length, the dynamic range of values, or
> the
> > > > > > > distribution
> > > > > > > > > of item occurrence frequencies. Sketches should be tolerant
> > of
> > > > NaNs,
> > > > > > > Nulls
> > > > > > > > > and empty objects. About the only thing that the sketch
> needs
> > > to
> > > > know
> > > > > > > about
> > > > > > > > > the stream is how to extract items from it and what type
> the
> > > > item is,
> > > > > > > e.g.,
> > > > > > > > > is it a numeric value or a string.
> > > > > > > > >
> > > > > > > > > As far as the sketch is concerned, the input stream is a
> > > > sequence of
> > > > > > > items
> > > > > > > > > in some unknown random order with unknown random values.
> > > > > > > > >
> > > > > > > > > > The sketch is essentially a complex state machine and, combined
> > > > > > > > > > with the random input stream, defines a stochastic process. We then
> > apply
> > > > > > > > > probabilistic methods to interpret the states of the
> > stochastic
> > > > > > > process in
> > > > > > > > > order to extract useful information about the input stream
> > > > itself.
> > > > > > The
> > > > > > > > > resulting information will be approximate, but we also use
> > > > additional
> > > > > > > > > probabilistic methods to extract an estimate of the likely
> > > > > > probability
> > > > > > > > > distribution of error.
> > > > > > > > >
> > > > > > > > > > There is a significant scientific contribution here in defining
> > > > > > > > > > the state machine, understanding the resulting stochastic
> > > > > > > > > > process, developing the probabilistic methods, and proving
> > > > > > > > > > mathematically that it all works! This is why the scientific
> > > > > > > > > > contributors to this project are a critical and strategic
> > > > > > > > > > component of our success. The development engineers translate
> > > > > > > > > > the concepts of the proposed state machine and probabilistic
> > > > > > > > > > methods into production-quality code. Even more important, they
> > > > > > > > > > work closely with the scientists, feeding back system and user
> > > > > > > > > > requirements, which leads not only to superior product design,
> > > > > > > > > > but to new science as well. A number of the scientific papers
> > > > > > > > > > our members have published (see above) are a direct result of
> > > > > > > > > > this close collaboration.
> > > > > > > > >
> > > > > > > > > Because sketches are small they can be processed extremely
> > > fast,
> > > > > > often
> > > > > > > many
> > > > > > > > > orders-of-magnitude faster than traditional exact
> > computations.
> > > > For
> > > > > > > > > interactive queries there may not be other viable
> > alternatives,
> > > > and
> > > > > > in
> > > > > > > the
> > > > > > > > > case of real-time analysis, sketches are the only known
> > > solution.
> > > > > > > > >
> > > > > > > > > > For any system that needs to extract useful information from
> > > > > > > > > > massive data, sketches are essential tools that should be tightly
> > integrated
> > > > into
> > > > > > the
> > > > > > > > > system’s analysis capabilities. This technology has helped
> > > Yahoo
> > > > > > > > > successfully reduce data processing times from days to
> hours
> > or
> > > > > > > minutes on
> > > > > > > > > a number of its internal platforms and has enabled
> subsecond
> > > > queries
> > > > > > on
> > > > > > > > > real-time platforms that would have been infeasible without
> > > > sketches.
> > > > > > > > > >
> > > > > > > > > > == The Rationale for Apache DataSketches ==
> > > > > > > > > >
> > > > > > > > > Other open source implementations of sketch algorithms can
> be
> > > > found
> > > > > > on
> > > > > > > the
> > > > > > > > > Internet. However, we have not yet found any open source
> > > > > > > implementations
> > > > > > > > > that are as comprehensive, engineered with the quality
> > required
> > > > for
> > > > > > > > > production systems, and with usable and guaranteed error
> > > > properties.
> > > > > > > > > > Large Internet companies, such as Google and Facebook, have
> > > > > > > > > > published papers on sketching; however, their implementations of
> > > > > > > > > > their published algorithms are proprietary and not available as
> > > > > > > > > > open source.
> > > > > > > > >
> > > > > > > > > The DataSketches library already provides integrations
> with a
> > > > number
> > > > > > of
> > > > > > > > > major Apache data processing platforms such as Apache Hive,
> > > > Apache
> > > > > > Pig,
> > > > > > > > > Apache Spark and Apache Druid, and is also integrated with
> a
> > > > number
> > > > > > of
> > > > > > > > > other open source data processing platforms such as Splice
> > > > Machine,
> > > > > > > GCHQ
> > > > > > > > > Gaffer and PostgreSQL.
> > > > > > > > >
> > > > > > > > > We believe that having DataSketches as an Apache project
> will
> > > > provide
> > > > > > > an
> > > > > > > > > immediate, worthwhile, and substantial contribution to the
> > open
> > > > > > source
> > > > > > > > > community, will have a better opportunity to provide a
> > > meaningful
> > > > > > > > > contribution to both the science and engineering of
> sketching
> > > > > > > algorithms,
> > > > > > > > > and integrate with other Apache projects.  In addition,
> this
> > > is a
> > > > > > > > > significant opportunity for Apache to be the "go-to"
> > > destination
> > > > for
> > > > > > > users
> > > > > > > > > that want to leverage this exciting technology.
> > > > > > > > >
> > > > > > > > > == Initial Goals ==
> > > > > > > > >
> > > > > > > > > > We are breaking our initial goals into short-term (2-6 months)
> > > > > > > > > > and intermediate to long-term (6 months to 2 years):
> > > > > > > > >
> > > > > > > > > Our short-term goals include:
> > > > > > > > >
> > > > > > > > > > * Understanding and adapting to the Apache development process
> > > > > > > > > > and structures.
> > > > > > > > > >
> > > > > > > > > > * Starting to refactor the codebase and moving the code in the
> > > > > > > > > > various DataSketches repositories to the Apache Git repository.
> > > > > > > > > >
> > > > > > > > > > * Continuing development of new features, functions, and fixes.
> > > > > > > > > >
> > > > > > > > > > * Continuing to develop and expand specific sub-projects (e.g.,
> > > > > > > > > > C++ and Python).
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > The intermediate to long term goals include:
> > > > > > > > >
> > > > > > > > > * Completing the design and implementation of the C++
> > sketches
> > > to
> > > > > > > > > complement what is already available in Java, and the
> Python
> > > > wrappers
> > > > > > > of
> > > > > > > > > those C++ sketches.
> > > > > > > > >
> > > > > > > > > * Expanding the C++ build framework to include Windows and
> > the
> > > > > > popular
> > > > > > > > > Linux variants.
> > > > > > > > >
> > > > > > > > > * Continued engagement with the scientific research
> community
> > > on
> > > > the
> > > > > > > > > development of new algorithms for computationally difficult
> > > > problems
> > > > > > > that
> > > > > > > > > heretofore have not had a sketching solution.
> > > > > > > > >
> > > > > > > > > == Current Status ==
> > > > > > > > >
> > > > > > > > > The DataSketches GitHub project has been quite successful.
> > As
> > > of
> > > > > > this
> > > > > > > > > > writing (February 2019), the number of downloads measured by the
> > > Nexus
> > > > > > > > > Repository Manager at https://oss.sonatype.org has grown
> by
> > > > nearly a
> > > > > > > > > factor
> > > > > > > > > of 10 over the past year to about 55 thousand per month.
> The
> > > > > > > > > DataSketches/sketches-core repository has about 560 stars
> and
> > > 141
> > > > > > > forks,
> > > > > > > > > which is pretty good for a highly specialized library.
> > > > > > > > >
> > > > > > > > > === Development Practices ===
> > > > > > > > >
> > > > > > > > > ==== Source Control ====
> > > > > > > > >
> > > > > > > > > All of our developers have extensive experience with Git
> > > version
> > > > > > > control
> > > > > > > > > and follow accepted practices for use of Pull Requests
> (PRs),
> > > > code
> > > > > > > reviews
> > > > > > > > > and commits to master, for example.
> > > > > > > > >
> > > > > > > > > ==== Testing ====
> > > > > > > > >
> > > > > > > > > > Sketches, by their nature, are probabilistic programs and
> > don’t
> > > > > > > necessarily
> > > > > > > > > behave deterministically.  For some of the sketches we
> > > > intentionally
> > > > > > > insert
> > > > > > > > > random noise into the code as this gives us the
> mathematical
> > > > > > properties
> > > > > > > > > that we need to guarantee accuracy.  This can make the
> > behavior
> > > > of
> > > > > > > these
> > > > > > > > > algorithms quite unintuitive and provides significant
> > > challenges
> > > > to
> > > > > > the
> > > > > > > > > developer who wishes to test these algorithms for
> > correctness.
> > > > As a
> > > > > > > result,
> > > > > > > > > > our testing strategy includes two major components: unit tests
> > > > > > > > > > and characterization tests.
> > > > > > > > >
> > > > > > > > > ===== Unit Testing =====
> > > > > > > > >
> > > > > > > > > Our unit tests are primarily quick tests to make sure that
> we
> > > > > > exercise
> > > > > > > all
> > > > > > > > > critical paths in the code and that key branches are
> executed
> > > > > > > correctly. It
> > > > > > > > > is important that they execute relatively fast as they are
> > > > generally
> > > > > > > run on
> > > > > > > > > every code build. The sketches-core repository alone has
> > about
> > > 22
> > > > > > > thousand
> > > > > > > > > statements, over 1300 unit tests and code coverage of about
> > > > 98.2% as
> > > > > > > > > > measured by Atlassian/Clover. Our goal is for every code
> > > > > > > > > > repository that is used in production to have code coverage
> > > > > > > > > > greater than 90%.
> > > > > > > > >
> > > > > > > > > ===== Characterization Testing =====
> > > > > > > > >
> > > > > > > > > > In order to test the probabilistic methods that are used to
> > > > > > > > > > interpret the stochastic behaviors of our sketches, we have a
> > > > > > > > > > separate characterization repository that is dedicated to this.
> > > > > > > > > > Measuring accuracy, for example, requires running thousands of
> > > > > > > > > > trials at each of many different points along the domain axis.
> > > > > > > > > > Each trial compares its estimated results against a known exact
> > > > > > > > > > result, producing an error for that trial. These error
> > > > > > > > > > measurements are then fed into our Quantiles sketch to capture
> > > > > > > > > > the actual distribution of error at that point along the axis.
> > > > > > > > > > We then select quantile contours across all the distributions at
> > > > > > > > > > points along the axis. These contours can then be plotted to
> > > > > > > > > > reveal the shape of the actual error distribution. These
> > > > > > > > > > distributions are not at all Gaussian; in fact, they can be
> > > > > > > > > > quite complex. Nonetheless, these distributions are then checked
> > > > > > > > > > against the statistical guarantees inherent to the specific
> > > > > > > > > > sketch algorithm and its parameters. There are many examples of
> > > > > > > > > > these characterization error distributions on our website. The
> > > > > > > > > > runtimes of these tests can be very long, ranging from many
> > > > > > > > > > minutes to hours, and some can run for days. Currently, we have
> > > > > > > > > > separate characterization repositories for Java and C++ /
> > > > > > > > > > Python.
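
The characterization procedure described above can be sketched roughly as follows. This is a hedged illustration in Python, not code from the characterization repositories: the function names `kmv_estimate` and `error_contours` are hypothetical, the estimator is a simulated K-Minimum-Values distinct counter chosen only because it is short, and exact sorted quantiles stand in for the Quantiles sketch mentioned in the text.

```python
import heapq
import random


def kmv_estimate(n, k, rng):
    """Simulate one trial: estimate n from the k-th smallest of n uniform
    'hash' values (a stand-in for feeding n distinct items to a sketch)."""
    if n < k:
        return float(n)
    kth_smallest = heapq.nsmallest(k, (rng.random() for _ in range(n)))[-1]
    return (k - 1) / kth_smallest


def error_contours(sizes, k, trials, fractions, seed=1):
    """For each stream size, run many trials and return the relative-error
    quantiles at the requested fractions (one contour point per fraction)."""
    rng = random.Random(seed)
    contours = {}
    for n in sizes:
        # Relative error of each trial against the known exact answer n.
        errors = sorted(kmv_estimate(n, k, rng) / n - 1.0 for _ in range(trials))
        contours[n] = [errors[min(trials - 1, int(f * trials))] for f in fractions]
    return contours
```

Calling `error_contours([1000, 4000], k=64, trials=300, fractions=[0.025, 0.5, 0.975])` yields, for each stream size, the lower contour, the median, and the upper contour of relative error. Plotting those values against stream size gives the kind of error-distribution picture the paragraph describes, and each contour can be compared against the bound the algorithm is supposed to guarantee.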
> > > > > > > > >
> > > > > > > > > It is our goal that we perform this characterization
> analysis
> > > > for all
> > > > > > > of
> > > > > > > > > our sketches.  By definition, the code that runs these
> > > > > > characterization
> > > > > > > > > tests is open-source so others can run these tests as well.
> > We
> > > > do
> > > > > > not
> > > > > > > have
> > > > > > > > > formal releases of this code (because it is not production
> > > code)
> > > > and
> > > > > > > it is
> > > > > > > > > not published to Maven Central.
> > > > > > > > >
> > > > > > > > > === Meritocracy ===
> > > > > > > > >
> > > > > > > > > DataSketches was initially developed based on requirements
> > > within
> > > > > > > Yahoo. As
> > > > > > > > > a project on GitHub, DataSketches has received
> contributions
> > > from
> > > > > > > numerous
> > > > > > > > > individual developers from around the world, dedicated
> > research
> > > > work
> > > > > > > from
> > > > > > > > > senior scientists at Amazon and Visa, and academic
> > researchers
> > > > from
> > > > > > > > > Georgetown University, Princeton, and MIT.
> > > > > > > > >
> > > > > > > > > > As a project under incubation, we are committed to expanding our
> > > > > > > > > > effort to build an environment which supports a meritocracy. We
> > > > > > > > > > are focused on engaging the community and other related projects
> > > > > > > > > > for support and contributions. Moreover, we are committed to
> > > > > > > > > > ensuring that contributors and committers to DataSketches come
> > > > > > > > > > from a broad mix of organizations through a merit-based decision
> > > > > > > > > > process during incubation. We believe strongly in the
> > > > > > > > > > DataSketches premise of a well-engineered and scientifically
> > > > > > > > > > rigorous library that implements these powerful algorithms, and
> > > > > > > > > > we are committed to growing an inclusive community of
> > > > > > > > > > DataSketches contributors and users.
> > > > > > > > >
> > > > > > > > > === Community ===
> > > > > > > > >
> > > > > > > > > > Yahoo has a long history of active engagement in the Open Source
> > > > > > > > > > community. Major projects include Vespa.ai, Bullet, Moloch,
> > > > > > > > > > Panoptes, Screwdriver.cd, Athenz, HaloDB, Maha, Mendel,
> > > > > > > > > > TensorFlowOnSpark, gifshot, and fluxible, as well as the
> > > > > > > > > > creation, contribution and incubation of many Apache projects
> > > > > > > > > > such as Apache Hadoop, Pig, BookKeeper, Oozie, ZooKeeper, Omid,
> > > > > > > > > > Pulsar, Traffic Server, Storm, Druid, and many more.
> > > > > > > > >
> > > > > > > > > > Every day, DataSketches is actively used by organizations
> > and
> > > > > > > > > institutions around the world for batch and stream
> processing
> > > of
> > > > > > data.
> > > > > > > We
> > > > > > > > > believe acceptance will allow us to consolidate existing
> > > > > > > > > DataSketches-related work, grow the DataSketches community,
> > and
> > > > > > deepen
> > > > > > > > > connections between DataSketches and other open source
> > > projects.
> > > > > > > > >
> > > > > > > > > === Introduction to the Core Developers & Contributors ===
> > > > > > > > >
> > > > > > > > > The core developers and contributors for DataSketches are
> > from
> > > > > > diverse
> > > > > > > > > backgrounds, but primarily are scientists that love
> > engineering
> > > > and
> > > > > > > > > engineers that love science. A large part of the value we
> > bring
> > > > comes
> > > > > > > from
> > > > > > > > > this synthesis.  These individuals have already contributed
> > > > > > > substantially
> > > > > > > > > to the code, algorithms, and/or mathematical proofs that
> form
> > > the
> > > > > > > basis of
> > > > > > > > > the library.
> > > > > > > > >
> > > > > > > > > > This core group also forms the Initial Committers with write
> > > > > > > > > > permissions to the repository. Those marked with (*) meet weekly
> > > > > > > > > > to plan the research and engineering direction of the project.
> > > > > > > > >
> > > > > > > > > ==== Scientists That Love Engineering ====
> > > > > > > > >
> > > > > > > > > * Eshcar Hillel: Senior Research Scientist, Yahoo Labs,
> > Israel.
> > > > > > > Interests:
> > > > > > > > > distributed systems, scalable systems and platforms for big
> > > data
> > > > > > > > > > processing, concurrent algorithms and data structures.
> > > > > > > > >
> > > > > > > > > * Kevin Lang: (*) Distinguished Research Scientist, Yahoo
> > Labs,
> > > > > > > Sunnyvale,
> > > > > > > > > California. Interests: algorithms, theoretical and applied
> > > > > > mathematics,
> > > > > > > > > encoding and compression theory, theoretical and applied
> > > > performance
> > > > > > > > > optimization.
> > > > > > > > >
> > > > > > > > > * Edo Liberty: (*) Director of Research, Head of Amazon AI
> > > Labs,
> > > > Palo
> > > > > > > Alto,
> > > > > > > > > California. Manages the algorithms group at Amazon AI. We
> > build
> > > > > > > scalable
> > > > > > > > > machine learning systems and algorithms which are used both
> > > > > > internally
> > > > > > > and
> > > > > > > > > externally by customers of SageMaker, AWS's flagship
> machine
> > > > learning
> > > > > > > > > platform.
> > > > > > > > >
> > > > > > > > > * Jon Malkin: (*) Senior Scientist, Yahoo Labs, Sunnyvale.
> > > > Interests:
> > > > > > > > > Computational advertising, machine learning, speech
> > > recognition,
> > > > > > > > > data-driven analysis, large scale experimentation, big
> data,
> > > > > > > stream/complex
> > > > > > > > > > event processing.
> > > > > > > > >
> > > > > > > > > * Justin Thaler: (*) Assistant Professor, Department of
> > > Computer
> > > > > > > Science,
> > > > > > > > > Georgetown University, Washington D.C. Interests:
> algorithms
> > > and
> > > > > > > > > computational complexity, complexity theory, quantum
> > > algorithms,
> > > > > > > private
> > > > > > > > > data analysis, and learning theory, developing efficient
> > > > streaming
> > > > > > and
> > > > > > > > > > sketching algorithms.
> > > > > > > > >
> > > > > > > > > ==== Engineers That Love Science ====
> > > > > > > > >
> > > > > > > > > > * Roman Leventov: Senior Software Engineer, Metamarkets /
> > > Snap.
> > > > > > > Interests:
> > > > > > > > > design and implementation of data storing and data
> processing
> > > > > > > (distributed)
> > > > > > > > > systems, performance optimization, CPU performance,
> > mechanical
> > > > > > > sympathy,
> > > > > > > > > JVM performance, API design, databases, (concurrent) data
> > > > structures,
> > > > > > > > > memory management, garbage collection algorithms, language
> > > > design and
> > > > > > > > > runtimes (their tradeoffs), distributed systems (cloud)
> > > > efficiency,
> > > > > > > Linux,
> > > > > > > > > code quality, code transformation, pure functional
> > programming
> > > > > > models,
> > > > > > > > > Haskell.
> > > > > > > > >
> > > > > > > > > * Lee Rhodes: (*) Distinguished Architect, lead developer
> and
> > > > founder
> > > > > > > of
> > > > > > > > > the DataSketches project, Yahoo, Sunnyvale, California.
> > > > Interests:
> > > > > > > > > streaming algorithms, mathematics, computer science, high
> > > > quality and
> > > > > > > high
> > > > > > > > > performance code for the analysis of massive data, bridging
> > the
> > > > > > divide
> > > > > > > > > between theory and practice.
> > > > > > > > >
> > > > > > > > > * Alexander Saydakov: (*) Senior Software Engineer, Yahoo,
> > > > Sunnyvale,
> > > > > > > > > California. Interests: applied mathematics, computer
> science,
> > > big
> > > > > > data,
> > > > > > > > > distributed systems.
> > > > > > > > >
> > > > > > > > > === Introduction to Additional Interested Contributors ===
> > > > > > > > >
> > > > > > > > > These folks have been intermittently involved and
> > contributed,
> > > > but
> > > > > > are
> > > > > > > > > strong supporters of this project.
> > > > > > > > >
> > > > > > > > > * Frank Grimes: GitHub ID: frankgrimes97
> > > > > > > > >
> > > > > > > > > * Mina Ghashami: [mina.ghashami at gmail dot com] Ph.D.
> > > Computer
> > > > > > > Science,
> > > > > > > > > Univ of Utah. Interests: Machine Learning, Data Mining,
> > matrix
> > > > > > > > > approximation, streaming algorithms, randomized linear
> > algebra.
> > > > > > > > >
> > > > > > > > > * Christopher Musco: [christopher.musco at gmail dot com]
> > Ph.D.
> > > > > > > Computer
> > > > > > > > > Science, Research Instructor, Princeton University.
> > Interests:
> > > > > > > algorithmic
> > > > > > > > > foundations of data science and machine learning, efficient
> > > > methods
> > > > > > for
> > > > > > > > > processing and understanding large datasets, often working
> at
> > > the
> > > > > > > > > intersection of theoretical computer science, numerical
> > linear
> > > > > > > algebra, and
> > > > > > > > > optimization.
> > > > > > > > >
> > > > > > > > > * Graham Cormode: [g.cormode at warwick.ac dot uk] Ph.D.
> > > > Computer
> > > > > > > Science,
> > > > > > > > > Professor, Warwick University, Warwick, England. Interests:
> > all
> > > > > > > aspects of
> > > > > > > > > the "data lifecycle", from data collection and cleaning,
> > > through
> > > > > > > mining and
> > > > > > > > > analytics. (Professor Cormode is one of the world’s leading
> > > > > > scientists
> > > > > > > in
> > > > > > > > > sketching algorithms)
> > > > > > > > >
> > > > > > > > > === Alignment ===
> > > > > > > > >
> > > > > > > > > The DataSketches library already provides integrations and
> > > > example
> > > > > > > code for
> > > > > > > > > Apache Hive, Apache Pig, Apache Spark and is deeply
> > integrated
> > > > into
> > > > > > > Apache
> > > > > > > > > Druid.
> > > > > > > > >
> > > > > > > > > == Known Risks ==
> > > > > > > > >
> > > > > > > > > > The following subsections address specific risks that have been
> > > > > > > > > > identified by the ASF.
> > > > > > > > >
> > > > > > > > > === Risk: Orphaned Products ===
> > > > > > > > >
> > > > > > > > > The DataSketches library is presently used by a number of
> > > > > > > organizations,
> > > > > > > > > from small startups to Fortune 100 companies, to construct
> > > > production
> > > > > > > > > pipelines that must process and analyze massive data. Yahoo
> > > has a
> > > > > > > long-term
> > > > > > > > > commitment to continue to advance the DataSketches library;
> > > > moreover,
> > > > > > > > > DataSketches is seeing increasing interest, development,
> and
> > > > adoption
> > > > > > > from
> > > > > > > > > many diverse organizations from around the world. Due to
> its
> > > > growing
> > > > > > > > > adoption, we feel it is quite unlikely that this project
> > would
> > > > become
> > > > > > > > > orphaned.
> > > > > > > > >
> > > > > > > > > === Risk: Inexperience with Open Source ===
> > > > > > > > >
> > > > > > > > > Yahoo believes strongly in open source and the exchange of
> > > > > > information
> > > > > > > to
> > > > > > > > > advance new ideas and work. Examples of this commitment are
> > > > active
> > > > > > open
> > > > > > > > > source projects such as those mentioned above. With
> > > > DataSketches, we
> > > > > > > have
> > > > > > > > > been increasingly open and forward-looking; we have
> > published a
> > > > > > number
> > > > > > > of
> > > > > > > > > papers about breakthrough developments in the science of
> > > > streaming
> > > > > > > > > algorithms (mentioned above) that also reference the
> > > DataSketches
> > > > > > > library.
> > > > > > > > > Our submission to the Apache Software Foundation is a
> logical
> > > > > > > extension of
> > > > > > > > > our commitment to open source software.
> > > > > > > > >
> > > > > > > > > Key committers at Yahoo with strong open source backgrounds
> > > > include
> > > > > > > Aaron
> > > > > > > > > Gresch, Alan Carroll, Alessandro Bellina, Anastasia
> > Braginsky,
> > > > > > Andrews
> > > > > > > > > Sahaya Albert, Arun S A G, Atul Mohan, Brad McMillen, Bryan
> > > Call,
> > > > > > Daryn
> > > > > > > > > Sharp, Dav Glass, David Carlin, Derek Dagit, Eric Payne,
> > Eshcar
> > > > > > Hillel,
> > > > > > > > > Ethan Li, Fei Deng, Francis Christopher Liu, Francisco
> > > > > > Perez-Sorrosal,
> > > > > > > Gil
> > > > > > > > > > Yehuda, Govind Menon, Hang Yang, Jacob Estelle, Jai Asher,
> > > James
> > > > > > > Penick,
> > > > > > > > > Jason Kenny, Jay Pipes, Jim Rollenhagen, Joe Francis, Jon
> > > Eagles,
> > > > > > > Kihwal
> > > > > > > > > Lee, Kishorkumar Patil, Koji Noguchi, Kuhu Shukla, Michael
> > > > Trelinski,
> > > > > > > > > Mithun Radhakrishnan, Nathan Roberts, Ohad Shacham, Olga L.
> > > > > > Natkovich,
> > > > > > > > > Parth Kamlesh Gandhi, Rajan Dhabalia, Rohini Palaniswamy,
> > Ruby
> > > > Loo,
> > > > > > > Ryan
> > > > > > > > > Bridges, Sanket Chintapalli, Satish Subhashrao Saley, Shu
> Kit
> > > > Chan,
> > > > > > Sri
> > > > > > > > > Harsha Mekala, Susan Hinrichs, Yonatan Gottesman, and many
> > > more.
> > > > > > > > >
> > > > > > > > > > All of our core developers are committed to learning about the
> > > > Apache
> > > > > > > process
> > > > > > > > > and to give back to the community.
> > > > > > > > >
> > > > > > > > > === Risk: Homogeneous Developers ===
> > > > > > > > >
> > > > > > > > > The majority of committers in this proposal belong to Yahoo
> > due
> > > > to
> > > > > > the
> > > > > > > fact
> > > > > > > > > that DataSketches has emerged from an internal Yahoo
> project.
> > > > This
> > > > > > > proposal
> > > > > > > > > also includes developers and contributors from other
> > companies,
> > > > and
> > > > > > > who are
> > > > > > > > > actively involved with other Apache projects, such as
> Druid.
> > > We
> > > > > > > expect our
> > > > > > > > > entry into incubation will allow us to expand the number of
> > > > > > > individuals and
> > > > > > > > > organizations participating in DataSketches development.
> > > > > > > > >
> > > > > > > > > === Risk: Reliance on Salaried Developers ===
> > > > > > > > >
> > > > > > > > > Because the DataSketches library originated within Yahoo,
> it
> > > has
> > > > been
> > > > > > > > > developed primarily by salaried Yahoo developers and we
> > expect
> > > > that
> > > > > > to
> > > > > > > > > continue to be the case near term. However, since we placed
> > > this
> > > > > > > library
> > > > > > > > > into open-source we have had a number of significant
> > > > contributions
> > > > > > from
> > > > > > > > > engineers and scientists from outside of Yahoo. We expect
> our
> > > > > > reliance
> > > > > > > on
> > > > > > > > > Yahoo salaried developers will decrease over time.
> > Nonetheless,
> > > > Yahoo
> > > > > > > is
> > > > > > > > > committed to continue its strong support of this important
> > > > project.
> > > > > > > > >
> > > > > > > > > === Risk: Lack of Relationship to other Apache Products ===
> > > > > > > > >
> > > > > > > > > DataSketches already directly interoperates with or
> utilizes
> > > > several
> > > > > > > > > existing Apache projects.
> > > > > > > > >
> > > > > > > > > * Build
> > > > > > > > >    * Apache Maven
> > > > > > > > >
> > > > > > > > > * Integrations and adaptors for the following projects
> > > naturally
> > > > have
> > > > > > > them
> > > > > > > > > as dependencies
> > > > > > > > >    * Apache Hive
> > > > > > > > >    * Apache Pig
> > > > > > > > >    * Apache Druid
> > > > > > > > >    * Apache Spark
> > > > > > > > >
> > > > > > > > > * Additional dependencies for the above integrations and
> > > adaptors
> > > > > > > include
> > > > > > > > >    * Apache Hadoop
> > > > > > > > >    * Apache Commons (Math)
> > > > > > > > >
> > > > > > > > > There is no other Apache project that we are aware of that
> > > > duplicates
> > > > > > > the
> > > > > > > > > functionality of the DataSketches library.
> > > > > > > > >
> > > > > > > > > > === Risk: An Excessive Fascination with the Apache Brand ===
> > > > > > > > >
> > > > > > > > > With this proposal we are not seeking attention or
> publicity.
> > > > Rather,
> > > > > > > we
> > > > > > > > > firmly believe in the DataSketches library and concept and
> > the
> > > > > > ability
> > > > > > > to
> > > > > > > > > make the DataSketches library a powerful, yet simple-to-use
> > > > toolkit
> > > > > > for
> > > > > > > > > data processing. While the DataSketches library has been
> open
> > > > source,
> > > > > > > we
> > > > > > > > > believe putting code on GitHub can only go so far. We see
> the
> > > > Apache
> > > > > > > > > community, processes, and mission as critical for ensuring
> > the
> > > > > > > DataSketches
> > > > > > > > > library is truly community-driven, positively impactful,
> and
> > > > > > innovative
> > > > > > > > > open source software. While Yahoo has taken a number of
> steps
> > > to
> > > > > > > advance
> > > > > > > > > its various open source projects, we believe the
> DataSketches
> > > > library
> > > > > > > > > project is a great fit for the Apache Software Foundation
> due
> > > to
> > > > its
> > > > > > > focus
> > > > > > > > > on data processing and its relationships to existing ASF
> > > > projects.
> > > > > > > > >
> > > > > > > > > === Risk: Cryptography ===
> > > > > > > > >
> > > > > > > > > DataSketches does not contain any cryptographic code and is
> > > not a
> > > > > > > > > cryptographic product.
> > > > > > > > >
> > > > > > > > > == Documentation ==
> > > > > > > > >
> > > > > > > > > The following documentation is relevant to this proposal.
> > > > Relevant
> > > > > > > portions
> > > > > > > > > of the documentation will be contributed to the Apache
> > > > DataSketches
> > > > > > > > > project.
> > > > > > > > >
> > > > > > > > > * DataSketches website: https://datasketches.github.io.
> > > > > > > > >
> > > > > > > > > * DataSketches website repository:
> > > > > > > > > https://github.com/DataSketches/DataSketches.github.io
> > > > > > > > >
> > > > > > > > > We will need an apache website for this documentation
> similar
> > > to
> > > > > > > > >
> > > > > > > > > * https://datasketches.apache.org
> > > > > > > > >
> > > > > > > > > == Initial Source ==
> > > > > > > > >
> > > > > > > > > The initial source for DataSketches which we will submit to
> > the
> > > > > > Apache
> > > > > > > > > Foundation will include a number of repositories which are
> > > > currently
> > > > > > > hosted
> > > > > > > > > under the GitHub.com/datasketches organization:
> > > > > > > > >
> > > > > > > > > All github.com/datasketches repositories including:
> > > > > > > > >
> > > > > > > > > * Java
> > > > > > > > >    * sketches-core: This repository has the core sketching
> > > > classes,
> > > > > > > which
> > > > > > > > > are leveraged by some of the other repositories. This
> > > repository
> > > > has
> > > > > > no
> > > > > > > > > external dependencies outside of the DataSketches/memory
> > > > repository,
> > > > > > > Java
> > > > > > > > > and TestNG for unit tests. This code is versioned and the
> > > latest
> > > > > > > release
> > > > > > > > > can be obtained from Maven Central.
> > > > > > > > >    * memory: Low level, high-performance memory
> > data-structure
> > > > > > > management
> > > > > > > > > primarily for off-heap.
> > > > > > > > >    * sketches-android: This is a new repository dedicated
> to
> > > > sketches
> > > > > > > > > designed to be run in a mobile client, such as a cell
> phone.
> > It
> > > > is
> > > > > > > still in
> > > > > > > > > development and should be considered experimental.
> > > > > > > > >    * sketches-hive: This repository contains Hive UDFs and
> > > UDAFs
> > > > for
> > > > > > > use
> > > > > > > > > within Hadoop grid environments. This code has dependencies
> > on
> > > > > > > > > sketches-core as well as Hadoop and Hive. Users of this
> code
> > > are
> > > > > > > advised to
> > > > > > > > > use Maven to bring in all the required dependencies. This
> > code
> > > is
> > > > > > > versioned
> > > > > > > > > and the latest release can be obtained from Maven Central.
> > > > > > > > >    * sketches-pig: This repository contains Pig User
> Defined
> > > > > > Functions
> > > > > > > > > (UDF) for use within Hadoop grid environments. This code
> has
> > > > > > > dependencies
> > > > > > > > > on sketches-core as well as Hadoop and Pig. Users of this
> > code
> > > > are
> > > > > > > advised
> > > > > > > > > to use Maven to bring in all the required dependencies.
> This
> > > > code is
> > > > > > > > > versioned and the latest release can be obtained from Maven
> > > > Central.
> > > > > > > > >    * sketches-vector: This is a new repository dedicated to
> > > > sketches
> > > > > > > for
> > > > > > > > > vector and matrix operations. It is still somewhat
> > > experimental.
> > > > > > > > >    * characterization: This relatively new repository is
> for
> > > code
> > > > > > that
> > > > > > > we
> > > > > > > > > use to characterize the accuracy and speed performance of
> the
> > > > > > sketches
> > > > > > > in
> > > > > > > > > the library and is constantly being updated. Examples of
> the
> > > job
> > > > > > > command
> > > > > > > > > files used for various tests can be found in the
> > > > src/main/resources
> > > > > > > > > directory. Some of these tests can run for hours depending
> on
> > > their
> > > > > > > > > configuration.
> > > > > > > > >    * experimental: This repository is an experimental
> staging
> > > > area
> > > > > > for
> > > > > > > code
> > > > > > > > > that will eventually end up in another repository. This
> code
> > is
> > > > not
> > > > > > > > > versioned and not registered with Maven Central.
> > > > > > > > >    * sketches-misc: Demos and other code not related to
> > > > production
> > > > > > > > > deployment
> > > > > > > > >
> > > > > > > > > * C++ and Python
> > > > > > > > >    * sketches-core-cpp: This is the C++/Python companion to
> > the
> > > > Java
> > > > > > > > > sketches-core. These implementations are binary compatible
> > with
> > > > their
> > > > > > > > > counterparts in Java. In other words, a sketch created and
> > > > stored in
> > > > > > > C++
> > > > > > > > > > can be opened and read in Java and vice versa. This site
> also
> > > > has our
> > > > > > > > > Python adaptors that basically wrap the C++
> implementations,
> > > > making
> > > > > > the
> > > > > > > > > high performance C++ implementations available from Python.
> > > > > > > > >    * sketches-postgres: This site provides the
> > > postgres-specific
> > > > > > > adaptors
> > > > > > > > > that wrap the C++ implementations making them available to
> > the
> > > > > > Postgres
> > > > > > > > > database users.
> > > > > > > > >    * characterization-cpp: This is the C++/Python companion
> > to
> > > > the
> > > > > > Java
> > > > > > > > > characterization repository.
> > > > > > > > >    * experimental-cpp: This repository is an experimental
> > > staging
> > > > > > area
> > > > > > > for
> > > > > > > > > C++ code that will eventually end up in another repository.
> > > > > > > > >
> > > > > > > > > * Command-Line Tools
> > > > > > > > >    * sketches-cmd
> > > > > > > > >    * homebrew-sketches
> > > > > > > > >    * homebrew-sketches-cmd
> > > > > > > > >
> > > > > > > > > These projects have always been Apache 2.0 licensed. We
> > intend
> > > to
> > > > > > > bundle
> > > > > > > > > all of these repositories since they are all complementary
> > and
> > > > should
> > > > > > > be
> > > > > > > > > maintained in one project. Prior to our submission, we will
> > > > combine
> > > > > > > all of
> > > > > > > > > these projects into a new git repository.
> > > > > > > > >
> > > > > > > > > == Source and Intellectual Property Submission Plan ==
> > > > > > > > >
> > > > > > > > > Contributors to the DataSketches project have also signed
> the
> > > > Yahoo
> > > > > > > > > > Individual Contributor License Agreement (https://yahoocla.herokuapp.com/)
> > > > > > > > > > in order to contribute to the project.
> > > > > > > > >
> > > > > > > > > With respect to trademark rights, Yahoo does not hold a
> > > > trademark on
> > > > > > > the
> > > > > > > > > phrase “DataSketches.” Based on feedback and guidance we
> > > receive
> > > > > > > during the
> > > > > > > > > incubation process, we are open to renaming the project if
> > > > necessary
> > > > > > > for
> > > > > > > > > trademark or other concerns, but we would prefer not to
> have
> > to
> > > > do
> > > > > > > that.
> > > > > > > > >
> > > > > > > > > == External Dependencies ==
> > > > > > > > >
> > > > > > > > > All external dependencies are licensed under an Apache 2.0
> or
> > > > > > > > > Apache-compatible license. As we grow the DataSketches
> > > community
> > > > we
> > > > > > > will
> > > > > > > > > > configure our build process to require and validate that all
> > > > contributions
> > > > > > > and
> > > > > > > > > dependencies are licensed under the Apache 2.0 license or
> are
> > > > under
> > > > > > an
> > > > > > > > > Apache-compatible license.
> > > > > > > > >
> > > > > > > > > == Required Resources ==
> > > > > > > > >
> > > > > > > > > === Mailing Lists ===
> > > > > > > > >
> > > > > > > > > We currently use a mix of mailing lists. We will migrate
> our
> > > > existing
> > > > > > > > > mailing lists to the following:
> > > > > > > > >
> > > > > > > > > * dev@datasketches.incubator.apache.org
> > > > > > > > >
> > > > > > > > > * user@datasketches.incubator.apache.org
> > > > > > > > >
> > > > > > > > > * private@datasketches.incubator.apache.org
> > > > > > > > >
> > > > > > > > > * commits@datasketches.incubator.apache.org
> > > > > > > > >
> > > > > > > > > === Source Control ===
> > > > > > > > >
> > > > > > > > > The DataSketches team currently uses Git and would like to
> > > > continue
> > > > > > to
> > > > > > > do
> > > > > > > > > so. We request a Git repository for DataSketches with
> > mirroring
> > > > to
> > > > > > > GitHub
> > > > > > > > > > enabled similar to the following:
> > > > > > > > >
> > > > > > > > > * https://github.com/apache/incubator-datasketches.git
> > > > > > > > >
> > > > > > > > > === Issue Tracking ===
> > > > > > > > >
> > > > > > > > > We request the creation of an Apache-hosted JIRA. The
> > > > DataSketches
> > > > > > > project
> > > > > > > > > is currently using the public GitHub issue tracker and the
> > > public
> > > > > > > Google
> > > > > > > > > Groups forum/sketches-user for issue tracking and
> > discussions.
> > > We
> > > > > > will
> > > > > > > > > migrate and combine from these two sources to the Apache
> > JIRA.
> > > > > > > > >
> > > > > > > > > Proposed Jira ID: DATASKETCHES
> > > > > > > > >
> > > > > > > > > == Initial Committers ==
> > > > > > > > >
> > > > > > > > > The following list of individuals have been extremely
> active
> > in
> > > > our
> > > > > > > > > community and should have write (commit) permissions to the
> > > > > > repository.
> > > > > > > > >
> > > > > > > > > > * Eshcar Hillel [eshcar at verizonmedia dot com]
> > > > > > > > > >
> > > > > > > > > > * Kevin Lang [langk at verizonmedia dot com]
> > > > > > > > > >
> > > > > > > > > > * Roman Leventov [roman.leventov at c.metamarkets dot com]
> > > > > > > > > >
> > > > > > > > > > * Edo Liberty [libertye at amazon dot com]
> > > > > > > > > >
> > > > > > > > > > * Jon Malkin [jmalkin at verizonmedia dot com]
> > > > > > > > > >
> > > > > > > > > > * Lee Rhodes [lrhodes at verizonmedia dot com] & [leerho at gmail dot com]
> > > > > > > > > >
> > > > > > > > > > * Alexander Saydakov [saydakov at verizonmedia dot com]
> > > > > > > > > >
> > > > > > > > > > * Justin Thaler [justin.thaler at georgetown dot edu]
> > > > > > > > >
> > > > > > > > > == Affiliations ==
> > > > > > > > >
> > > > > > > > > The initial committers are from four organizations: Yahoo,
> > > > Amazon,
> > > > > > > > > Georgetown University, and Metamarkets/Snap.
> > > > > > > > >
> > > > > > > > > === Champion ===
> > > > > > > > > (Recommended to me: )
> > > > > > > > >
> > > > > > > > > > Liang Chen, Vice President of Apache CarbonData, [chenliang613 at apache dot org]
> > > > > > > > > > Jean-Baptiste Onofré, [jb at nanthrax dot net]
> > > > > > > > >
> > > > > > > > > === Nominated Mentors ===
> > > > > > > > > (Recommended to me: )
> > > > > > > > >
> > > > > > > > > > Liang Chen, Vice President of Apache CarbonData, [chenliang613 at apache dot org]
> > > > > > > > > Jean-Baptiste Onofré, jb at nanthrax dot net
> > > > > > > > > Gil Yehuda, gyehuda at verizonmedia dot com
> > > > > > > > >
> > > > > > > > > === Sponsoring Entity ===
> > > > > > > > >
> > > > > > > > > * The Apache Incubator    **** This is our 1st choice ****
> > > > > > > > >
> > > > > > > > > * Apache Druid. The incubating Apache Druid project might
> > also
> > > > be a
> > > > > > > logical
> > > > > > > > > sponsor. However, DataSketches has applications in many
> areas
> > > of
> > > > > > > computing
> > > > > > > > > outside of Druid so our preference and recommendation is
> that
> > > > > > > DataSketches
> > > > > > > > > would ultimately be a top-level Apache project.
> > > > > > > > >
> > > > > > > > > ________________
> > > > > > > > > [1] In 2017 Verizon acquired Yahoo and merged it with
> > > previously
> > > > > > > acquired
> > > > > > > > > AOL. The merged entity was originally called Oath, Inc.,
> but
> > > has
> > > > > > > recently
> > > > > > > > > been renamed Verizon Media, Inc., a wholly-owned subsidiary
> > of
> > > > > > Verizon,
> > > > > > > > > Inc.  Since Yahoo is the more recognized name, references
> in
> > > this
> > > > > > > document
> > > > > > > > > to Yahoo, are also a reference to Verizon Media, Inc.
> > > > > > > > >
> > > > > > > > > > On Fri, Feb 22, 2019 at 9:35 PM Kenneth Knowles <kenn@apache.org> wrote:
> > > > > > > > >
> > > > > > > > > > The subject line has me interested already. Follow
> examples
> > > > like
> > > > > > this
> > > > > > > > > > maybe?
> > > > > > > > > >
> > > > > > > > > > 1.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > > >
> > > >
> > >
> >
> https://lists.apache.org/thread.html/a5db74cc9e5ae89b3bfa5f4b07bfcc18dae84b7098232fb897cd47b7@%3Cgeneral.incubator.apache.org%3E
> > > > > > > > > > 2.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > > >
> > > >
> > >
> >
> https://lists.apache.org/thread.html/5a7f6a218b11a1cac61fbd53f4c995fd7716f8ad3751cf9f171ebd57@%3Cgeneral.incubator.apache.org%3E
> > > > > > > > > >
> > > > > > > > > > Kenn
> > > > > > > > > >
> > > > > > > > > > On Fri, Feb 22, 2019 at 8:05 PM leerho <leerho@gmail.com
> >
> > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > I'll try again ... :)
> > > > > > > > > > >
> > > > > > > > > > > On Fri, Feb 22, 2019 at 8:00 PM Ted Dunning <
> > > > > > ted.dunning@gmail.com
> > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > >> It didn't make it again
> > > > > > > > > > >>
> > > > > > > > > > >> On Fri, Feb 22, 2019, 8:35 PM leerho <
> leerho@gmail.com>
> > > > wrote:
> > > > > > > > > > >>
> > > > > > > > > > >> > I'm not sure the attached document made it through.
> > > > > > > > > > >> >
> > > > > > > > > > >> > On Fri, Feb 22, 2019 at 7:28 PM leerho <
> > > leerho@gmail.com>
> > > > > > > wrote:
> > > > > > > > > > >> >
> > > > > > > > > > >> > >
> > > > > > > > > > >> > >
> > > > > > > > > > >> >
> > > > > > > > > > >>
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > >
> > > ---------------------------------------------------------------------
> > > > > > > > > > > To unsubscribe, e-mail:
> > > > general-unsubscribe@incubator.apache.org
> > > > > > > > > > > For additional commands, e-mail:
> > > > > > general-help@incubator.apache.org
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > >
> > > > > --
> > > > > From my cell phone.
> > > > >
> > > >
> > > >
> > > >
> > >
> >
> --
> Sent from my Mobile device
>

Re: DataSketches Proposal - Google Docs Link

Posted by Luciano Resende <lu...@gmail.com>.

Should we move the proposal to the incubator wiki then?

On Mon, Feb 25, 2019 at 15:26 leerho <le...@gmail.com> wrote:

> Ken,
> Yahoo does not allow me to create a shared link outside our company, except
> to individual email addresses.  So attempting to share it to the email
> general@incubator.apache.org may not work.  Nonetheless, several
> individuals were able to request access using their individual email
> accounts and I was able to add them.  I will try to add you using
> kenn@apache.org, but if that doesn't work, I may need a gmail or
> equivalent
> account for you.
>
> Lee.
>
>
> On Mon, Feb 25, 2019 at 2:59 PM Kenneth Knowles <ke...@apache.org> wrote:
>
> > I could not access that document. I suggest you need to turn on link
> > sharing.
> >
> > Kenn
> >
> > On Mon, Feb 25, 2019 at 12:00 PM leerho@gmail.com <le...@gmail.com>
> > wrote:
> >
> > > Try this link:
> > >
> >
> https://docs.google.com/document/d/19JKevzFQNcaLA51LFLUlP1hzdFDW7oDJrJO8N6weDv8/edit?usp=sharing
> > >
> > >
> > > On 2019/02/25 05:55:50, leerho <le...@gmail.com> wrote:
> > > > Yes I will try that tomorrow.
> > > >
> > > > On Sun, Feb 24, 2019 at 7:34 PM Kenneth Knowles <ke...@apache.org>
> > wrote:
> > > >
> > > > > Can you share the Google doc with the proposal? Per Ted's advice,
> we
> > > can
> > > > > iterate quickly there and move it to the wiki when it becomes a bit
> > > more
> > > > > stable.
> > > > >
> > > > > Kenn
> > > > >
> > > > > On Fri, Feb 22, 2019 at 10:21 PM leerho@gmail.com <
> leerho@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > > Thanks for the offer.  I am a neophyte at this process and email
> > > app!   I
> > > > > > could use a lot of help getting this off the ground!  Also, I'm
> not
> > > sure
> > > > > > that Mr. Chen and Mr. Onofré have fully accepted taking this on
> :)
> > > > > >
> > > > > > Lee.
> > > > > >
> > > > > > On 2019/02/23 06:03:58, Kenneth Knowles <ke...@apache.org> wrote:
> > > > > > > Nice.
> > > > > > >
> > > > > > > I would very much like to help mentor this project, though you
> > > already
> > > > > > have
> > > > > > > a couple good ones.
> > > > > > >
> > > > > > > I concur with incubator as sponsoring entity.
> > > > > > >
> > > > > > > Kenn (VP Apache Beam)
> > > > > > >
> > > > > > > On Fri, Feb 22, 2019 at 9:45 PM leerho <le...@gmail.com>
> wrote:
> > > > > > >
> > > > > > > > I didn't realize that this mail list does not accept PDF
> files,
> > > > > > apparently
> > > > > > > > only text.  So let me try one more time ... :)  Please let me
> > > know if
> > > > > > > > this works!
> > > > > > > >
> > > > > > > >
> > > > > > > > = Apache DataSketches Proposal[1] =
> > > > > > > >
> > > > > > > > == Abstract ==
> > > > > > > >
> > > > > > > > DataSketches.GitHub.io is an open source, high-performance
> > > library
> > > > > of
> > > > > > > > stochastic streaming algorithms commonly called "sketches" in
> > the
> > > > > data
> > > > > > > > sciences. Sketches are small, stateful programs that process
> > > massive
> > > > > > data
> > > > > > > > as a stream and can provide approximate answers, with
> > > mathematical
> > > > > > > > guarantees, to computationally difficult queries
> > > orders-of-magnitude
> > > > > > faster
> > > > > > > > than traditional, exact methods.
> > > > > > > >
> > > > > > > > This proposal is to move DataSketches to the Apache Software
> > > > > > > > > Foundation (ASF), transferring ownership of its copyright
> > > intellectual
> > > > > > > > property to the ASF.  Thereafter, DataSketches would be
> > > officially
> > > > > > known as
> > > > > > > > Apache DataSketches and its evolution and governance would
> come
> > > under
> > > > > > the
> > > > > > > > rules and guidance of the ASF.
> > > > > > > >
> > > > > > > > == Introduction ==
> > > > > > > >
> > > > > > > > The DataSketches library contains carefully crafted
> > > implementations
> > > > > of
> > > > > > > > sketch algorithms that meet rigorous standards of quality and
> > > > > > performance
> > > > > > > > and provide capabilities required for large-scale production
> > > systems
> > > > > > that
> > > > > > > > must process and analyze massive data. The DataSketches core
> > > > > > repository is
> > > > > > > > written in Java with a parallel core repository written in
> C++
> > > that
> > > > > > > > includes Python wrappers. The DataSketches library also
> > includes
> > > > > > special
> > > > > > > > repositories for extending the core library for Apache Hive
> and
> > > > > Apache
> > > > > > Pig.
> > > > > > > > The sketches developed in the different languages share a
> > common
> > > > > binary
> > > > > > > > storage format so that sketches created and stored in Java,
> for
> > > > > > example,
> > > > > > > > > can be fully used in C++, and vice versa.  Because the stored
> > > sketch
> > > > > > > > "images" are just a "blob" of bytes (similar to picture
> > images),
> > > they
> > > > > > can
> > > > > > > > be shared across many different systems, languages and
> > platforms.
> > > > > > > >
> > > > > > > > The DataSketches documentation website,
> > > > > https://datasketches.github.io
> > > > > > ,
> > > > > > > > includes general tutorials, a comprehensive research section
> > with
> > > > > > > > references to relevant academic papers, extensive examples
> for
> > > using
> > > > > > the
> > > > > > > > core library directly as well as examples for accessing the
> > > library
> > > > > in
> > > > > > > > Hive, Pig, and Apache Spark.
> > > > > > > >
> > > > > > > > The DataSketches library also includes a characterization
> > > repository
> > > > > > for
> > > > > > > > long running test programs that are used for studying
> accuracy
> > > and
> > > > > > > > performance of these sketches over wide ranges of input
> > > variables.
> > > > > The
> > > > > > data
> > > > > > > > produced by these programs is used for generating the many
> > > > > performance
> > > > > > > > plots contained in the documentation website and for academic
> > > > > > > > publications.
> > > > > > > >
> > > > > > > > The code repositories used for production are versioned and
> > > published
> > > > > > to
> > > > > > > > Maven Central on periodic intervals as the library evolves.
> > > > > > > >
> > > > > > > > The DataSketches library also includes several experimental
> > > > > > repositories
> > > > > > > > for use-cases outside the large-scale systems environments,
> > such
> > > as
> > > > > > > > sketches for mobile, IoT devices (Android), command-line
> access
> > > of
> > > > > the
> > > > > > > > sketch library, and an experimental repository for
> vector-based
> > > > > > sketches
> > > > > > > > that performs approximate Singular Value Decomposition (SVD)
> > > analysis
> > > > > > that
> > > > > > > > could potentially be used in Machine Learning (ML)
> > applications.
> > > > > > > >
> > > > > > > > == Background ==
> > > > > > > >
> > > > > > > > > The DataSketches library was started in 2012 as an internal
> Yahoo
> > > > > project
> > > > > > to
> > > > > > > > dramatically reduce time and resources required for distinct
> > > (unique)
> > > > > > > > counting.  An extensive search on the Internet at the time
> > > yielded a
> > > > > > number
> > > > > > > > of theoretical papers on stochastic streaming algorithms with
> > > > > > pseudocode
> > > > > > > > examples, but we did not find any usable open-source code of
> > the
> > > > > > quality we
> > > > > > > > felt we needed for our internal production systems.  So we
> > > started a
> > > > > > small
> > > > > > > > project (one person) to develop our own sketches working
> > directly
> > > > > from
> > > > > > > > published theoretical papers.
> > > > > > > >
> > > > > > > > The DataSketches library was designed from the start with the
> > > > > > objective of
> > > > > > > > making these algorithms, usually only described in
> theoretical
> > > > > papers,
> > > > > > > > easily accessible to systems developers for use in our
> internal
> > > > > > production
> > > > > > > > systems. By necessity, the code had to be of the highest
> > quality
> > > and
> > > > > > > > thoroughly tested. The wide variety of our internal
> production
> > > > > systems
> > > > > > > > drove the requirement that the sketch implementations had to
> > > have an
> > > > > > > > absolute minimum of external, run-time dependencies in order
> to
> > > > > > simplify
> > > > > > > > integration and troubleshooting.
> > > > > > > >
> > > > > > > > Our internal experiments demonstrated dramatic positive
> impact
> > > on the
> > > > > > > > performance of our systems.  As a result, the DataSketches
> > > library
> > > > > > quickly
> > > > > > > > evolved to include different types of sketches for different
> > > types of
> > > > > > > > > queries, such as frequent-items (a.k.a. heavy-hitters)
> > > algorithms,
> > > > > > > > quantile/histogram algorithms, and weighted and unweighted
> > > sampling
> > > > > > > > algorithms.
> > > > > > > >
> > > > > > > > We quickly discovered that developing these sketch algorithms
> > to
> > > be
> > > > > > truly
> > > > > > > > robust in production environments is quite difficult and
> > requires
> > > > > deep
> > > > > > > > understanding of the underlying mathematics and statistics as
> > > well as
> > > > > > > > extensive experience in developing high quality code for 24/7
> > > > > > production
> > > > > > > > systems. This is a difficult combination of skills for any
> one
> > > > > > organization
> > > > > > > > to collect and maintain over time. It became clear that this
> > > > > technology
> > > > > > > > needed a community larger than Yahoo to evolve.  In November,
> > > 2015,
> > > > > > this
> > > > > > > > factor, along with Yahoo’s strong experience and support of
> > open
> > > > > > source,
> > > > > > > > led to the decision to open source this technology under an
> > > Apache
> > > > > 2.0
> > > > > > > > license on GitHub. Since that time our community has expanded
> > > > > > considerably
> > > > > > > > > and the key contributors to this effort include leading
> > research
> > > > > > > > scientists from a number of universities as well as
> > > practitioners and
> > > > > > > > researchers from a number of major corporations. The core of
> > this
> > > > > > group is
> > > > > > > > very active as we meet weekly to discuss research directions
> > and
> > > > > > > > engineering priorities.
> > > > > > > >
> > > > > > > > It is important to note that our internal systems at Yahoo
> use
> > > the
> > > > > > current
> > > > > > > > public GitHub open source DataSketches library and not an
> > > internal
> > > > > > version
> > > > > > > > of the code.
> > > > > > > >
> > > > > > > > The close collaboration of scientific research and
> engineering
> > > > > > development
> > > > > > > > experience with actual massive-data processing systems has
> also
> > > > > > produced
> > > > > > > > new research publications in the field of stochastic
> streaming
> > > > > > algorithms,
> > > > > > > > for example:
> > > > > > > >
> > > > > > > > * Daniel Anderson, Pryce Bevan, Kevin J. Lang, Edo Liberty,
> Lee
> > > > > > Rhodes, and
> > > > > > > > Justin Thaler. A high-performance algorithm for identifying
> > > frequent
> > > > > > items
> > > > > > > > in data streams. In ACM IMC 2017.
> > > > > > > >
> > > > > > > > * Anirban Dasgupta, Kevin J. Lang, Lee Rhodes, and Justin
> > > Thaler. A
> > > > > > > > > framework for estimating stream expression cardinalities. In EDBT/ICDT
> > > > > > > > > Proceedings ‘16, pages 6:1–6:17, 2016.
> > > > > > > >
> > > > > > > > * Mina Ghashami, Edo Liberty, Jeff M. Phillips. Efficient
> > > Frequent
> > > > > > > > Directions Algorithm for Sparse Matrices. In ACM SIGKDD
> > > Proceedings
> > > > > > ‘16,
> > > > > > > > pages 845-854, 2016.
> > > > > > > >
> > > > > > > > * Zohar S. Karnin, Kevin J. Lang, and Edo Liberty. Optimal
> > > quantile
> > > > > > > > approximation in streams. In IEEE FOCS Proceedings ‘16, pages
> > > 71–78,
> > > > > > 2016.
> > > > > > > >
> > > > > > > > * Kevin J Lang. Back to the future: an even more nearly
> optimal
> > > > > > cardinality
> > > > > > > > estimation algorithm. arXiv preprint
> > > > > https://arxiv.org/abs/1708.06839,
> > > > > > > > 2017.
> > > > > > > >
> > > > > > > > * Edo Liberty. Simple and deterministic matrix sketching. In
> > ACM
> > > KDD
> > > > > > > > > Proceedings ‘13, pages 581–588, 2013.
> > > > > > > >
> > > > > > > > * Edo Liberty, Michael Mitzenmacher, Justin Thaler, and
> > Jonathan
> > > > > > Ullman.
> > > > > > > > Space lower bounds for itemset frequency sketches. In ACM
> PODS
> > > > > > Proceedings
> > > > > > > > ‘16, pages 441–454, 2016.
> > > > > > > >
> > > > > > > > * Michael Mitzenmacher, Thomas Steinke, and Justin Thaler.
> > > > > Hierarchical
> > > > > > > > heavy hitters with the space saving algorithm. In SIAM ALENEX
> > > > > > Proceedings
> > > > > > > > ‘12, pages 160–174, 2012.
> > > > > > > >
> > > > > > > > == The Rationale for Sketches ==
> > > > > > > >
> > > > > > > > In the analysis of big data there are often problem queries
> > that
> > > > > don’t
> > > > > > > > scale because they require huge compute resources and time to
> > > > > generate
> > > > > > > > exact results. Examples include count distinct, quantiles,
> most
> > > > > > frequent
> > > > > > > > items, joins, matrix computations, and graph analysis.
> > > > > > > >
> > > > > > > > If we can loosen the requirement of “exact” results from our
> > > queries
> > > > > > and be
> > > > > > > > satisfied with approximate results, within some well
> understood
> > > > > bounds
> > > > > > of
> > > > > > > > error, there is an entire branch of mathematics and data
> > science
> > > that
> > > > > > has
> > > > > > > > evolved around developing algorithms that can produce
> > approximate
> > > > > > results
> > > > > > > > with mathematically well-defined error properties.
> > > > > > > >
> > > > > > > > > Adding the requirements that these algorithms be small
> > > > > > > > > (compared to the size of the input data), sublinear (the size
> > > > > > > > > of the sketch must grow at a slower rate than the size of the
> > > > > > > > > input stream), streaming (they can only touch each data item
> > > > > > > > > once), and mergeable (suitable for distributed processing)
> > > > > > > > > defines a class of algorithms that can be described as small,
> > > > > > > > > stochastic, streaming, sublinear, mergeable algorithms,
> > > > > > > > > commonly called sketches (they also have other names, but we
> > > > > > > > > will use the term sketches from here on).
> > > > > > > >
> > > > > > > > To be truly streaming and be able to process data in a single
> > > pass,
> > > > > > > > sketches must make absolute minimum assumptions about the
> input
> > > > > stream.
> > > > > > > > This is critically important, as there is no “second chance”
> to
> > > > > > process the
> > > > > > > > data.
> > > > > > > >
> > > > > > > > For example, sketches should not make assumptions about the
> > > order of
> > > > > > stream
> > > > > > > > items, the stream length, the dynamic range of values, or the
> > > > > > distribution
> > > > > > > > of item occurrence frequencies. Sketches should be tolerant
> of
> > > NaNs,
> > > > > > Nulls
> > > > > > > > and empty objects. About the only thing that the sketch needs
> > to
> > > know
> > > > > > about
> > > > > > > > the stream is how to extract items from it and what type the
> > > item is,
> > > > > > e.g.,
> > > > > > > > is it a numeric value or a string.
> > > > > > > >
> > > > > > > > As far as the sketch is concerned, the input stream is a
> > > sequence of
> > > > > > items
> > > > > > > > in some unknown random order with unknown random values.
> > > > > > > >
> > > > > > > > > The sketch is essentially a complex state machine and, combined
> > > > > > > > > with the random input stream, defines a stochastic process. We
> > > > > > > > > then apply
> > > > > > > > probabilistic methods to interpret the states of the
> stochastic
> > > > > > process in
> > > > > > > > order to extract useful information about the input stream
> > > itself.
> > > > > The
> > > > > > > > resulting information will be approximate, but we also use
> > > additional
> > > > > > > > probabilistic methods to extract an estimate of the likely
> > > > > probability
> > > > > > > > distribution of error.
> > > > > > > >
> > > > > > > > > There is a significant scientific contribution here in defining
> > > > > > > > > the state machine, understanding the resulting stochastic
> > > > > > > > > process, developing the probabilistic methods, and proving
> > > > > > > > > mathematically that it all works!
> > > > > > > > This is why the scientific contributors to this project are a
> > > > > critical
> > > > > > and
> > > > > > > > strategic component to our success.  The development
> engineers
> > > > > > translate
> > > > > > > > the concepts of the proposed state machine and probabilistic
> > > methods
> > > > > > into
> > > > > > > > production-quality code. Even more important, they work
> closely
> > > with
> > > > > > the
> > > > > > > > scientists, feeding back system and user requirements, which
> > > leads
> > > > > not
> > > > > > only
> > > > > > > > > to superior product design, but to new science as well.  A
> > > > > > > > > number of the scientific papers our members have published (see
> > > > > > > > > above) are a direct result of this close collaboration.
> > > > > > > >
> > > > > > > > Because sketches are small they can be processed extremely
> > fast,
> > > > > often
> > > > > > many
> > > > > > > > orders-of-magnitude faster than traditional exact
> computations.
> > > For
> > > > > > > > interactive queries there may not be other viable
> alternatives,
> > > and
> > > > > in
> > > > > > the
> > > > > > > > case of real-time analysis, sketches are the only known
> > solution.
> > > > > > > >
> > > > > > > > > For any system that needs to extract useful information from
> > > > > > > > > massive data, sketches are essential tools that should be
> > > > > > > > > tightly integrated into the system’s analysis capabilities.
> > > > > > > > > This technology has helped Yahoo
> > > > > > > > successfully reduce data processing times from days to hours
> or
> > > > > > minutes on
> > > > > > > > a number of its internal platforms and has enabled subsecond
> > > queries
> > > > > on
> > > > > > > > real-time platforms that would have been infeasible without
> > > sketches.
> > > > > > > > >
> > > > > > > > > == The Rationale for Apache DataSketches ==
> > > > > > > > >
> > > > > > > > Other open source implementations of sketch algorithms can be
> > > found
> > > > > on
> > > > > > the
> > > > > > > > Internet. However, we have not yet found any open source
> > > > > > implementations
> > > > > > > > that are as comprehensive, engineered with the quality
> required
> > > for
> > > > > > > > production systems, and with usable and guaranteed error
> > > properties.
> > > > > > Large
> > > > > > > > > Internet companies, such as Google and Facebook, have published
> > > > > > > > > papers on sketching; however, their implementations of their
> > > > > > > > > published algorithms are proprietary and not available as open
> > > > > > > > > source.
> > > > > > > >
> > > > > > > > The DataSketches library already provides integrations with a
> > > number
> > > > > of
> > > > > > > > major Apache data processing platforms such as Apache Hive,
> > > Apache
> > > > > Pig,
> > > > > > > > Apache Spark and Apache Druid, and is also integrated with a
> > > number
> > > > > of
> > > > > > > > other open source data processing platforms such as Splice
> > > Machine,
> > > > > > GCHQ
> > > > > > > > Gaffer and PostgreSQL.
> > > > > > > >
> > > > > > > > We believe that having DataSketches as an Apache project will
> > > provide
> > > > > > an
> > > > > > > > immediate, worthwhile, and substantial contribution to the
> open
> > > > > source
> > > > > > > > community, will have a better opportunity to provide a
> > meaningful
> > > > > > > > contribution to both the science and engineering of sketching
> > > > > > algorithms,
> > > > > > > > and integrate with other Apache projects.  In addition, this
> > is a
> > > > > > > > significant opportunity for Apache to be the "go-to"
> > destination
> > > for
> > > > > > users
> > > > > > > > that want to leverage this exciting technology.
> > > > > > > >
> > > > > > > > == Initial Goals ==
> > > > > > > >
> > > > > > > > > We are breaking our initial goals into short-term (2-6 months)
> > > > > > > > > and intermediate to long-term (6 months to 2 years):
> > > > > > > >
> > > > > > > > Our short-term goals include:
> > > > > > > >
> > > > > > > > * Understanding and adapting to the Apache development
> process
> > > and
> > > > > > > > structures.
> > > > > > > >
> > > > > > > > > * Starting to refactor the codebase and moving the code of the
> > > > > > > > > various DataSketches repositories to an Apache Git repository.
> > > > > > > > >
> > > > > > > > > * Continuing development of new features, functions, and fixes.
> > > > > > > > >
> > > > > > > > > * Continuing to develop and expand specific sub-projects (e.g.,
> > > > > > > > > C++ and Python).
> > > > > > > >
> > > > > > > >
> > > > > > > > The intermediate to long term goals include:
> > > > > > > >
> > > > > > > > * Completing the design and implementation of the C++
> sketches
> > to
> > > > > > > > complement what is already available in Java, and the Python
> > > wrappers
> > > > > > of
> > > > > > > > those C++ sketches.
> > > > > > > >
> > > > > > > > * Expanding the C++ build framework to include Windows and
> the
> > > > > popular
> > > > > > > > Linux variants.
> > > > > > > >
> > > > > > > > * Continued engagement with the scientific research community
> > on
> > > the
> > > > > > > > development of new algorithms for computationally difficult
> > > problems
> > > > > > that
> > > > > > > > heretofore have not had a sketching solution.
> > > > > > > >
> > > > > > > > == Current Status ==
> > > > > > > >
> > > > > > > > > The DataSketches GitHub project has been quite successful. As of
> > > > > > > > > this writing (February 2019) the number of downloads measured
> > > > > > > > > by the Nexus Repository Manager at https://oss.sonatype.org has
> > > > > > > > > grown by nearly a factor of 10 over the past year to about 55
> > > > > > > > > thousand per month. The DataSketches/sketches-core repository
> > > > > > > > > has about 560 stars and 141 forks, which is pretty good for a
> > > > > > > > > highly specialized library.
> > > > > > > >
> > > > > > > > === Development Practices ===
> > > > > > > >
> > > > > > > > ==== Source Control ====
> > > > > > > >
> > > > > > > > All of our developers have extensive experience with Git
> > version
> > > > > > control
> > > > > > > > and follow accepted practices for use of Pull Requests (PRs),
> > > code
> > > > > > reviews
> > > > > > > > and commits to master, for example.
> > > > > > > >
> > > > > > > > ==== Testing ====
> > > > > > > >
> > > > > > > > > Sketches, by their nature, are probabilistic programs and don’t
> > > > > > > > > necessarily behave deterministically.  For some of the sketches
> > > > > > > > > we intentionally insert random noise into the code as this
> > > > > > > > > gives us the mathematical properties that we need to guarantee
> > > > > > > > > accuracy.  This can make the behavior of these algorithms quite
> > > > > > > > > unintuitive and poses significant challenges to the developer
> > > > > > > > > who wishes to test these algorithms for correctness. As a
> > > > > > > > > result, our testing strategy includes two major components:
> > > > > > > > > unit tests and characterization tests.
> > > > > > > >
> > > > > > > > ===== Unit Testing =====
> > > > > > > >
> > > > > > > > Our unit tests are primarily quick tests to make sure that we
> > > > > exercise
> > > > > > all
> > > > > > > > critical paths in the code and that key branches are executed
> > > > > > correctly. It
> > > > > > > > is important that they execute relatively fast as they are
> > > generally
> > > > > > run on
> > > > > > > > every code build. The sketches-core repository alone has
> about
> > 22
> > > > > > thousand
> > > > > > > > statements, over 1300 unit tests and code coverage of about
> > > 98.2% as
> > > > > > > > > measured by Atlassian/Clover.  Our goal is for all of our code
> > > > > > > > > repositories that are used in production to have code coverage
> > > > > > > > > greater than 90%.
> > > > > > > >
> > > > > > > > ===== Characterization Testing =====
> > > > > > > >
> > > > > > > > In order to test the probabilistic methods that are used to
> > > interpret
> > > > > > the
> > > > > > > > stochastic behaviors of our sketches we have a separate
> > > > > > characterization
> > > > > > > > > repository that is dedicated to this.  Measuring accuracy, for
> > > > > > > > > example, requires running thousands of trials at each of many
> different
> > > points
> > > > > > along
> > > > > > > > the domain axis. Each trial compares its estimated results
> > > against a
> > > > > > known
> > > > > > > > exact result producing an error for that trial.  These error
> > > > > > measurements
> > > > > > > > are then fed into our Quantiles sketch to capture the actual
> > > > > > distribution
> > > > > > > > of error at that point along the axis. We then select
> quantile
> > > > > contours
> > > > > > > > across all the distributions at points along the axis.  These
> > > > > contours
> > > > > > can
> > > > > > > > then be plotted to reveal the shape of the actual error
> > > distribution.
> > > > > > These
> > > > > > > > > distributions are not at all Gaussian; in fact, they can be
> > > > > > > > > quite complex. Nonetheless, these distributions are then
> > > > > > > > > checked against the statistical guarantees inherent to the
> > > > > > > > > specific sketch algorithm and its parameters.
> > > > > > > > There are many examples of these characterization error
> > > distributions
> > > > > > on
> > > > > > > > our website. The runtimes of these tests can be very long and
> > can
> > > > > range
> > > > > > > > from many minutes to hours, and some can run for days.
> > > Currently, we
> > > > > > have
> > > > > > > > separate characterization repositories for Java and C++ /
> > Python.
> > > > > > > >
> > > > > > > > It is our goal that we perform this characterization analysis
> > > for all
> > > > > > of
> > > > > > > > our sketches.  By definition, the code that runs these
> > > > > characterization
> > > > > > > > tests is open-source so others can run these tests as well.
> We
> > > do
> > > > > not
> > > > > > have
> > > > > > > > formal releases of this code (because it is not production
> > code)
> > > and
> > > > > > it is
> > > > > > > > not published to Maven Central.
> > > > > > > >
> > > > > > > > === Meritocracy ===
> > > > > > > >
> > > > > > > > DataSketches was initially developed based on requirements
> > within
> > > > > > Yahoo. As
> > > > > > > > a project on GitHub, DataSketches has received contributions
> > from
> > > > > > numerous
> > > > > > > > individual developers from around the world, dedicated
> research
> > > work
> > > > > > from
> > > > > > > > senior scientists at Amazon and Visa, and academic
> researchers
> > > from
> > > > > > > > Georgetown University, Princeton, and MIT.
> > > > > > > >
> > > > > > > > As a project under incubation, we are committed to expanding
> > our
> > > > > > effort to
> > > > > > > > build an environment which supports a meritocracy. We are
> > > focused on
> > > > > > > > engaging the community and other related projects for support
> > and
> > > > > > > > > contributions. Moreover, we are committed to ensuring that
> > > > > > > > > contributors and committers to DataSketches come from a broad
> > > > > > > > > mix of organizations through a merit-based decision process
> > > > > > > > > during incubation. We believe
> > > strongly
> > > > > in
> > > > > > the
> > > > > > > > DataSketches premise that fulfills the concept of a well
> > > engineered
> > > > > and
> > > > > > > > scientifically rigorous library that implements these
> powerful
> > > > > > algorithms
> > > > > > > > and are committed to growing an inclusive community of
> > > DataSketches
> > > > > > > > contributors and users.
> > > > > > > >
> > > > > > > > === Community ===
> > > > > > > >
> > > > > > > > Yahoo has a long history and active engagement in the Open
> > Source
> > > > > > > > community. Major projects include: Vespa.ai, Bullet, Moloch,
> > > > > Panoptes,
> > > > > > > > Screwdriver.cd, Athenz, HaloDB, Maha, Mendel,
> > TensorFlowOnSpark,
> > > > > > gifshot,
> > > > > > > > fluxible, as well as the creation, contribution and
> incubation
> > of
> > > > > many
> > > > > > > > Apache projects such as Apache Hadoop, Pig, Bookkeeper,
> Oozie,
> > > > > > Zookeeper,
> > > > > > > > Omid, Pulsar, Traffic Server, Storm, Druid, and many more.
> > > > > > > >
> > > > > > > > > Every day, DataSketches is actively used by organizations and
> > > > > > > > > institutions around the world for batch and stream processing
> > > > > > > > > of data. We
> > > > > > > > believe acceptance will allow us to consolidate existing
> > > > > > > > DataSketches-related work, grow the DataSketches community,
> and
> > > > > deepen
> > > > > > > > connections between DataSketches and other open source
> > projects.
> > > > > > > >
> > > > > > > > === Introduction to the Core Developers & Contributors ===
> > > > > > > >
> > > > > > > > The core developers and contributors for DataSketches are
> from
> > > > > diverse
> > > > > > > > backgrounds, but primarily are scientists that love
> engineering
> > > and
> > > > > > > > engineers that love science. A large part of the value we
> bring
> > > comes
> > > > > > from
> > > > > > > > this synthesis.  These individuals have already contributed
> > > > > > substantially
> > > > > > > > to the code, algorithms, and/or mathematical proofs that form
> > the
> > > > > > basis of
> > > > > > > > the library.
> > > > > > > >
> > > > > > > > > This core group also forms the Initial Committers with write
> > > > > > > > > permissions to the repository. Those marked with (*) meet
> > > > > > > > > weekly to plan the research and engineering direction of the
> > > > > > > > > project.
> > > > > > > >
> > > > > > > > ==== Scientists That Love Engineering ====
> > > > > > > >
> > > > > > > > * Eshcar Hillel: Senior Research Scientist, Yahoo Labs,
> Israel.
> > > > > > Interests:
> > > > > > > > distributed systems, scalable systems and platforms for big
> > data
> > > > > > > > > processing, concurrent algorithms and data structures.
> > > > > > > >
> > > > > > > > * Kevin Lang: (*) Distinguished Research Scientist, Yahoo
> Labs,
> > > > > > Sunnyvale,
> > > > > > > > California. Interests: algorithms, theoretical and applied
> > > > > mathematics,
> > > > > > > > encoding and compression theory, theoretical and applied
> > > performance
> > > > > > > > optimization.
> > > > > > > >
> > > > > > > > * Edo Liberty: (*) Director of Research, Head of Amazon AI
> > Labs,
> > > Palo
> > > > > > Alto,
> > > > > > > > > California. Manages the algorithms group at Amazon AI, which
> > > > > > > > > builds scalable machine learning systems and algorithms that
> > > > > > > > > are used both internally and externally by customers of
> > > > > > > > > SageMaker, AWS's flagship machine learning platform.
> > > > > > > >
> > > > > > > > * Jon Malkin: (*) Senior Scientist, Yahoo Labs, Sunnyvale.
> > > Interests:
> > > > > > > > Computational advertising, machine learning, speech
> > recognition,
> > > > > > > > data-driven analysis, large scale experimentation, big data,
> > > > > > stream/complex
> > > > > > > > > event processing.
> > > > > > > >
> > > > > > > > * Justin Thaler: (*) Assistant Professor, Department of
> > Computer
> > > > > > Science,
> > > > > > > > Georgetown University, Washington D.C. Interests: algorithms
> > and
> > > > > > > > computational complexity, complexity theory, quantum
> > algorithms,
> > > > > > private
> > > > > > > > data analysis, and learning theory, developing efficient
> > > streaming
> > > > > and
> > > > > > > > > sketching algorithms.
> > > > > > > >
> > > > > > > > ==== Engineers That Love Science ====
> > > > > > > >
> > > > > > > > * Roman Leventov: Senior Software Engineer,  Metamarkets /
> > Snap.
> > > > > > Interests:
> > > > > > > > design and implementation of data storing and data processing
> > > > > > (distributed)
> > > > > > > > systems, performance optimization, CPU performance,
> mechanical
> > > > > > sympathy,
> > > > > > > > JVM performance, API design, databases, (concurrent) data
> > > structures,
> > > > > > > > memory management, garbage collection algorithms, language
> > > design and
> > > > > > > > runtimes (their tradeoffs), distributed systems (cloud)
> > > efficiency,
> > > > > > Linux,
> > > > > > > > code quality, code transformation, pure functional
> programming
> > > > > models,
> > > > > > > > Haskell.
> > > > > > > >
> > > > > > > > * Lee Rhodes: (*) Distinguished Architect, lead developer and
> > > founder
> > > > > > of
> > > > > > > > the DataSketches project, Yahoo, Sunnyvale, California.
> > > Interests:
> > > > > > > > streaming algorithms, mathematics, computer science, high
> > > quality and
> > > > > > high
> > > > > > > > performance code for the analysis of massive data, bridging
> the
> > > > > divide
> > > > > > > > between theory and practice.
> > > > > > > >
> > > > > > > > * Alexander Saydakov: (*) Senior Software Engineer, Yahoo,
> > > Sunnyvale,
> > > > > > > > California. Interests: applied mathematics, computer science,
> > big
> > > > > data,
> > > > > > > > distributed systems.
> > > > > > > >
> > > > > > > > === Introduction to Additional Interested Contributors ===
> > > > > > > >
> > > > > > > > > These folks have been involved and have contributed
> > > > > > > > > intermittently, and are strong supporters of this project.
> > > > > > > >
> > > > > > > > * Frank Grimes: GitHub ID: frankgrimes97
> > > > > > > >
> > > > > > > > * Mina Ghashami: [mina.ghashami at gmail dot com] Ph.D.
> > Computer
> > > > > > Science,
> > > > > > > > Univ of Utah. Interests: Machine Learning, Data Mining,
> matrix
> > > > > > > > approximation, streaming algorithms, randomized linear
> algebra.
> > > > > > > >
> > > > > > > > * Christopher Musco: [christopher.musco at gmail dot com]
> Ph.D.
> > > > > > Computer
> > > > > > > > Science, Research Instructor, Princeton University.
> Interests:
> > > > > > algorithmic
> > > > > > > > foundations of data science and machine learning, efficient
> > > methods
> > > > > for
> > > > > > > > processing and understanding large datasets, often working at
> > the
> > > > > > > > intersection of theoretical computer science, numerical
> linear
> > > > > > algebra, and
> > > > > > > > optimization.
> > > > > > > >
> > > > > > > > * Graham Cormode: [g.cormode at warwick.ac dot uk] Ph.D.
> > > Computer
> > > > > > Science,
> > > > > > > > Professor, Warwick University, Warwick, England. Interests:
> all
> > > > > > aspects of
> > > > > > > > the "data lifecycle", from data collection and cleaning,
> > through
> > > > > > mining and
> > > > > > > > analytics. (Professor Cormode is one of the world’s leading
> > > > > scientists
> > > > > > in
> > > > > > > > sketching algorithms)
> > > > > > > >
> > > > > > > > === Alignment ===
> > > > > > > >
> > > > > > > > > The DataSketches library already provides integrations and
> > > > > > > > > example code for Apache Hive, Apache Pig, and Apache Spark, and
> > > > > > > > > is deeply integrated into Apache Druid.
> > > > > > > >
> > > > > > > > == Known Risks ==
> > > > > > > >
> > > > > > > > > The following subsections address specific risks that have been
> > > > > > > > > identified by the ASF.
> > > > > > > >
> > > > > > > > === Risk: Orphaned Products ===
> > > > > > > >
> > > > > > > > The DataSketches library is presently used by a number of
> > > > > > organizations,
> > > > > > > > from small startups to Fortune 100 companies, to construct
> > > production
> > > > > > > > pipelines that must process and analyze massive data. Yahoo
> > has a
> > > > > > long-term
> > > > > > > > commitment to continue to advance the DataSketches library;
> > > moreover,
> > > > > > > > DataSketches is seeing increasing interest, development, and
> > > adoption
> > > > > > from
> > > > > > > > many diverse organizations from around the world. Due to its
> > > growing
> > > > > > > > adoption, we feel it is quite unlikely that this project
> would
> > > become
> > > > > > > > orphaned.
> > > > > > > >
> > > > > > > > === Risk: Inexperience with Open Source ===
> > > > > > > >
> > > > > > > > Yahoo believes strongly in open source and the exchange of
> > > > > information
> > > > > > to
> > > > > > > > advance new ideas and work. Examples of this commitment are
> > > active
> > > > > open
> > > > > > > > source projects such as those mentioned above. With
> > > DataSketches, we
> > > > > > have
> > > > > > > > been increasingly open and forward-looking; we have
> published a
> > > > > number
> > > > > > of
> > > > > > > > papers about breakthrough developments in the science of
> > > streaming
> > > > > > > > algorithms (mentioned above) that also reference the
> > DataSketches
> > > > > > library.
> > > > > > > > Our submission to the Apache Software Foundation is a logical
> > > > > > extension of
> > > > > > > > our commitment to open source software.
> > > > > > > >
> > > > > > > > Key committers at Yahoo with strong open source backgrounds
> > > include
> > > > > > Aaron
> > > > > > > > Gresch, Alan Carroll, Alessandro Bellina, Anastasia
> Braginsky,
> > > > > Andrews
> > > > > > > > Sahaya Albert, Arun S A G, Atul Mohan, Brad McMillen, Bryan
> > Call,
> > > > > Daryn
> > > > > > > > Sharp, Dav Glass, David Carlin, Derek Dagit, Eric Payne,
> Eshcar
> > > > > Hillel,
> > > > > > > > > Ethan Li, Fei Deng, Francis Christopher Liu, Francisco
> > > > > > > > > Perez-Sorrosal, Gil Yehuda, Govind Menon, Hang Yang, Jacob
> > > > > > > > > Estelle, Jai Asher,
> > James
> > > > > > Penick,
> > > > > > > > Jason Kenny, Jay Pipes, Jim Rollenhagen, Joe Francis, Jon
> > Eagles,
> > > > > > Kihwal
> > > > > > > > Lee, Kishorkumar Patil, Koji Noguchi, Kuhu Shukla, Michael
> > > Trelinski,
> > > > > > > > Mithun Radhakrishnan, Nathan Roberts, Ohad Shacham, Olga L.
> > > > > Natkovich,
> > > > > > > > Parth Kamlesh Gandhi, Rajan Dhabalia, Rohini Palaniswamy,
> Ruby
> > > Loo,
> > > > > > Ryan
> > > > > > > > Bridges, Sanket Chintapalli, Satish Subhashrao Saley, Shu Kit
> > > Chan,
> > > > > Sri
> > > > > > > > Harsha Mekala, Susan Hinrichs, Yonatan Gottesman, and many
> > more.
> > > > > > > >
> > > > > > > > > All of our core developers are committed to learning about the
> > > > > > > > > Apache process and to giving back to the community.
> > > > > > > >
> > > > > > > > === Risk: Homogeneous Developers ===
> > > > > > > >
> > > > > > > > > The majority of committers in this proposal work at Yahoo
> > > > > > > > > because DataSketches emerged from an internal Yahoo project.
> > > > > > > > > This proposal also includes developers and contributors from
> > > > > > > > > other companies who are actively involved with other Apache
> > > > > > > > > projects, such as Druid.
> > We
> > > > > > expect our
> > > > > > > > entry into incubation will allow us to expand the number of
> > > > > > individuals and
> > > > > > > > organizations participating in DataSketches development.
> > > > > > > >
> > > > > > > > === Risk: Reliance on Salaried Developers ===
> > > > > > > >
> > > > > > > > Because the DataSketches library originated within Yahoo, it
> > has
> > > been
> > > > > > > > developed primarily by salaried Yahoo developers and we
> expect
> > > that
> > > > > to
> > > > > > > > continue to be the case near term. However, since we placed
> > this
> > > > > > library
> > > > > > > > into open-source we have had a number of significant
> > > contributions
> > > > > from
> > > > > > > > engineers and scientists from outside of Yahoo. We expect our
> > > > > reliance
> > > > > > on
> > > > > > > > Yahoo salaried developers will decrease over time.
> Nonetheless,
> > > Yahoo
> > > > > > is
> > > > > > > > committed to continue its strong support of this important
> > > project.
> > > > > > > >
> > > > > > > > === Risk: Lack of Relationship to other Apache Products ===
> > > > > > > >
> > > > > > > > DataSketches already directly interoperates with or utilizes
> > > several
> > > > > > > > existing Apache projects.
> > > > > > > >
> > > > > > > > * Build
> > > > > > > >    * Apache Maven
> > > > > > > >
> > > > > > > > * Integrations and adaptors for the following projects
> > naturally
> > > have
> > > > > > them
> > > > > > > > as dependencies
> > > > > > > >    * Apache Hive
> > > > > > > >    * Apache Pig
> > > > > > > >    * Apache Druid
> > > > > > > >    * Apache Spark
> > > > > > > >
> > > > > > > > * Additional dependencies for the above integrations and
> > adaptors
> > > > > > include
> > > > > > > >    * Apache Hadoop
> > > > > > > >    * Apache Commons (Math)
> > > > > > > >
> > > > > > > > There is no other Apache project that we are aware of that
> > > duplicates
> > > > > > the
> > > > > > > > functionality of the DataSketches library.
> > > > > > > >
> > > > > > > > === Risk: An Excessive Fascination with the Apache Brand ===
> > > > > > > >
> > > > > > > > With this proposal we are not seeking attention or publicity.
> > > Rather,
> > > > > > we
> > > > > > > > firmly believe in the DataSketches library and concept and
> the
> > > > > ability
> > > > > > to
> > > > > > > > make the DataSketches library a powerful, yet simple-to-use
> > > toolkit
> > > > > for
> > > > > > > > data processing. While the DataSketches library has been open
> > > source,
> > > > > > we
> > > > > > > > believe putting code on GitHub can only go so far. We see the
> > > Apache
> > > > > > > > community, processes, and mission as critical for ensuring
> the
> > > > > > DataSketches
> > > > > > > > library is truly community-driven, positively impactful, and
> > > > > innovative
> > > > > > > > open source software. While Yahoo has taken a number of steps
> > to
> > > > > > advance
> > > > > > > > its various open source projects, we believe the DataSketches
> > > library
> > > > > > > > project is a great fit for the Apache Software Foundation due
> > to
> > > its
> > > > > > focus
> > > > > > > > on data processing and its relationships to existing ASF
> > > projects.
> > > > > > > >
> > > > > > > > === Risk: Cryptography ===
> > > > > > > >
> > > > > > > > DataSketches does not contain any cryptographic code and is
> > not a
> > > > > > > > cryptographic product.
> > > > > > > >
> > > > > > > > == Documentation ==
> > > > > > > >
> > > > > > > > The following documentation is relevant to this proposal.
> > > Relevant
> > > > > > portions
> > > > > > > > of the documentation will be contributed to the Apache
> > > DataSketches
> > > > > > > > project.
> > > > > > > >
> > > > > > > > * DataSketches website: https://datasketches.github.io.
> > > > > > > >
> > > > > > > > * DataSketches website repository:
> > > > > > > > https://github.com/DataSketches/DataSketches.github.io
> > > > > > > >
> > > > > > > > > We will need an Apache website for this documentation similar
> > to
> > > > > > > >
> > > > > > > > * https://datasketches.apache.org
> > > > > > > >
> > > > > > > > == Initial Source ==
> > > > > > > >
> > > > > > > > The initial source for DataSketches which we will submit to
> the
> > > > > Apache
> > > > > > > > Foundation will include a number of repositories which are
> > > currently
> > > > > > hosted
> > > > > > > > under the GitHub.com/datasketches organization:
> > > > > > > >
> > > > > > > > All github.com/datasketches repositories including:
> > > > > > > >
> > > > > > > > * Java
> > > > > > > >    * sketches-core: This repository has the core sketching
> > > classes,
> > > > > > which
> > > > > > > > are leveraged by some of the other repositories. This
> > repository
> > > has
> > > > > no
> > > > > > > > external dependencies outside of the DataSketches/memory
> > > repository,
> > > > > > Java
> > > > > > > > and TestNG for unit tests. This code is versioned and the
> > latest
> > > > > > release
> > > > > > > > can be obtained from Maven Central.
> > > > > > > >    * memory: Low level, high-performance memory
> data-structure
> > > > > > management
> > > > > > > > primarily for off-heap.
> > > > > > > >    * sketches-android: This is a new repository dedicated to
> > > sketches
> > > > > > > > designed to be run in a mobile client, such as a cell phone.
> It
> > > is
> > > > > > still in
> > > > > > > > development and should be considered experimental.
> > > > > > > >    * sketches-hive: This repository contains Hive UDFs and
> > UDAFs
> > > for
> > > > > > use
> > > > > > > > within Hadoop grid environments. This code has dependencies
> on
> > > > > > > > sketches-core as well as Hadoop and Hive. Users of this code
> > are
> > > > > > advised to
> > > > > > > > use Maven to bring in all the required dependencies. This
> code
> > is
> > > > > > versioned
> > > > > > > > and the latest release can be obtained from Maven Central.
> > > > > > > >    * sketches-pig: This repository contains Pig User Defined
> > > > > Functions
> > > > > > > > (UDF) for use within Hadoop grid environments. This code has
> > > > > > dependencies
> > > > > > > > on sketches-core as well as Hadoop and Pig. Users of this
> code
> > > are
> > > > > > advised
> > > > > > > > to use Maven to bring in all the required dependencies. This
> > > code is
> > > > > > > > versioned and the latest release can be obtained from Maven
> > > Central.
> > > > > > > >    * sketches-vector: This is a new repository dedicated to
> > > sketches
> > > > > > for
> > > > > > > > vector and matrix operations. It is still somewhat
> > experimental.
> > > > > > > >    * characterization: This relatively new repository is for
> > code
> > > > > that
> > > > > > we
> > > > > > > > use to characterize the accuracy and speed performance of the
> > > > > sketches
> > > > > > in
> > > > > > > > the library and is constantly being updated. Examples of the
> > job
> > > > > > command
> > > > > > > > files used for various tests can be found in the
> > > src/main/resources
> > > > > > > > directory. Some of these tests can run for hours depending on
> > their
> > > > > > > > configuration.
> > > > > > > >    * experimental: This repository is an experimental staging
> > > area
> > > > > for
> > > > > > code
> > > > > > > > that will eventually end up in another repository. This code
> is
> > > not
> > > > > > > > versioned and not registered with Maven Central.
> > > > > > > >    * sketches-misc: Demos and other code not related to
> > > production
> > > > > > > > deployment
> > > > > > > >
> > > > > > > > * C++ and Python
> > > > > > > >    * sketches-core-cpp: This is the C++/Python companion to
> the
> > > Java
> > > > > > > > sketches-core. These implementations are binary compatible
> with
> > > their
> > > > > > > > counterparts in Java. In other words, a sketch created and
> > > stored in
> > > > > > C++
> > > > > > > > > can be opened and read in Java and vice versa. This site also
> > > has our
> > > > > > > > Python adaptors that basically wrap the C++ implementations,
> > > making
> > > > > the
> > > > > > > > high performance C++ implementations available from Python.
> > > > > > > >    * sketches-postgres: This site provides the
> > postgres-specific
> > > > > > adaptors
> > > > > > > > that wrap the C++ implementations making them available to
> the
> > > > > Postgres
> > > > > > > > database users.
> > > > > > > >    * characterization-cpp: This is the C++/Python companion
> to
> > > the
> > > > > Java
> > > > > > > > characterization repository.
> > > > > > > >    * experimental-cpp: This repository is an experimental
> > staging
> > > > > area
> > > > > > for
> > > > > > > > C++ code that will eventually end up in another repository.
> > > > > > > >
> > > > > > > > * Command-Line Tools
> > > > > > > >    * sketches-cmd
> > > > > > > >    * homebrew-sketches
> > > > > > > >    * homebrew-sketches-cmd
> > > > > > > >
> > > > > > > > These projects have always been Apache 2.0 licensed. We
> intend
> > to
> > > > > > bundle
> > > > > > > > all of these repositories since they are all complementary
> and
> > > should
> > > > > > be
> > > > > > > > maintained in one project. Prior to our submission, we will
> > > combine
> > > > > > all of
> > > > > > > > these projects into a new git repository.
> > > > > > > >
> > > > > > > > == Source and Intellectual Property Submission Plan ==
> > > > > > > >
> > > > > > > > > Contributors to the DataSketches project have also signed the Yahoo
> > > > > > > > > Individual Contributor License Agreement
> > > > > > > > > (https://yahoocla.herokuapp.com/) in order to contribute to the
> > > > > > > > > project.
> > > > > > > >
> > > > > > > > With respect to trademark rights, Yahoo does not hold a
> > > trademark on
> > > > > > the
> > > > > > > > phrase “DataSketches.” Based on feedback and guidance we
> > receive
> > > > > > during the
> > > > > > > > incubation process, we are open to renaming the project if
> > > necessary
> > > > > > for
> > > > > > > > trademark or other concerns, but we would prefer not to have
> to
> > > do
> > > > > > that.
> > > > > > > >
> > > > > > > > == External Dependencies ==
> > > > > > > >
> > > > > > > > All external dependencies are licensed under an Apache 2.0 or
> > > > > > > > Apache-compatible license. As we grow the DataSketches
> > community
> > > we
> > > > > > will
> > > > > > > > > configure our build process to require and validate that all
> > > contributions
> > > > > > and
> > > > > > > > dependencies are licensed under the Apache 2.0 license or are
> > > under
> > > > > an
> > > > > > > > Apache-compatible license.
> > > > > > > >
> > > > > > > > == Required Resources ==
> > > > > > > >
> > > > > > > > === Mailing Lists ===
> > > > > > > >
> > > > > > > > We currently use a mix of mailing lists. We will migrate our
> > > existing
> > > > > > > > mailing lists to the following:
> > > > > > > >
> > > > > > > > * dev@datasketches.incubator.apache.org
> > > > > > > >
> > > > > > > > * user@datasketches.incubator.apache.org
> > > > > > > >
> > > > > > > > * private@datasketches.incubator.apache.org
> > > > > > > >
> > > > > > > > * commits@datasketches.incubator.apache.org
> > > > > > > >
> > > > > > > > === Source Control ===
> > > > > > > >
> > > > > > > > The DataSketches team currently uses Git and would like to
> > > continue
> > > > > to
> > > > > > do
> > > > > > > > so. We request a Git repository for DataSketches with
> mirroring
> > > to
> > > > > > GitHub
> > > > > > > > > enabled similar to the following:
> > > > > > > >
> > > > > > > > * https://github.com/apache/incubator-datasketches.git
> > > > > > > >
> > > > > > > > === Issue Tracking ===
> > > > > > > >
> > > > > > > > We request the creation of an Apache-hosted JIRA. The
> > > DataSketches
> > > > > > project
> > > > > > > > is currently using the public GitHub issue tracker and the
> > public
> > > > > > Google
> > > > > > > > Groups forum/sketches-user for issue tracking and
> discussions.
> > We
> > > > > will
> > > > > > > > migrate and combine from these two sources to the Apache
> JIRA.
> > > > > > > >
> > > > > > > > Proposed Jira ID: DATASKETCHES
> > > > > > > >
> > > > > > > > == Initial Committers ==
> > > > > > > >
> > > > > > > > The following list of individuals have been extremely active
> in
> > > our
> > > > > > > > community and should have write (commit) permissions to the
> > > > > repository.
> > > > > > > >
> > > > > > > > * Eshcar Hillel                      [eshcar at verizonmedia
> > dot
> > > com]
> > > > > > > >
> > > > > > > > * Kevin Lang                    [langk at verizonmedia dot
> com]
> > > > > > > >
> > > > > > > > * Roman Leventov              [roman.leventov at
> c.metamarkets
> > > dot
> > > > > com]
> > > > > > > >
> > > > > > > > * Edo Liberty                   [libertye at amazon dot com]
> > > > > > > >
> > > > > > > > * Jon Malkin                    [jmalkin at verizonmedia dot
> > com]
> > > > > > > >
> > > > > > > > * Lee Rhodes                  [lrhodes at verizonmedia dot
> > com] &
> > > > > > [leerho
> > > > > > > > at gmail dot com]
> > > > > > > >
> > > > > > > > * Alexander Saydakov         [saydakov at verizonmedia dot
> com]
> > > > > > > >
> > > > > > > > * Justin Thaler                 [justin.thaler at georgetown
> > dot
> > > edu]
> > > > > > > >
> > > > > > > > == Affiliations ==
> > > > > > > >
> > > > > > > > The initial committers are from four organizations: Yahoo,
> > > Amazon,
> > > > > > > > Georgetown University, and Metamarkets/Snap.
> > > > > > > >
> > > > > > > > === Champion ===
> > > > > > > > (Recommended to me: )
> > > > > > > >
> > > > > > > > Liang Chen, Vice President of Apache CarbonData,
> [chenliang613
> > at
> > > > > > apache
> > > > > > > > dot org]
> > > > > > > > > Jean-Baptiste Onofré, [jb at nanthrax dot net]
> > > > > > > >
> > > > > > > > === Nominated Mentors ===
> > > > > > > > (Recommended to me: )
> > > > > > > >
> > > > > > > > Liang Chen, Vice President of Apache CarbonData,
> [chenliang613
> > at
> > > > > > apache
> > > > > > > > dot org]
> > > > > > > > Jean-Baptiste Onofré, jb at nanthrax dot net
> > > > > > > > Gil Yehuda, gyehuda at verizonmedia dot com
> > > > > > > >
> > > > > > > > === Sponsoring Entity ===
> > > > > > > >
> > > > > > > > * The Apache Incubator    **** This is our 1st choice ****
> > > > > > > >
> > > > > > > > * Apache Druid. The incubating Apache Druid project might
> also
> > > be a
> > > > > > logical
> > > > > > > > sponsor. However, DataSketches has applications in many areas
> > of
> > > > > > computing
> > > > > > > > outside of Druid so our preference and recommendation is that
> > > > > > DataSketches
> > > > > > > > would ultimately be a top-level Apache project.
> > > > > > > >
> > > > > > > > ________________
> > > > > > > > [1] In 2017 Verizon acquired Yahoo and merged it with
> > previously
> > > > > > acquired
> > > > > > > > AOL. The merged entity was originally called Oath, Inc., but
> > has
> > > > > > recently
> > > > > > > > been renamed Verizon Media, Inc., a wholly-owned subsidiary
> of
> > > > > Verizon,
> > > > > > > > Inc.  Since Yahoo is the more recognized name, references in
> > this
> > > > > > document
> > > > > > > > > to Yahoo are also references to Verizon Media, Inc.
> > > > > > > >
> > > > > > > > On Fri, Feb 22, 2019 at 9:35 PM Kenneth Knowles <
> > kenn@apache.org
> > > >
> > > > > > wrote:
> > > > > > > >
> > > > > > > > > The subject line has me interested already. Follow examples
> > > like
> > > > > this
> > > > > > > > > maybe?
> > > > > > > > >
> > > > > > > > > 1.
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > >
> > > > >
> > >
> >
> https://lists.apache.org/thread.html/a5db74cc9e5ae89b3bfa5f4b07bfcc18dae84b7098232fb897cd47b7@%3Cgeneral.incubator.apache.org%3E
> > > > > > > > > 2.
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > >
> > > > >
> > >
> >
> https://lists.apache.org/thread.html/5a7f6a218b11a1cac61fbd53f4c995fd7716f8ad3751cf9f171ebd57@%3Cgeneral.incubator.apache.org%3E
> > > > > > > > >
> > > > > > > > > Kenn
> > > > > > > > >
> > > > > > > > > On Fri, Feb 22, 2019 at 8:05 PM leerho <le...@gmail.com>
> > > wrote:
> > > > > > > > >
> > > > > > > > > > I'll try again ... :)
> > > > > > > > > >
> > > > > > > > > > On Fri, Feb 22, 2019 at 8:00 PM Ted Dunning <
> > > > > ted.dunning@gmail.com
> > > > > > >
> > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > >> It didn't make it again
> > > > > > > > > >>
> > > > > > > > > >> On Fri, Feb 22, 2019, 8:35 PM leerho <le...@gmail.com>
> > > wrote:
> > > > > > > > > >>
> > > > > > > > > >> > I'm not sure the attached document made it through.
> > > > > > > > > >> >
> > > > > > > > > >> > On Fri, Feb 22, 2019 at 7:28 PM leerho <
> > leerho@gmail.com>
> > > > > > wrote:
> > > > > > > > > >> >
> > > > > > > > > >> > >
> > > > > > > > > >> > >
> > > > > > > > > >> >
> > > > > > > > > >>
> > > > > > > > > >
> > > > > > > > > >
> > > > > >
> > ---------------------------------------------------------------------
> > > > > > > > > > To unsubscribe, e-mail:
> > > general-unsubscribe@incubator.apache.org
> > > > > > > > > > For additional commands, e-mail:
> > > > > general-help@incubator.apache.org
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > > --
> > > > From my cell phone.
> > > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> > > For additional commands, e-mail: general-help@incubator.apache.org
> > >
> > >
> >
>
-- 
Sent from my Mobile device

Re: DataSketches Proposal - Google Docs Link

Posted by leerho <le...@gmail.com>.

Thank you!


On Mon, Feb 25, 2019 at 9:37 PM Kenneth Knowles <ke...@apache.org> wrote:

> It isn't too much work, so I've done it:
> https://s.apache.org/datasketches-proposal-draft
>
> Kenn
>
> On Mon, Feb 25, 2019 at 9:31 PM leerho <le...@gmail.com> wrote:
>
> > Yes, I thought of that.  But it’s not like I’m being overwhelmed with
> > requests to comment ... so far it has been only 3 or 4, and the requested
> > changes have been minor.  I’m assuming that if there are no more
> > substantive changes after this week that the document would be moved to
> the
> > wiki archive, where, I presume, changes could still be made.
> >
> > I want to do the right thing here, so if you feel that the document would
> > get much better feedback on an unrestricted gDoc site, I will set it up.
> >
> >
> >
> > On Mon, Feb 25, 2019 at 8:32 PM Jim Apple <jb...@cloudera.com.invalid>
> > wrote:
> >
> > > You could use a Google account that is not under Yahoo’s control, then
> > let
> > > anyone in the world add a comment, maybe.
> > >
> > > On Mon, Feb 25, 2019 at 3:26 PM leerho <le...@gmail.com> wrote:
> > >
> > > > Ken,
> > > > Yahoo does not allow me to create a shared link outside our company,
> > > except
> > > > to individual email addresses.  So attempting to share it to the
> email
> > > > general@incubator.apache.org may not work.  Nonetheless, several
> > > > individuals were able to request access using their individual email
> > > > accounts and I was able to add them.  I will try to add you using
> > > > kenn@apache.org, but if that doesn't work, I may need a gmail or
> > > > equivalent
> > > > account for you.
> > > >
> > > > Lee.
> > > >
> > > >
> > > > On Mon, Feb 25, 2019 at 2:59 PM Kenneth Knowles <ke...@apache.org>
> > wrote:
> > > >
> > > > > I could not access that document. I suggest you need to turn on
> link
> > > > > sharing.
> > > > >
> > > > > Kenn
> > > > >
> > > > > On Mon, Feb 25, 2019 at 12:00 PM leerho@gmail.com <
> leerho@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Try this link:
> > > > > >
> > > > >
> > > >
> > >
> >
> https://docs.google.com/document/d/19JKevzFQNcaLA51LFLUlP1hzdFDW7oDJrJO8N6weDv8/edit?usp=sharing
> > > > > >
> > > > > >
> > > > > > On 2019/02/25 05:55:50, leerho <le...@gmail.com> wrote:
> > > > > > > Yes I will try that tomorrow.
> > > > > > >
> > > > > > > On Sun, Feb 24, 2019 at 7:34 PM Kenneth Knowles <
> kenn@apache.org
> > >
> > > > > wrote:
> > > > > > >
> > > > > > > > Can you share the Google doc with the proposal? Per Ted's
> > advice,
> > > > we
> > > > > > can
> > > > > > > > iterate quickly there and move it to the wiki when it
> becomes a
> > > bit
> > > > > > more
> > > > > > > > stable.
> > > > > > > >
> > > > > > > > Kenn
> > > > > > > >
> > > > > > > > On Fri, Feb 22, 2019 at 10:21 PM leerho@gmail.com <
> > > > leerho@gmail.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > > Thanks for the offer.  I am a neophyte at this process and
> > > email
> > > > > > app!   I
> > > > > > > > > could use a lot of help getting this off the ground!  Also,
> > I'm
> > > > not
> > > > > > sure
> > > > > > > > > that Mr. Chen and Mr. Onofré have fully accepted taking
> this
> > on
> > > > :)
> > > > > > > > >
> > > > > > > > > Lee.
> > > > > > > > >
> > > > > > > > > On 2019/02/23 06:03:58, Kenneth Knowles <ke...@apache.org>
> > > wrote:
> > > > > > > > > > Nice.
> > > > > > > > > >
> > > > > > > > > > I would very much like to help mentor this project,
> though
> > > you
> > > > > > already
> > > > > > > > > have
> > > > > > > > > > a couple good ones.
> > > > > > > > > >
> > > > > > > > > > I concur with incubator as sponsoring entity.
> > > > > > > > > >
> > > > > > > > > > Kenn (VP Apache Beam)
> > > > > > > > > >
> > > > > > > > > > On Fri, Feb 22, 2019 at 9:45 PM leerho <leerho@gmail.com
> >
> > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > I didn't realize that this mail list does not accept
> PDF
> > > > files,
> > > > > > > > > apparently
> > > > > > > > > > > only text.  So let me try one more time ... :)  Please
> > let
> > > me
> > > > > > know if
> > > > > > > > > > > this works!
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > = Apache DataSketches Proposal[1] =
> > > > > > > > > > >
> > > > > > > > > > > == Abstract ==
> > > > > > > > > > >
> > > > > > > > > > > DataSketches.GitHub.io is an open source,
> > high-performance
> > > > > > library
> > > > > > > > of
> > > > > > > > > > > stochastic streaming algorithms commonly called
> > "sketches"
> > > in
> > > > > the
> > > > > > > > data
> > > > > > > > > > > sciences. Sketches are small, stateful programs that
> > > process
> > > > > > massive
> > > > > > > > > data
> > > > > > > > > > > as a stream and can provide approximate answers, with
> > > > > > mathematical
> > > > > > > > > > > guarantees, to computationally difficult queries
> > > > > > orders-of-magnitude
> > > > > > > > > faster
> > > > > > > > > > > than traditional, exact methods.
> > > > > > > > > > >
> > > > > > > > > > > This proposal is to move DataSketches to the Apache
> > > Software
> > > > > > > > > > > > Foundation (ASF), transferring ownership of its copyright
> > > > > > intellectual
> > > > > > > > > > > property to the ASF.  Thereafter, DataSketches would be
> > > > > > officially
> > > > > > > > > known as
> > > > > > > > > > > Apache DataSketches and its evolution and governance
> > would
> > > > come
> > > > > > under
> > > > > > > > > the
> > > > > > > > > > > rules and guidance of the ASF.
> > > > > > > > > > >
> > > > > > > > > > > == Introduction ==
> > > > > > > > > > >
> > > > > > > > > > > The DataSketches library contains carefully crafted
> > > > > > implementations
> > > > > > > > of
> > > > > > > > > > > sketch algorithms that meet rigorous standards of
> quality
> > > and
> > > > > > > > > performance
> > > > > > > > > > > and provide capabilities required for large-scale
> > > production
> > > > > > systems
> > > > > > > > > that
> > > > > > > > > > > must process and analyze massive data. The DataSketches
> > > core
> > > > > > > > > repository is
> > > > > > > > > > > written in Java with a parallel core repository written
> > in
> > > > C++
> > > > > > that
> > > > > > > > > > > includes Python wrappers. The DataSketches library also
> > > > > includes
> > > > > > > > > special
> > > > > > > > > > > repositories for extending the core library for Apache
> > Hive
> > > > and
> > > > > > > > Apache
> > > > > > > > > Pig.
> > > > > > > > > > > The sketches developed in the different languages
> share a
> > > > > common
> > > > > > > > binary
> > > > > > > > > > > storage format so that sketches created and stored in
> > Java,
> > > > for
> > > > > > > > > example,
> > > > > > > > > > > > can be fully used in C++, and vice versa.  Because the
> > > stored
> > > > > > sketch
> > > > > > > > > > > "images" are just a "blob" of bytes (similar to picture
> > > > > images),
> > > > > > they
> > > > > > > > > can
> > > > > > > > > > > be shared across many different systems, languages and
> > > > > platforms.
> > > > > > > > > > >
> > > > > > > > > > > The DataSketches documentation website,
> > > > > > > > https://datasketches.github.io
> > > > > > > > > ,
> > > > > > > > > > > includes general tutorials, a comprehensive research
> > > section
> > > > > with
> > > > > > > > > > > references to relevant academic papers, extensive
> > examples
> > > > for
> > > > > > using
> > > > > > > > > the
> > > > > > > > > > > core library directly as well as examples for accessing
> > the
> > > > > > library
> > > > > > > > in
> > > > > > > > > > > Hive, Pig, and Apache Spark.
> > > > > > > > > > >
> > > > > > > > > > > The DataSketches library also includes a
> characterization
> > > > > > repository
> > > > > > > > > for
> > > > > > > > > > > long running test programs that are used for studying
> > > > accuracy
> > > > > > and
> > > > > > > > > > > performance of these sketches over wide ranges of input
> > > > > > variables.
> > > > > > > > The
> > > > > > > > > data
> > > > > > > > > > > produced by these programs is used for generating the
> > many
> > > > > > > > performance
> > > > > > > > > > > plots contained in the documentation website and for
> > > academic
> > > > > > > > > > > publications.
> > > > > > > > > > >
> > > > > > > > > > > The code repositories used for production are versioned
> > and
> > > > > > published
> > > > > > > > > to
> > > > > > > > > > > Maven Central on periodic intervals as the library
> > evolves.
> > > > > > > > > > >
> > > > > > > > > > > The DataSketches library also includes several
> > experimental
> > > > > > > > > repositories
> > > > > > > > > > > for use-cases outside the large-scale systems
> > environments,
> > > > > such
> > > > > > as
> > > > > > > > > > > sketches for mobile, IoT devices (Android),
> command-line
> > > > access
> > > > > > of
> > > > > > > > the
> > > > > > > > > > > sketch library, and an experimental repository for
> > > > vector-based
> > > > > > > > > sketches
> > > > > > > > > > > that performs approximate Singular Value Decomposition
> > > (SVD)
> > > > > > analysis
> > > > > > > > > that
> > > > > > > > > > > could potentially be used in Machine Learning (ML)
> > > > > applications.
> > > > > > > > > > >
> > > > > > > > > > > == Background ==
> > > > > > > > > > >
> > > > > > > > > > > > The DataSketches library was started in 2012 as an
> internal
> > > > Yahoo
> > > > > > > > project
> > > > > > > > > to
> > > > > > > > > > > dramatically reduce time and resources required for
> > > distinct
> > > > > > (unique)
> > > > > > > > > > > counting.  An extensive search on the Internet at the
> > time
> > > > > > yielded a
> > > > > > > > > number
> > > > > > > > > > > of theoretical papers on stochastic streaming
> algorithms
> > > with
> > > > > > > > > pseudocode
> > > > > > > > > > > examples, but we did not find any usable open-source
> code
> > > of
> > > > > the
> > > > > > > > > quality we
> > > > > > > > > > > felt we needed for our internal production systems.  So
> > we
> > > > > > started a
> > > > > > > > > small
> > > > > > > > > > > project (one person) to develop our own sketches
> working
> > > > > directly
> > > > > > > > from
> > > > > > > > > > > published theoretical papers.
> > > > > > > > > > >
> > > > > > > > > > > The DataSketches library was designed from the start
> with
> > > the
> > > > > > > > > objective of
> > > > > > > > > > > making these algorithms, usually only described in
> > > > theoretical
> > > > > > > > papers,
> > > > > > > > > > > easily accessible to systems developers for use in our
> > > > internal
> > > > > > > > > production
> > > > > > > > > > > systems. By necessity, the code had to be of the
> highest
> > > > > quality
> > > > > > and
> > > > > > > > > > > thoroughly tested. The wide variety of our internal
> > > > production
> > > > > > > > systems
> > > > > > > > > > > drove the requirement that the sketch implementations
> had
> > > to
> > > > > > have an
> > > > > > > > > > > absolute minimum of external, run-time dependencies in
> > > order
> > > > to
> > > > > > > > > simplify
> > > > > > > > > > > integration and troubleshooting.
> > > > > > > > > > >
> > > > > > > > > > > Our internal experiments demonstrated dramatic positive
> > > > impact
> > > > > > on the
> > > > > > > > > > > performance of our systems.  As a result, the
> > DataSketches
> > > > > > library
> > > > > > > > > quickly
> > > > > > > > > > > evolved to include different types of sketches for
> > > different
> > > > > > types of
> > > > > > > > > > > > queries, such as frequent-items (a.k.a. heavy-hitters)
> > > > > > algorithms,
> > > > > > > > > > > quantile/histogram algorithms, and weighted and
> > unweighted
> > > > > > sampling
> > > > > > > > > > > algorithms.
> > > > > > > > > > >
> > > > > > > > > > > We quickly discovered that developing these sketch
> > > algorithms
> > > > > to
> > > > > > be
> > > > > > > > > truly
> > > > > > > > > > > robust in production environments is quite difficult
> and
> > > > > requires
> > > > > > > > deep
> > > > > > > > > > > understanding of the underlying mathematics and
> > statistics
> > > as
> > > > > > well as
> > > > > > > > > > > extensive experience in developing high quality code
> for
> > > 24/7
> > > > > > > > > production
> > > > > > > > > > > systems. This is a difficult combination of skills for
> > any
> > > > one
> > > > > > > > > organization
> > > > > > > > > > > to collect and maintain over time. It became clear that
> > > this
> > > > > > > > technology
> > > > > > > > > > > needed a community larger than Yahoo to evolve.  In
> > > November,
> > > > > > 2015,
> > > > > > > > > this
> > > > > > > > > > > factor, along with Yahoo’s strong experience and
> support
> > of
> > > > > open
> > > > > > > > > source,
> > > > > > > > > > > led to the decision to open source this technology
> under
> > an
> > > > > > Apache
> > > > > > > > 2.0
> > > > > > > > > > > license on GitHub. Since that time our community has
> > > expanded
> > > > > > > > > considerably
> > > > > > > > > > > > and the key contributors to this effort include
> leading
> > > > > research
> > > > > > > > > > > scientists from a number of universities as well as
> > > > > > practitioners and
> > > > > > > > > > > researchers from a number of major corporations. The
> core
> > > of
> > > > > this
> > > > > > > > > group is
> > > > > > > > > > > very active as we meet weekly to discuss research
> > > directions
> > > > > and
> > > > > > > > > > > engineering priorities.
> > > > > > > > > > >
> > > > > > > > > > > It is important to note that our internal systems at
> > Yahoo
> > > > use
> > > > > > the
> > > > > > > > > current
> > > > > > > > > > > public GitHub open source DataSketches library and not
> an
> > > > > > internal
> > > > > > > > > version
> > > > > > > > > > > of the code.
> > > > > > > > > > >
> > > > > > > > > > > The close collaboration of scientific research and
> > > > engineering
> > > > > > > > > development
> > > > > > > > > > > experience with actual massive-data processing systems
> > has
> > > > also
> > > > > > > > > produced
> > > > > > > > > > > new research publications in the field of stochastic
> > > > streaming
> > > > > > > > > algorithms,
> > > > > > > > > > > for example:
> > > > > > > > > > >
> > > > > > > > > > > * Daniel Anderson, Pryce Bevan, Kevin J. Lang, Edo
> > Liberty,
> > > > Lee
> > > > > > > > > Rhodes, and
> > > > > > > > > > > Justin Thaler. A high-performance algorithm for
> > identifying
> > > > > > frequent
> > > > > > > > > items
> > > > > > > > > > > in data streams. In ACM IMC 2017.
> > > > > > > > > > >
> > > > > > > > > > > * Anirban Dasgupta, Kevin J. Lang, Lee Rhodes, and
> Justin
> > > > > > Thaler. A
> > > > > > > > > > > framework for estimating stream expression
> cardinalities.
> > > In
> > > > > > > > > EDBT/ICDT
> > > > > > > > > > > > Proceedings ‘16, pages 6:1–6:17, 2016.
> > > > > > > > > > >
> > > > > > > > > > > * Mina Ghashami, Edo Liberty, Jeff M. Phillips.
> Efficient
> > > > > > Frequent
> > > > > > > > > > > Directions Algorithm for Sparse Matrices. In ACM SIGKDD
> > > > > > Proceedings
> > > > > > > > > ‘16,
> > > > > > > > > > > pages 845-854, 2016.
> > > > > > > > > > >
> > > > > > > > > > > * Zohar S. Karnin, Kevin J. Lang, and Edo Liberty.
> > Optimal
> > > > > > quantile
> > > > > > > > > > > approximation in streams. In IEEE FOCS Proceedings ‘16,
> > > pages
> > > > > > 71–78,
> > > > > > > > > 2016.
> > > > > > > > > > >
> > > > > > > > > > > * Kevin J Lang. Back to the future: an even more nearly
> > > > optimal
> > > > > > > > > cardinality
> > > > > > > > > > > estimation algorithm. arXiv preprint
> > > > > > > > https://arxiv.org/abs/1708.06839,
> > > > > > > > > > > 2017.
> > > > > > > > > > >
> > > > > > > > > > > * Edo Liberty. Simple and deterministic matrix
> sketching.
> > > In
> > > > > ACM
> > > > > > KDD
> > > > > > > > > > > > Proceedings ‘13, pages 581–588, 2013.
> > > > > > > > > > >
> > > > > > > > > > > * Edo Liberty, Michael Mitzenmacher, Justin Thaler, and
> > > > > Jonathan
> > > > > > > > > Ullman.
> > > > > > > > > > > Space lower bounds for itemset frequency sketches. In
> ACM
> > > > PODS
> > > > > > > > > Proceedings
> > > > > > > > > > > ‘16, pages 441–454, 2016.
> > > > > > > > > > >
> > > > > > > > > > > * Michael Mitzenmacher, Thomas Steinke, and Justin
> > Thaler.
> > > > > > > > Hierarchical
> > > > > > > > > > > heavy hitters with the space saving algorithm. In SIAM
> > > ALENEX
> > > > > > > > > Proceedings
> > > > > > > > > > > ‘12, pages 160–174, 2012.
> > > > > > > > > > >
> > > > > > > > > > > == The Rationale for Sketches ==
> > > > > > > > > > >
> > > > > > > > > > > In the analysis of big data there are often problem
> > queries
> > > > > that
> > > > > > > > don’t
> > > > > > > > > > > scale because they require huge compute resources and
> > time
> > > to
> > > > > > > > generate
> > > > > > > > > > > exact results. Examples include count distinct,
> > quantiles,
> > > > most
> > > > > > > > > frequent
> > > > > > > > > > > items, joins, matrix computations, and graph analysis.
> > > > > > > > > > >
> > > > > > > > > > > If we can loosen the requirement of “exact” results
> from
> > > our
> > > > > > queries
> > > > > > > > > and be
> > > > > > > > > > > satisfied with approximate results, within some well
> > > > understood
> > > > > > > > bounds
> > > > > > > > > of
> > > > > > > > > > > error, there is an entire branch of mathematics and
> data
> > > > > science
> > > > > > that
> > > > > > > > > has
> > > > > > > > > > > evolved around developing algorithms that can produce
> > > > > approximate
> > > > > > > > > results
> > > > > > > > > > > with mathematically well-defined error properties.
> > > > > > > > > > >
> > > > > > > > > > > > Adding the requirements that these algorithms be small
> > > > > > > > > > > > (compared to the size of the input data), sublinear (the
> > > > > > > > > > > > size of the sketch must grow at a slower rate than the
> > > > > > > > > > > > size of the input stream), streaming (they can only touch
> > > > > > > > > > > > each data item once), and mergeable (suitable for
> > > > > > > > > > > > distributed processing) defines a class of algorithms
> > > > > > > > > > > > that can be described as small, stochastic, streaming,
> > > > > > > > > > > > sublinear, mergeable algorithms, commonly called sketches
> > > > > > > > > > > > (they also have other names, but we will use the term
> > > > > > > > > > > > sketches from here on).
> > > > > > > > > > >
> > > > > > > > > > > To be truly streaming and be able to process data in a
> > > single
> > > > > > pass,
> > > > > > > > > > > sketches must make absolute minimum assumptions about
> the
> > > > input
> > > > > > > > stream.
> > > > > > > > > > > This is critically important, as there is no “second
> > > chance”
> > > > to
> > > > > > > > > process the
> > > > > > > > > > > data.
> > > > > > > > > > >
> > > > > > > > > > > For example, sketches should not make assumptions about
> > the
> > > > > > order of
> > > > > > > > > stream
> > > > > > > > > > > items, the stream length, the dynamic range of values,
> or
> > > the
> > > > > > > > > distribution
> > > > > > > > > > > of item occurrence frequencies. Sketches should be
> > tolerant
> > > > of
> > > > > > NaNs,
> > > > > > > > > Nulls
> > > > > > > > > > > and empty objects. About the only thing that the sketch
> > > needs
> > > > > to
> > > > > > know
> > > > > > > > > about
> > > > > > > > > > > the stream is how to extract items from it and what
> type
> > > the
> > > > > > item is,
> > > > > > > > > e.g.,
> > > > > > > > > > > is it a numeric value or a string.
> > > > > > > > > > >
> > > > > > > > > > > As far as the sketch is concerned, the input stream is
> a
> > > > > > sequence of
> > > > > > > > > items
> > > > > > > > > > > in some unknown random order with unknown random
> values.
> > > > > > > > > > >
> > > > > > > > > > > > The sketch is essentially a complex state machine and,
> > > > > > > > > > > > combined with the random input stream, defines a
> > > > > > > > > > > > stochastic process. We
> then
> > > > apply
> > > > > > > > > > > probabilistic methods to interpret the states of the
> > > > stochastic
> > > > > > > > > process in
> > > > > > > > > > > order to extract useful information about the input
> > stream
> > > > > > itself.
> > > > > > > > The
> > > > > > > > > > > resulting information will be approximate, but we also
> > use
> > > > > > additional
> > > > > > > > > > > probabilistic methods to extract an estimate of the
> > likely
> > > > > > > > probability
> > > > > > > > > > > distribution of error.
> > > > > > > > > > >
> > > > > > > > > > > There is a significant scientific contribution here
> that
> > is
> > > > > > defining
> > > > > > > > > the
> > > > > > > > > > > state machine, understanding the resulting stochastic
> > > > process,
> > > > > > > > > developing
> > > > > > > > > > > > the probabilistic methods, and proving mathematically
> > that
> > > > it
> > > > > > all
> > > > > > > > > works!
> > > > > > > > > > > This is why the scientific contributors to this project
> > > are a
> > > > > > > > critical
> > > > > > > > > and
> > > > > > > > > > > strategic component to our success.  The development
> > > > engineers
> > > > > > > > > translate
> > > > > > > > > > > the concepts of the proposed state machine and
> > > probabilistic
> > > > > > methods
> > > > > > > > > into
> > > > > > > > > > > production-quality code. Even more important, they work
> > > > closely
> > > > > > with
> > > > > > > > > the
> > > > > > > > > > > scientists, feeding back system and user requirements,
> > > which
> > > > > > leads
> > > > > > > > not
> > > > > > > > > only
> > > > > > > > > > > to superior product design, but to new science as well.
> > A
> > > > > > number of
> > > > > > > > > > > scientific papers our members have published (see
> above)
> > > are a
> > > > > > direct
> > > > > > > > > result
> > > > > > > > > > > of this close collaboration.
> > > > > > > > > > >
> > > > > > > > > > > Because sketches are small they can be processed
> > extremely
> > > > > fast,
> > > > > > > > often
> > > > > > > > > many
> > > > > > > > > > > orders-of-magnitude faster than traditional exact
> > > > computations.
> > > > > > For
> > > > > > > > > > > interactive queries there may not be other viable
> > > > alternatives,
> > > > > > and
> > > > > > > > in
> > > > > > > > > the
> > > > > > > > > > > case of real-time analysis, sketches are the only known
> > > > > solution.
> > > > > > > > > > >
> > > > > > > > > > > For any system that needs to extract useful information
> > > from
> > > > > > massive
> > > > > > > > > > data,
> > > > > > > > > > > sketches are essential tools that should be tightly
> > > > integrated
> > > > > > into
> > > > > > > > the
> > > > > > > > > > > system’s analysis capabilities. This technology has
> > helped
> > > > > Yahoo
> > > > > > > > > > > successfully reduce data processing times from days to
> > > hours
> > > > or
> > > > > > > > > minutes on
> > > > > > > > > > > a number of its internal platforms and has enabled
> > > subsecond
> > > > > > queries
> > > > > > > > on
> > > > > > > > > > > real-time platforms that would have been infeasible
> > without
> > > > > > sketches.
> > > > > > > > > > > >
> > > > > > > > > > > > == The Rationale for Apache DataSketches ==
> > > > > > > > > > > >
> > > > > > > > > > > Other open source implementations of sketch algorithms
> > can
> > > be
> > > > > > found
> > > > > > > > on
> > > > > > > > > the
> > > > > > > > > > > Internet. However, we have not yet found any open
> source
> > > > > > > > > implementations
> > > > > > > > > > > that are as comprehensive, engineered with the quality
> > > > required
> > > > > > for
> > > > > > > > > > > production systems, and with usable and guaranteed
> error
> > > > > > properties.
> > > > > > > > > Large
> > > > > > > > > > > Internet companies, such as Google and Facebook, have
> > > > published
> > > > > > > > papers
> > > > > > > > > on
> > > > > > > > > > > > sketching; however, their implementations of their
> > > published
> > > > > > > > > algorithms are
> > > > > > > > > > > proprietary and not available as open source.
> > > > > > > > > > >
> > > > > > > > > > > The DataSketches library already provides integrations
> > > with a
> > > > > > number
> > > > > > > > of
> > > > > > > > > > > major Apache data processing platforms such as Apache
> > Hive,
> > > > > > Apache
> > > > > > > > Pig,
> > > > > > > > > > > Apache Spark and Apache Druid, and is also integrated
> > with
> > > a
> > > > > > number
> > > > > > > > of
> > > > > > > > > > > other open source data processing platforms such as
> > Splice
> > > > > > Machine,
> > > > > > > > > GCHQ
> > > > > > > > > > > Gaffer and PostgreSQL.
> > > > > > > > > > >
> > > > > > > > > > > We believe that having DataSketches as an Apache
> project
> > > will
> > > > > > provide
> > > > > > > > > an
> > > > > > > > > > > immediate, worthwhile, and substantial contribution to
> > the
> > > > open
> > > > > > > > source
> > > > > > > > > > > community, will have a better opportunity to provide a
> > > > > meaningful
> > > > > > > > > > > contribution to both the science and engineering of
> > > sketching
> > > > > > > > > algorithms,
> > > > > > > > > > > and integrate with other Apache projects.  In addition,
> > > this
> > > > > is a
> > > > > > > > > > > significant opportunity for Apache to be the "go-to"
> > > > > destination
> > > > > > for
> > > > > > > > > users
> > > > > > > > > > > that want to leverage this exciting technology.
> > > > > > > > > > >
> > > > > > > > > > > == Initial Goals ==
> > > > > > > > > > >
> > > > > > > > > > > We are breaking our initial goals into short-term (2-6
> > > > months)
> > > > > > and
> > > > > > > > > > > > intermediate to long-term (6 months to 2 years):
> > > > > > > > > > >
> > > > > > > > > > > Our short-term goals include:
> > > > > > > > > > >
> > > > > > > > > > > * Understanding and adapting to the Apache development
> > > > process
> > > > > > and
> > > > > > > > > > > structures.
> > > > > > > > > > >
> > > > > > > > > > > > * Start refactoring the codebase and move the code in the
> > > > > > > > > > > > various DataSketches repositories to an Apache Git repository.
> > > > > > > > > > >
> > > > > > > > > > > * Continue development of new features, functions, and
> > > fixes.
> > > > > > > > > > >
> > > > > > > > > > > * Specific sub-projects (e.g., C++ and Python) will
> > > continue
> > > > to
> > > > > > be
> > > > > > > > > > > developed and expanded.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > The intermediate to long term goals include:
> > > > > > > > > > >
> > > > > > > > > > > * Completing the design and implementation of the C++
> > > > sketches
> > > > > to
> > > > > > > > > > > complement what is already available in Java, and the
> > > Python
> > > > > > wrappers
> > > > > > > > > of
> > > > > > > > > > > those C++ sketches.
> > > > > > > > > > >
> > > > > > > > > > > * Expanding the C++ build framework to include Windows
> > and
> > > > the
> > > > > > > > popular
> > > > > > > > > > > Linux variants.
> > > > > > > > > > >
> > > > > > > > > > > * Continued engagement with the scientific research
> > > community
> > > > > on
> > > > > > the
> > > > > > > > > > > development of new algorithms for computationally
> > difficult
> > > > > > problems
> > > > > > > > > that
> > > > > > > > > > > heretofore have not had a sketching solution.
> > > > > > > > > > >
> > > > > > > > > > > == Current Status ==
> > > > > > > > > > >
> > > > > > > > > > > The DataSketches GitHub project has been quite
> > successful.
> > > > As
> > > > > of
> > > > > > > > this
> > > > > > > > > > > > writing (February 2019), the number of downloads measured by
> > the
> > > > > Nexus
> > > > > > > > > > > Repository Manager at https://oss.sonatype.org has
> grown
> > > by
> > > > > > nearly a
> > > > > > > > > > > factor
> > > > > > > > > > > of 10 over the past year to about 55 thousand per
> month.
> > > The
> > > > > > > > > > > DataSketches/sketches-core repository has about 560
> stars
> > > and
> > > > > 141
> > > > > > > > > forks,
> > > > > > > > > > > which is pretty good for a highly specialized library.
> > > > > > > > > > >
> > > > > > > > > > > === Development Practices ===
> > > > > > > > > > >
> > > > > > > > > > > ==== Source Control ====
> > > > > > > > > > >
> > > > > > > > > > > All of our developers have extensive experience with
> Git
> > > > > version
> > > > > > > > > control
> > > > > > > > > > > and follow accepted practices for use of Pull Requests
> > > (PRs),
> > > > > > code
> > > > > > > > > reviews
> > > > > > > > > > > and commits to master, for example.
> > > > > > > > > > >
> > > > > > > > > > > ==== Testing ====
> > > > > > > > > > >
> > > > > > > > > > > > Sketches, by their nature, are probabilistic programs
> and
> > > > don’t
> > > > > > > > > necessarily
> > > > > > > > > > > behave deterministically.  For some of the sketches we
> > > > > > intentionally
> > > > > > > > > insert
> > > > > > > > > > > random noise into the code as this gives us the
> > > mathematical
> > > > > > > > properties
> > > > > > > > > > > that we need to guarantee accuracy.  This can make the
> > > > behavior
> > > > > > of
> > > > > > > > > these
> > > > > > > > > > > algorithms quite unintuitive and provides significant
> > > > > challenges
> > > > > > to
> > > > > > > > the
> > > > > > > > > > > developer who wishes to test these algorithms for
> > > > correctness.
> > > > > > As a
> > > > > > > > > result,
> > > > > > > > > > > our testing strategy includes two major components:
> unit
> > > > tests,
> > > > > > and
> > > > > > > > > > > characterization tests.
> > > > > > > > > > >
> > > > > > > > > > > ===== Unit Testing =====
> > > > > > > > > > >
> > > > > > > > > > > Our unit tests are primarily quick tests to make sure
> > that
> > > we
> > > > > > > > exercise
> > > > > > > > > all
> > > > > > > > > > > critical paths in the code and that key branches are
> > > executed
> > > > > > > > > correctly. It
> > > > > > > > > > > is important that they execute relatively fast as they
> > are
> > > > > > generally
> > > > > > > > > run on
> > > > > > > > > > > every code build. The sketches-core repository alone
> has
> > > > about
> > > > > 22
> > > > > > > > > thousand
> > > > > > > > > > > statements, over 1300 unit tests and code coverage of
> > about
> > > > > > 98.2% as
> > > > > > > > > > > measured by Atlassian/Clover.  It is our goal for all
> of
> > > our
> > > > > code
> > > > > > > > > > > repositories that are used in production that they have
> > > code
> > > > > > coverage
> > > > > > > > > > > greater than 90%.
> > > > > > > > > > >
> > > > > > > > > > > ===== Characterization Testing =====
> > > > > > > > > > >
> > > > > > > > > > > In order to test the probabilistic methods that are
> used
> > to
> > > > > > interpret
> > > > > > > > > the
> > > > > > > > > > > stochastic behaviors of our sketches we have a separate
> > > > > > > > > characterization
> > > > > > > > > > > repository that is dedicated to this.  To measure
> > accuracy,
> > > > for
> > > > > > > > > example,
> > > > > > > > > > > requires running thousands of trials at each of many
> > > > different
> > > > > > points
> > > > > > > > > along
> > > > > > > > > > > the domain axis. Each trial compares its estimated
> > results
> > > > > > against a
> > > > > > > > > known
> > > > > > > > > > > exact result producing an error for that trial.  These
> > > error
> > > > > > > > > measurements
> > > > > > > > > > > are then fed into our Quantiles sketch to capture the
> > > actual
> > > > > > > > > distribution
> > > > > > > > > > > of error at that point along the axis. We then select
> > > > quantile
> > > > > > > > contours
> > > > > > > > > > > across all the distributions at points along the axis.
> > > These
> > > > > > > > contours
> > > > > > > > > can
> > > > > > > > > > > then be plotted to reveal the shape of the actual error
> > > > > > distribution.
> > > > > > > > > These
> > > > > > > > > > > > distributions are not at all Gaussian; in fact, they can
> > be
> > > > > quite
> > > > > > > > > complex.
> > > > > > > > > > > Nonetheless, these distributions are then checked
> against
> > > our
> > > > > > > > > statistical
> > > > > > > > > > > guarantees inherent to the specific sketch algorithm
> and
> > > its
> > > > > > > > > parameters.
> > > > > > > > > > > There are many examples of these characterization error
> > > > > > distributions
> > > > > > > > > on
> > > > > > > > > > > our website. The runtimes of these tests can be very
> long
> > > and
> > > > > can
> > > > > > > > range
> > > > > > > > > > > from many minutes to hours, and some can run for days.
> > > > > > Currently, we
> > > > > > > > > have
> > > > > > > > > > > separate characterization repositories for Java and
> C++ /
> > > > > Python.
> > > > > > > > > > >
> > > > > > > > > > > It is our goal that we perform this characterization
> > > analysis
> > > > > > for all
> > > > > > > > > of
> > > > > > > > > > > our sketches.  By definition, the code that runs these
> > > > > > > > characterization
> > > > > > > > > > > tests is open-source so others can run these tests as
> > well.
> > > > We
> > > > > > do
> > > > > > > > not
> > > > > > > > > have
> > > > > > > > > > > formal releases of this code (because it is not
> > production
> > > > > code)
> > > > > > and
> > > > > > > > > it is
> > > > > > > > > > > not published to Maven Central.
> > > > > > > > > > >
> > > > > > > > > > > === Meritocracy ===
> > > > > > > > > > >
> > > > > > > > > > > DataSketches was initially developed based on
> > requirements
> > > > > within
> > > > > > > > > Yahoo. As
> > > > > > > > > > > a project on GitHub, DataSketches has received
> > > contributions
> > > > > from
> > > > > > > > > numerous
> > > > > > > > > > > individual developers from around the world, dedicated
> > > > research
> > > > > > work
> > > > > > > > > from
> > > > > > > > > > > senior scientists at Amazon and Visa, and academic
> > > > researchers
> > > > > > from
> > > > > > > > > > > Georgetown University, Princeton, and MIT.
> > > > > > > > > > >
> > > > > > > > > > > As a project under incubation, we are committed to
> > > expanding
> > > > > our
> > > > > > > > > effort to
> > > > > > > > > > > build an environment which supports a meritocracy. We
> are
> > > > > > focused on
> > > > > > > > > > > engaging the community and other related projects for
> > > support
> > > > > and
> > > > > > > > > > > > contributions. Moreover, we are committed to ensuring that
> > > > > contributors
> > > > > > and
> > > > > > > > > > > committers to DataSketches come from a broad mix of
> > > > > organizations
> > > > > > > > > through a
> > > > > > > > > > > merit-based decision process during incubation. We
> > believe
> > > > > > strongly
> > > > > > > > in
> > > > > > > > > the
> > > > > > > > > > > DataSketches premise that fulfills the concept of a
> well
> > > > > > engineered
> > > > > > > > and
> > > > > > > > > > > scientifically rigorous library that implements these
> > > > powerful
> > > > > > > > > algorithms
> > > > > > > > > > > and are committed to growing an inclusive community of
> > > > > > DataSketches
> > > > > > > > > > > contributors and users.
> > > > > > > > > > >
> > > > > > > > > > > === Community ===
> > > > > > > > > > >
> > > > > > > > > > > Yahoo has a long history and active engagement in the
> > Open
> > > > > Source
> > > > > > > > > > > community. Major projects include: Vespa.ai, Bullet,
> > > Moloch,
> > > > > > > > Panoptes,
> > > > > > > > > > > Screwdriver.cd, Athenz, HaloDB, Maha, Mendel,
> > > > > TensorFlowOnSpark,
> > > > > > > > > gifshot,
> > > > > > > > > > > fluxible, as well as the creation, contribution and
> > > > incubation
> > > > > of
> > > > > > > > many
> > > > > > > > > > > Apache projects such as Apache Hadoop, Pig, Bookkeeper,
> > > > Oozie,
> > > > > > > > > Zookeeper,
> > > > > > > > > > > Omid, Pulsar, Traffic Server, Storm, Druid, and many
> > more.
> > > > > > > > > > >
> > > > > > > > > > > > Every day, DataSketches is actively used by
> > organizations
> > > > and
> > > > > > > > > > > institutions around the world for batch and stream
> > > processing
> > > > > of
> > > > > > > > data.
> > > > > > > > > We
> > > > > > > > > > > believe acceptance will allow us to consolidate
> existing
> > > > > > > > > > > DataSketches-related work, grow the DataSketches
> > community,
> > > > and
> > > > > > > > deepen
> > > > > > > > > > > connections between DataSketches and other open source
> > > > > projects.
> > > > > > > > > > >
> > > > > > > > > > > === Introduction to the Core Developers & Contributors
> > ===
> > > > > > > > > > >
> > > > > > > > > > > The core developers and contributors for DataSketches
> are
> > > > from
> > > > > > > > diverse
> > > > > > > > > > > backgrounds, but primarily are scientists that love
> > > > engineering
> > > > > > and
> > > > > > > > > > > engineers that love science. A large part of the value
> we
> > > > bring
> > > > > > comes
> > > > > > > > > from
> > > > > > > > > > > this synthesis.  These individuals have already
> > contributed
> > > > > > > > > substantially
> > > > > > > > > > > to the code, algorithms, and/or mathematical proofs
> that
> > > form
> > > > > the
> > > > > > > > > basis of
> > > > > > > > > > > the library.
> > > > > > > > > > >
> > > > > > > > > > > > This core group also forms the Initial Committers with
> > write
> > > > > > > > > permissions to
> > > > > > > > > > > > the repository. Those marked with (*) meet weekly to
> plan
> > > the
> > > > > > > > research
> > > > > > > > > and
> > > > > > > > > > > engineering direction of the project.
> > > > > > > > > > >
> > > > > > > > > > > ==== Scientists That Love Engineering ====
> > > > > > > > > > >
> > > > > > > > > > > * Eshcar Hillel: Senior Research Scientist, Yahoo Labs,
> > > > Israel.
> > > > > > > > > Interests:
> > > > > > > > > > > distributed systems, scalable systems and platforms for
> > big
> > > > > data
> > > > > > > > > > > > processing, concurrent algorithms and data structures.
> > > > > > > > > > >
> > > > > > > > > > > * Kevin Lang: (*) Distinguished Research Scientist,
> Yahoo
> > > > Labs,
> > > > > > > > > Sunnyvale,
> > > > > > > > > > > California. Interests: algorithms, theoretical and
> > applied
> > > > > > > > mathematics,
> > > > > > > > > > > encoding and compression theory, theoretical and
> applied
> > > > > > performance
> > > > > > > > > > > optimization.
> > > > > > > > > > >
> > > > > > > > > > > * Edo Liberty: (*) Director of Research, Head of Amazon
> > AI
> > > > > Labs,
> > > > > > Palo
> > > > > > > > > Alto,
> > > > > > > > > > > California. Manages the algorithms group at Amazon AI.
> We
> > > > build
> > > > > > > > > scalable
> > > > > > > > > > > machine learning systems and algorithms which are used
> > both
> > > > > > > > internally
> > > > > > > > > and
> > > > > > > > > > > externally by customers of SageMaker, AWS's flagship
> > > machine
> > > > > > learning
> > > > > > > > > > > platform.
> > > > > > > > > > >
> > > > > > > > > > > * Jon Malkin: (*) Senior Scientist, Yahoo Labs,
> > Sunnyvale.
> > > > > > Interests:
> > > > > > > > > > > Computational advertising, machine learning, speech
> > > > > recognition,
> > > > > > > > > > > data-driven analysis, large scale experimentation, big
> > > data,
> > > > > > > > > stream/complex
> > > > > > > > > > > event processing
> > > > > > > > > > >
> > > > > > > > > > > * Justin Thaler: (*) Assistant Professor, Department of
> > > > > Computer
> > > > > > > > > Science,
> > > > > > > > > > > Georgetown University, Washington D.C. Interests:
> > > algorithms
> > > > > and
> > > > > > > > > > > computational complexity, complexity theory, quantum
> > > > > algorithms,
> > > > > > > > > private
> > > > > > > > > > > data analysis, and learning theory, developing
> efficient
> > > > > > streaming
> > > > > > > > and
> > > > > > > > > > > sketching algorithms
> > > > > > > > > > >
> > > > > > > > > > > ==== Engineers That Love Science ====
> > > > > > > > > > >
> > > > > > > > > > > * Roman Leventov: Senior Software Engineer,
> Metamarkets
> > /
> > > > > Snap.
> > > > > > > > > Interests:
> > > > > > > > > > > design and implementation of data storing and data
> > > processing
> > > > > > > > > (distributed)
> > > > > > > > > > > systems, performance optimization, CPU performance,
> > > > mechanical
> > > > > > > > > sympathy,
> > > > > > > > > > > JVM performance, API design, databases, (concurrent)
> data
> > > > > > structures,
> > > > > > > > > > > memory management, garbage collection algorithms,
> > language
> > > > > > design and
> > > > > > > > > > > runtimes (their tradeoffs), distributed systems (cloud)
> > > > > > efficiency,
> > > > > > > > > Linux,
> > > > > > > > > > > code quality, code transformation, pure functional
> > > > programming
> > > > > > > > models,
> > > > > > > > > > > Haskell.
> > > > > > > > > > >
> > > > > > > > > > > * Lee Rhodes: (*) Distinguished Architect, lead
> developer
> > > and
> > > > > > founder
> > > > > > > > > of
> > > > > > > > > > > the DataSketches project, Yahoo, Sunnyvale, California.
> > > > > > Interests:
> > > > > > > > > > > streaming algorithms, mathematics, computer science,
> high
> > > > > > quality and
> > > > > > > > > high
> > > > > > > > > > > performance code for the analysis of massive data,
> > bridging
> > > > the
> > > > > > > > divide
> > > > > > > > > > > between theory and practice.
> > > > > > > > > > >
> > > > > > > > > > > * Alexander Saydakov: (*) Senior Software Engineer,
> > Yahoo,
> > > > > > Sunnyvale,
> > > > > > > > > > > California. Interests: applied mathematics, computer
> > > science,
> > > > > big
> > > > > > > > data,
> > > > > > > > > > > distributed systems.
> > > > > > > > > > >
> > > > > > > > > > > === Introduction to Additional Interested Contributors
> > ===
> > > > > > > > > > >
> > > > > > > > > > > > These folks have been involved and have contributed
> > > > > > > > > > > > intermittently, but they are strong supporters of this
> > > > > > > > > > > > project.
> > > > > > > > > > >
> > > > > > > > > > > * Frank Grimes: GitHub ID: frankgrimes97
> > > > > > > > > > >
> > > > > > > > > > > * Mina Ghashami: [mina.ghashami at gmail dot com] Ph.D.
> > > > > Computer
> > > > > > > > > Science,
> > > > > > > > > > > Univ of Utah. Interests: Machine Learning, Data Mining,
> > > > matrix
> > > > > > > > > > > approximation, streaming algorithms, randomized linear
> > > > algebra.
> > > > > > > > > > >
> > > > > > > > > > > * Christopher Musco: [christopher.musco at gmail dot
> com]
> > > > Ph.D.
> > > > > > > > > Computer
> > > > > > > > > > > Science, Research Instructor, Princeton University.
> > > > Interests:
> > > > > > > > > algorithmic
> > > > > > > > > > > foundations of data science and machine learning,
> > efficient
> > > > > > methods
> > > > > > > > for
> > > > > > > > > > > processing and understanding large datasets, often
> > working
> > > at
> > > > > the
> > > > > > > > > > > intersection of theoretical computer science, numerical
> > > > linear
> > > > > > > > > algebra, and
> > > > > > > > > > > optimization.
> > > > > > > > > > >
> > > > > > > > > > > * Graham Cormode: [g.cormode at warwick.ac dot uk]
> Ph.D.
> > > > > > Computer
> > > > > > > > > Science,
> > > > > > > > > > > Professor, Warwick University, Warwick, England.
> > Interests:
> > > > all
> > > > > > > > > aspects of
> > > > > > > > > > > the "data lifecycle", from data collection and
> cleaning,
> > > > > through
> > > > > > > > > mining and
> > > > > > > > > > > analytics. (Professor Cormode is one of the world’s
> > leading
> > > > > > > > scientists
> > > > > > > > > in
> > > > > > > > > > > sketching algorithms)
> > > > > > > > > > >
> > > > > > > > > > > === Alignment ===
> > > > > > > > > > >
> > > > > > > > > > > The DataSketches library already provides integrations
> > and
> > > > > > example
> > > > > > > > > code for
> > > > > > > > > > > Apache Hive, Apache Pig, Apache Spark and is deeply
> > > > integrated
> > > > > > into
> > > > > > > > > Apache
> > > > > > > > > > > Druid.
> > > > > > > > > > >
> > > > > > > > > > > == Known Risks ==
> > > > > > > > > > >
> > > > > > > > > > > The following subsections are specific risks that have
> > been
> > > > > > > > identified
> > > > > > > > > by
> > > > > > > > > > > the ASF that need to be addressed.
> > > > > > > > > > >
> > > > > > > > > > > === Risk: Orphaned Products ===
> > > > > > > > > > >
> > > > > > > > > > > The DataSketches library is presently used by a number
> of
> > > > > > > > > organizations,
> > > > > > > > > > > from small startups to Fortune 100 companies, to
> > construct
> > > > > > production
> > > > > > > > > > > pipelines that must process and analyze massive data.
> > Yahoo
> > > > > has a
> > > > > > > > > long-term
> > > > > > > > > > > commitment to continue to advance the DataSketches
> > library;
> > > > > > moreover,
> > > > > > > > > > > DataSketches is seeing increasing interest,
> development,
> > > and
> > > > > > adoption
> > > > > > > > > from
> > > > > > > > > > > many diverse organizations from around the world. Due
> to
> > > its
> > > > > > growing
> > > > > > > > > > > adoption, we feel it is quite unlikely that this
> project
> > > > would
> > > > > > become
> > > > > > > > > > > orphaned.
> > > > > > > > > > >
> > > > > > > > > > > === Risk: Inexperience with Open Source ===
> > > > > > > > > > >
> > > > > > > > > > > Yahoo believes strongly in open source and the exchange
> > of
> > > > > > > > information
> > > > > > > > > to
> > > > > > > > > > > advance new ideas and work. Examples of this commitment
> > are
> > > > > > active
> > > > > > > > open
> > > > > > > > > > > source projects such as those mentioned above. With
> > > > > > DataSketches, we
> > > > > > > > > have
> > > > > > > > > > > been increasingly open and forward-looking; we have
> > > > published a
> > > > > > > > number
> > > > > > > > > of
> > > > > > > > > > > papers about breakthrough developments in the science
> of
> > > > > > streaming
> > > > > > > > > > > algorithms (mentioned above) that also reference the
> > > > > DataSketches
> > > > > > > > > library.
> > > > > > > > > > > Our submission to the Apache Software Foundation is a
> > > logical
> > > > > > > > > extension of
> > > > > > > > > > > our commitment to open source software.
> > > > > > > > > > >
> > > > > > > > > > > Key committers at Yahoo with strong open source
> > backgrounds
> > > > > > include
> > > > > > > > > Aaron
> > > > > > > > > > > Gresch, Alan Carroll, Alessandro Bellina, Anastasia
> > > > Braginsky,
> > > > > > > > Andrews
> > > > > > > > > > > Sahaya Albert, Arun S A G, Atul Mohan, Brad McMillen,
> > Bryan
> > > > > Call,
> > > > > > > > Daryn
> > > > > > > > > > > Sharp, Dav Glass, David Carlin, Derek Dagit, Eric
> Payne,
> > > > Eshcar
> > > > > > > > Hillel,
> > > > > > > > > > > Ethan Li, Fei Deng, Francis Christopher Liu, Francisco
> > > > > > > > Perez-Sorrosal,
> > > > > > > > > Gil
> > > > > > > > > > > > Yehuda, Govind Menon, Hang Yang, Jacob Estelle, Jai
> > Asher,
> > > > > James
> > > > > > > > > Penick,
> > > > > > > > > > > Jason Kenny, Jay Pipes, Jim Rollenhagen, Joe Francis,
> Jon
> > > > > Eagles,
> > > > > > > > > Kihwal
> > > > > > > > > > > Lee, Kishorkumar Patil, Koji Noguchi, Kuhu Shukla,
> > Michael
> > > > > > Trelinski,
> > > > > > > > > > > Mithun Radhakrishnan, Nathan Roberts, Ohad Shacham,
> Olga
> > L.
> > > > > > > > Natkovich,
> > > > > > > > > > > Parth Kamlesh Gandhi, Rajan Dhabalia, Rohini
> Palaniswamy,
> > > > Ruby
> > > > > > Loo,
> > > > > > > > > Ryan
> > > > > > > > > > > Bridges, Sanket Chintapalli, Satish Subhashrao Saley,
> Shu
> > > Kit
> > > > > > Chan,
> > > > > > > > Sri
> > > > > > > > > > > Harsha Mekala, Susan Hinrichs, Yonatan Gottesman, and
> > many
> > > > > more.
> > > > > > > > > > >
> > > > > > > > > > > > All of our core developers are committed to learning about
> > the
> > > > > > Apache
> > > > > > > > > process
> > > > > > > > > > > and to give back to the community.
> > > > > > > > > > >
> > > > > > > > > > > === Risk: Homogeneous Developers ===
> > > > > > > > > > >
> > > > > > > > > > > The majority of committers in this proposal belong to
> > Yahoo
> > > > due
> > > > > > to
> > > > > > > > the
> > > > > > > > > fact
> > > > > > > > > > > that DataSketches has emerged from an internal Yahoo
> > > project.
> > > > > > This
> > > > > > > > > proposal
> > > > > > > > > > > also includes developers and contributors from other
> > > > companies,
> > > > > > and
> > > > > > > > > who are
> > > > > > > > > > > actively involved with other Apache projects, such as
> > > Druid.
> > > > > We
> > > > > > > > > expect our
> > > > > > > > > > > entry into incubation will allow us to expand the
> number
> > of
> > > > > > > > > individuals and
> > > > > > > > > > > organizations participating in DataSketches
> development.
> > > > > > > > > > >
> > > > > > > > > > > === Risk: Reliance on Salaried Developers ===
> > > > > > > > > > >
> > > > > > > > > > > Because the DataSketches library originated within
> Yahoo,
> > > it
> > > > > has
> > > > > > been
> > > > > > > > > > > developed primarily by salaried Yahoo developers and we
> > > > expect
> > > > > > that
> > > > > > > > to
> > > > > > > > > > > continue to be the case near term. However, since we
> > placed
> > > > > this
> > > > > > > > > library
> > > > > > > > > > > into open-source we have had a number of significant
> > > > > > contributions
> > > > > > > > from
> > > > > > > > > > > engineers and scientists from outside of Yahoo. We
> expect
> > > our
> > > > > > > > reliance
> > > > > > > > > on
> > > > > > > > > > > Yahoo salaried developers will decrease over time.
> > > > Nonetheless,
> > > > > > Yahoo
> > > > > > > > > is
> > > > > > > > > > > committed to continue its strong support of this
> > important
> > > > > > project.
> > > > > > > > > > >
> > > > > > > > > > > > === Risk: Lack of Relationship to other Apache Products ===
> > > > > > > > > > >
> > > > > > > > > > > DataSketches already directly interoperates with or
> > > utilizes
> > > > > > several
> > > > > > > > > > > existing Apache projects.
> > > > > > > > > > >
> > > > > > > > > > > * Build
> > > > > > > > > > >    * Apache Maven
> > > > > > > > > > >
> > > > > > > > > > > * Integrations and adaptors for the following projects
> > > > > naturally
> > > > > > have
> > > > > > > > > them
> > > > > > > > > > > as dependencies
> > > > > > > > > > >    * Apache Hive
> > > > > > > > > > >    * Apache Pig
> > > > > > > > > > >    * Apache Druid
> > > > > > > > > > >    * Apache Spark
> > > > > > > > > > >
> > > > > > > > > > > * Additional dependencies for the above integrations
> and
> > > > > adaptors
> > > > > > > > > include
> > > > > > > > > > >    * Apache Hadoop
> > > > > > > > > > >    * Apache Commons (Math)
> > > > > > > > > > >
> > > > > > > > > > > There is no other Apache project that we are aware of
> > that
> > > > > > duplicates
> > > > > > > > > the
> > > > > > > > > > > functionality of the DataSketches library.
> > > > > > > > > > >
> > > > > > > > > > > > === Risk: An Excessive Fascination with the Apache Brand ===
> > > > > > > > > > >
> > > > > > > > > > > With this proposal we are not seeking attention or
> > > publicity.
> > > > > > Rather,
> > > > > > > > > we
> > > > > > > > > > > firmly believe in the DataSketches library and concept
> > and
> > > > the
> > > > > > > > ability
> > > > > > > > > to
> > > > > > > > > > > make the DataSketches library a powerful, yet
> > simple-to-use
> > > > > > toolkit
> > > > > > > > for
> > > > > > > > > > > data processing. While the DataSketches library has
> been
> > > open
> > > > > > source,
> > > > > > > > > we
> > > > > > > > > > > believe putting code on GitHub can only go so far. We
> see
> > > the
> > > > > > Apache
> > > > > > > > > > > community, processes, and mission as critical for
> > ensuring
> > > > the
> > > > > > > > > DataSketches
> > > > > > > > > > > library is truly community-driven, positively
> impactful,
> > > and
> > > > > > > > innovative
> > > > > > > > > > > open source software. While Yahoo has taken a number of
> > > steps
> > > > > to
> > > > > > > > > advance
> > > > > > > > > > > its various open source projects, we believe the
> > > DataSketches
> > > > > > library
> > > > > > > > > > > project is a great fit for the Apache Software
> Foundation
> > > due
> > > > > to
> > > > > > its
> > > > > > > > > focus
> > > > > > > > > > > on data processing and its relationships to existing
> ASF
> > > > > > projects.
> > > > > > > > > > >
> > > > > > > > > > > === Risk: Cryptography ===
> > > > > > > > > > >
> > > > > > > > > > > DataSketches does not contain any cryptographic code
> and
> > is
> > > > > not a
> > > > > > > > > > > cryptographic product.
> > > > > > > > > > >
> > > > > > > > > > > == Documentation ==
> > > > > > > > > > >
> > > > > > > > > > > The following documentation is relevant to this
> proposal.
> > > > > > Relevant
> > > > > > > > > portions
> > > > > > > > > > > of the documentation will be contributed to the Apache
> > > > > > DataSketches
> > > > > > > > > > > project.
> > > > > > > > > > >
> > > > > > > > > > > > * DataSketches website: https://datasketches.github.io
> > > > > > > > > > >
> > > > > > > > > > > * DataSketches website repository:
> > > > > > > > > > > https://github.com/DataSketches/DataSketches.github.io
> > > > > > > > > > >
> > > > > > > > > > > > We will need an Apache website for this documentation, similar to
> > > > > > > > > > >
> > > > > > > > > > > * https://datasketches.apache.org
> > > > > > > > > > >
> > > > > > > > > > > == Initial Source ==
> > > > > > > > > > >
> > > > > > > > > > > The initial source for DataSketches which we will
> submit
> > to
> > > > the
> > > > > > > > Apache
> > > > > > > > > > > Foundation will include a number of repositories which
> > are
> > > > > > currently
> > > > > > > > > hosted
> > > > > > > > > > > under the GitHub.com/datasketches organization:
> > > > > > > > > > >
> > > > > > > > > > > All github.com/datasketches repositories including:
> > > > > > > > > > >
> > > > > > > > > > > * Java
> > > > > > > > > > >    * sketches-core: This repository has the core
> > sketching
> > > > > > classes,
> > > > > > > > > which
> > > > > > > > > > > are leveraged by some of the other repositories. This
> > > > > repository
> > > > > > has
> > > > > > > > no
> > > > > > > > > > > external dependencies outside of the
> DataSketches/memory
> > > > > > repository,
> > > > > > > > > Java
> > > > > > > > > > > and TestNG for unit tests. This code is versioned and
> the
> > > > > latest
> > > > > > > > > release
> > > > > > > > > > > can be obtained from Maven Central.
> > > > > > > > > > >    * memory: Low level, high-performance memory
> > > > data-structure
> > > > > > > > > management
> > > > > > > > > > > primarily for off-heap.
> > > > > > > > > > >    * sketches-android: This is a new repository
> dedicated
> > > to
> > > > > > sketches
> > > > > > > > > > > designed to be run in a mobile client, such as a cell
> > > phone.
> > > > It
> > > > > > is
> > > > > > > > > still in
> > > > > > > > > > > development and should be considered experimental.
> > > > > > > > > > >    * sketches-hive: This repository contains Hive UDFs
> > and
> > > > > UDAFs
> > > > > > for
> > > > > > > > > use
> > > > > > > > > > > within Hadoop grid environments. This code has
> > dependencies
> > > > on
> > > > > > > > > > > sketches-core as well as Hadoop and Hive. Users of this
> > > code
> > > > > are
> > > > > > > > > advised to
> > > > > > > > > > > use Maven to bring in all the required dependencies.
> This
> > > > code
> > > > > is
> > > > > > > > > versioned
> > > > > > > > > > > and the latest release can be obtained from Maven
> > Central.
> > > > > > > > > > >    * sketches-pig: This repository contains Pig User
> > > Defined
> > > > > > > > Functions
> > > > > > > > > > > (UDF) for use within Hadoop grid environments. This
> code
> > > has
> > > > > > > > > dependencies
> > > > > > > > > > > on sketches-core as well as Hadoop and Pig. Users of
> this
> > > > code
> > > > > > are
> > > > > > > > > advised
> > > > > > > > > > > to use Maven to bring in all the required dependencies.
> > > This
> > > > > > code is
> > > > > > > > > > > versioned and the latest release can be obtained from
> > Maven
> > > > > > Central.
> > > > > > > > > > >    * sketches-vector: This is a new repository
> dedicated
> > to
> > > > > > sketches
> > > > > > > > > for
> > > > > > > > > > > vector and matrix operations. It is still somewhat
> > > > > experimental.
> > > > > > > > > > >    * characterization: This relatively new repository
> is
> > > for
> > > > > code
> > > > > > > > that
> > > > > > > > > we
> > > > > > > > > > > use to characterize the accuracy and speed performance
> of
> > > the
> > > > > > > > sketches
> > > > > > > > > in
> > > > > > > > > > > the library and is constantly being updated. Examples
> of
> > > the
> > > > > job
> > > > > > > > > command
> > > > > > > > > > > files used for various tests can be found in the
> > > > > > src/main/resources
> > > > > > > > > > > > directory. Some of these tests can run for hours, depending on their
> > > > > > > > > > > > configuration.
> > > > > > > > > > >    * experimental: This repository is an experimental
> > > staging
> > > > > > area
> > > > > > > > for
> > > > > > > > > code
> > > > > > > > > > > that will eventually end up in another repository. This
> > > code
> > > > is
> > > > > > not
> > > > > > > > > > > versioned and not registered with Maven Central.
> > > > > > > > > > >    * sketches-misc: Demos and other code not related to
> > > > > > production
> > > > > > > > > > > deployment
> > > > > > > > > > >
> > > > > > > > > > > * C++ and Python
> > > > > > > > > > >    * sketches-core-cpp: This is the C++/Python
> companion
> > to
> > > > the
> > > > > > Java
> > > > > > > > > > > sketches-core. These implementations are binary
> > compatible
> > > > with
> > > > > > their
> > > > > > > > > > > counterparts in Java. In other words, a sketch created
> > and
> > > > > > stored in
> > > > > > > > > C++
> > > > > > > > > > > > can be opened and read in Java and vice versa. This
> site
> > > also
> > > > > > has our
> > > > > > > > > > > Python adaptors that basically wrap the C++
> > > implementations,
> > > > > > making
> > > > > > > > the
> > > > > > > > > > > high performance C++ implementations available from
> > Python.
> > > > > > > > > > >    * sketches-postgres: This site provides the
> > > > > postgres-specific
> > > > > > > > > adaptors
> > > > > > > > > > > that wrap the C++ implementations making them available
> > to
> > > > the
> > > > > > > > Postgres
> > > > > > > > > > > database users.
> > > > > > > > > > >    * characterization-cpp: This is the C++/Python
> > companion
> > > > to
> > > > > > the
> > > > > > > > Java
> > > > > > > > > > > characterization repository.
> > > > > > > > > > >    * experimental-cpp: This repository is an
> experimental
> > > > > staging
> > > > > > > > area
> > > > > > > > > for
> > > > > > > > > > > C++ code that will eventually end up in another
> > repository.
> > > > > > > > > > >
> > > > > > > > > > > * Command-Line Tools
> > > > > > > > > > >    * sketches-cmd
> > > > > > > > > > >    * homebrew-sketches
> > > > > > > > > > >    * homebrew-sketches-cmd
> > > > > > > > > > >
> > > > > > > > > > > These projects have always been Apache 2.0 licensed. We
> > > > intend
> > > > > to
> > > > > > > > > bundle
> > > > > > > > > > > all of these repositories since they are all
> > complementary
> > > > and
> > > > > > should
> > > > > > > > > be
> > > > > > > > > > > maintaine

-- 
From my cell phone.

Re: DataSketches Proposal - Google Docs Link

Posted by Kenneth Knowles <ke...@apache.org>.

Done. Sorry for that oversight.

Kenn

On Mon, Feb 25, 2019 at 10:01 PM leerho@gmail.com <le...@gmail.com> wrote:

> As primary author, can I be given the ability to directly edit?
> On 2019/02/26 05:37:22, Kenneth Knowles <ke...@apache.org> wrote:
> > It isn't too much work, so I've done it:
> > https://s.apache.org/datasketches-proposal-draft
> >
> > Kenn
> >
> > On Mon, Feb 25, 2019 at 9:31 PM leerho <le...@gmail.com> wrote:
> >
> > > Yes, I thought of that.  But it’s not like I’m being overwhelmed with
> > > requests to comment ... so far it has been only 3 or 4, and the
> requested
> > > changes have been minor.  I’m assuming that if there are no more
> > > substantive changes after this week that the document would be moved
> to the
> > > wiki archive, where, I presume, changes could still be made.
> > >
> > > I want to do the right thing here, so if you feel that the document
> would
> > > get much better feedback on an unrestricted gDoc site, I will set it
> up.
> > >
> > >
> > >
> > > On Mon, Feb 25, 2019 at 8:32 PM Jim Apple <jbapple@cloudera.com.invalid
> >
> > > wrote:
> > >
> > > > You could use a Google account that is not under Yahoo’s control,
> then
> > > let
> > > > anyone in the world add a comment, maybe.
> > > >
> > > > On Mon, Feb 25, 2019 at 3:26 PM leerho <le...@gmail.com> wrote:
> > > >
> > > > > Ken,
> > > > > Yahoo does not allow me to create a shared link outside our
> company,
> > > > except
> > > > > to individual email addresses.  So attempting to share it to the
> email
> > > > > general@incubator.apache.org may not work.  Nonetheless, several
> > > > > individuals were able to request access using their individual
> email
> > > > > accounts and I was able to add them.  I will try to add you using
> > > > > kenn@apache.org, but if that doesn't work, I may need a gmail or
> > > > > equivalent
> > > > > account for you.
> > > > >
> > > > > Lee.
> > > > >
> > > > >
> > > > > On Mon, Feb 25, 2019 at 2:59 PM Kenneth Knowles <ke...@apache.org>
> > > wrote:
> > > > >
> > > > > > I could not access that document. I suggest you need to turn on
> link
> > > > > > sharing.
> > > > > >
> > > > > > Kenn
> > > > > >
> > > > > > On Mon, Feb 25, 2019 at 12:00 PM leerho@gmail.com <
> leerho@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Try this link:
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> https://docs.google.com/document/d/19JKevzFQNcaLA51LFLUlP1hzdFDW7oDJrJO8N6weDv8/edit?usp=sharing
> > > > > > >
> > > > > > >
> > > > > > > On 2019/02/25 05:55:50, leerho <le...@gmail.com> wrote:
> > > > > > > > Yes I will try that tomorrow.
> > > > > > > >
> > > > > > > > On Sun, Feb 24, 2019 at 7:34 PM Kenneth Knowles <
> kenn@apache.org
> > > >
> > > > > > wrote:
> > > > > > > >
> > > > > > > > > Can you share the Google doc with the proposal? Per Ted's
> > > advice,
> > > > > we
> > > > > > > can
> > > > > > > > > iterate quickly there and move it to the wiki when it
> becomes a
> > > > bit
> > > > > > > more
> > > > > > > > > stable.
> > > > > > > > >
> > > > > > > > > Kenn
> > > > > > > > >
> > > > > > > > > On Fri, Feb 22, 2019 at 10:21 PM leerho@gmail.com <
> > > > > leerho@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Thanks for the offer.  i am a neophyte at this process
> and
> > > > email
> > > > > > > app!   I
> > > > > > > > > > could use a lot of help getting this off the ground!
> Also,
> > > I'm
> > > > > not
> > > > > > > sure
> > > > > > > > > > that Mr. Chen and Mr. Onofré have fully accepted taking
> this
> > > on
> > > > > :)
> > > > > > > > > >
> > > > > > > > > > Lee.
> > > > > > > > > >
> > > > > > > > > > On 2019/02/23 06:03:58, Kenneth Knowles <kenn@apache.org
> >
> > > > wrote:
> > > > > > > > > > > Nice.
> > > > > > > > > > >
> > > > > > > > > > > I would very much like to help mentor this project,
> though
> > > > you
> > > > > > > already
> > > > > > > > > > have
> > > > > > > > > > > a couple good ones.
> > > > > > > > > > >
> > > > > > > > > > > I concur with incubator as sponsoring entity.
> > > > > > > > > > >
> > > > > > > > > > > Kenn (VP Apache Beam)
> > > > > > > > > > >
> > > > > > > > > > > On Fri, Feb 22, 2019 at 9:45 PM leerho <
> leerho@gmail.com>
> > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > I didn't realize that this mail list does not accept
> PDF
> > > > > files,
> > > > > > > > > > apparently
> > > > > > > > > > > > only text.  So let me try one more time ... :)
> Please
> > > let
> > > > me
> > > > > > > know if
> > > > > > > > > > > > this works!
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > = Apache DataSketches Proposal[1] =
> > > > > > > > > > > >
> > > > > > > > > > > > == Abstract ==
> > > > > > > > > > > >
> > > > > > > > > > > > DataSketches.GitHub.io is an open source,
> > > high-performance
> > > > > > > library
> > > > > > > > > of
> > > > > > > > > > > > stochastic streaming algorithms commonly called
> > > "sketches"
> > > > in
> > > > > > the
> > > > > > > > > data
> > > > > > > > > > > > sciences. Sketches are small, stateful programs that
> > > > process
> > > > > > > massive
> > > > > > > > > > data
> > > > > > > > > > > > as a stream and can provide approximate answers, with
> > > > > > > mathematical
> > > > > > > > > > > > guarantees, to computationally difficult queries
> > > > > > > orders-of-magnitude
> > > > > > > > > > faster
> > > > > > > > > > > > than traditional, exact methods.
> > > > > > > > > > > >
> > > > > > > > > > > > This proposal is to move DataSketches to the Apache
> > > > Software
> > > > > > > > > > > > Foundation(ASF) transferring ownership of its
> copyright
> > > > > > > intellectual
> > > > > > > > > > > > property to the ASF.  Thereafter, DataSketches would
> be
> > > > > > > officially
> > > > > > > > > > known as
> > > > > > > > > > > > Apache DataSketches and its evolution and governance
> > > would
> > > > > come
> > > > > > > under
> > > > > > > > > > the
> > > > > > > > > > > > rules and guidance of the ASF.
> > > > > > > > > > > >
> > > > > > > > > > > > == Introduction ==
> > > > > > > > > > > >
> > > > > > > > > > > > The DataSketches library contains carefully crafted
> > > > > > > implementations
> > > > > > > > > of
> > > > > > > > > > > > sketch algorithms that meet rigorous standards of
> quality
> > > > and
> > > > > > > > > > performance
> > > > > > > > > > > > and provide capabilities required for large-scale
> > > > production
> > > > > > > systems
> > > > > > > > > > that
> > > > > > > > > > > > must process and analyze massive data. The
> DataSketches
> > > > core
> > > > > > > > > > repository is
> > > > > > > > > > > > written in Java with a parallel core repository
> written
> > > in
> > > > > C++
> > > > > > > that
> > > > > > > > > > > > includes Python wrappers. The DataSketches library
> also
> > > > > > includes
> > > > > > > > > > special
> > > > > > > > > > > > repositories for extending the core library for
> Apache
> > > Hive
> > > > > and
> > > > > > > > > Apache
> > > > > > > > > > Pig.
> > > > > > > > > > > > The sketches developed in the different languages
> share a
> > > > > > common
> > > > > > > > > binary
> > > > > > > > > > > > storage format so that sketches created and stored in
> > > Java,
> > > > > for
> > > > > > > > > > example,
> > > > > > > > > > > > > > can be fully used in C++, and vice versa.  Because
> the
> > > > stored
> > > > > > > sketch
> > > > > > > > > > > > "images" are just a "blob" of bytes (similar to
> picture
> > > > > > images),
> > > > > > > they
> > > > > > > > > > can
> > > > > > > > > > > > be shared across many different systems, languages
> and
> > > > > > platforms.
> > > > > > > > > > > >
> > > > > > > > > > > > > > The DataSketches documentation website, https://datasketches.github.io,
> > > > > > > > > > > > includes general tutorials, a comprehensive research
> > > > section
> > > > > > with
> > > > > > > > > > > > references to relevant academic papers, extensive
> > > examples
> > > > > for
> > > > > > > using
> > > > > > > > > > the
> > > > > > > > > > > > core library directly as well as examples for
> accessing
> > > the
> > > > > > > library
> > > > > > > > > in
> > > > > > > > > > > > Hive, Pig, and Apache Spark.
> > > > > > > > > > > >
> > > > > > > > > > > > The DataSketches library also includes a
> characterization
> > > > > > > repository
> > > > > > > > > > for
> > > > > > > > > > > > long running test programs that are used for studying
> > > > > accuracy
> > > > > > > and
> > > > > > > > > > > > performance of these sketches over wide ranges of
> input
> > > > > > > variables.
> > > > > > > > > The
> > > > > > > > > > data
> > > > > > > > > > > > produced by these programs is used for generating the
> > > many
> > > > > > > > > performance
> > > > > > > > > > > > plots contained in the documentation website and for
> > > > academic
> > > > > > > > > > > > publications.
> > > > > > > > > > > >
> > > > > > > > > > > > > > The code repositories used for production are versioned and published
> > > > > > > > > > > > > > to Maven Central at periodic intervals as the library evolves.
> > > > > > > > > > > >
> > > > > > > > > > > > The DataSketches library also includes several
> > > experimental
> > > > > > > > > > repositories
> > > > > > > > > > > > for use-cases outside the large-scale systems
> > > environments,
> > > > > > such
> > > > > > > as
> > > > > > > > > > > > sketches for mobile, IoT devices (Android),
> command-line
> > > > > access
> > > > > > > of
> > > > > > > > > the
> > > > > > > > > > > > sketch library, and an experimental repository for
> > > > > vector-based
> > > > > > > > > > sketches
> > > > > > > > > > > > that performs approximate Singular Value
> Decomposition
> > > > (SVD)
> > > > > > > analysis
> > > > > > > > > > that
> > > > > > > > > > > > could potentially be used in Machine Learning (ML)
> > > > > > applications.
> > > > > > > > > > > >
> > > > > > > > > > > > == Background ==
> > > > > > > > > > > >
> > > > > > > > > > > > > > The DataSketches library was started in 2012 as an internal Yahoo
> > > > > > > > > > > > > > project to
> > > > > > > > > > > > dramatically reduce time and resources required for
> > > > distinct
> > > > > > > (unique)
> > > > > > > > > > > > counting.  An extensive search on the Internet at the
> > > time
> > > > > > > yielded a
> > > > > > > > > > number
> > > > > > > > > > > > of theoretical papers on stochastic streaming
> algorithms
> > > > with
> > > > > > > > > > pseudocode
> > > > > > > > > > > > examples, but we did not find any usable open-source
> code
> > > > of
> > > > > > the
> > > > > > > > > > quality we
> > > > > > > > > > > > felt we needed for our internal production systems.
> So
> > > we
> > > > > > > started a
> > > > > > > > > > small
> > > > > > > > > > > > project (one person) to develop our own sketches
> working
> > > > > > directly
> > > > > > > > > from
> > > > > > > > > > > > published theoretical papers.
> > > > > > > > > > > >
> > > > > > > > > > > > The DataSketches library was designed from the start
> with
> > > > the
> > > > > > > > > > objective of
> > > > > > > > > > > > making these algorithms, usually only described in
> > > > > theoretical
> > > > > > > > > papers,
> > > > > > > > > > > > easily accessible to systems developers for use in
> our
> > > > > internal
> > > > > > > > > > production
> > > > > > > > > > > > systems. By necessity, the code had to be of the
> highest
> > > > > > quality
> > > > > > > and
> > > > > > > > > > > > thoroughly tested. The wide variety of our internal
> > > > > production
> > > > > > > > > systems
> > > > > > > > > > > > drove the requirement that the sketch
> implementations had
> > > > to
> > > > > > > have an
> > > > > > > > > > > > absolute minimum of external, run-time dependencies
> in
> > > > order
> > > > > to
> > > > > > > > > > simplify
> > > > > > > > > > > > integration and troubleshooting.
> > > > > > > > > > > >
> > > > > > > > > > > > Our internal experiments demonstrated dramatic
> positive
> > > > > impact
> > > > > > > on the
> > > > > > > > > > > > performance of our systems.  As a result, the
> > > DataSketches
> > > > > > > library
> > > > > > > > > > quickly
> > > > > > > > > > > > evolved to include different types of sketches for
> > > > different
> > > > > > > types of
> > > > > > > > > > > > > > queries, such as frequent-items (a.k.a. heavy-hitters) algorithms,
> > > > > > > > > > > > quantile/histogram algorithms, and weighted and
> > > unweighted
> > > > > > > sampling
> > > > > > > > > > > > algorithms.
> > > > > > > > > > > >
> > > > > > > > > > > > We quickly discovered that developing these sketch
> > > > algorithms
> > > > > > to
> > > > > > > be
> > > > > > > > > > truly
> > > > > > > > > > > > robust in production environments is quite difficult
> and
> > > > > > requires
> > > > > > > > > deep
> > > > > > > > > > > > understanding of the underlying mathematics and
> > > statistics
> > > > as
> > > > > > > well as
> > > > > > > > > > > > extensive experience in developing high quality code
> for
> > > > 24/7
> > > > > > > > > > production
> > > > > > > > > > > > systems. This is a difficult combination of skills
> for
> > > any
> > > > > one
> > > > > > > > > > organization
> > > > > > > > > > > > to collect and maintain over time. It became clear
> that
> > > > this
> > > > > > > > > technology
> > > > > > > > > > > > needed a community larger than Yahoo to evolve.  In
> > > > November,
> > > > > > > 2015,
> > > > > > > > > > this
> > > > > > > > > > > > factor, along with Yahoo’s strong experience and
> support
> > > of
> > > > > > open
> > > > > > > > > > source,
> > > > > > > > > > > > led to the decision to open source this technology
> under
> > > an
> > > > > > > Apache
> > > > > > > > > 2.0
> > > > > > > > > > > > license on GitHub. Since that time our community has
> > > > expanded
> > > > > > > > > > considerably
> > > > > > > > > > > > > > and the key contributors to this effort include leading research
> > > > > > > > > > > > scientists from a number of universities as well as
> > > > > > > practitioners and
> > > > > > > > > > > > researchers from a number of major corporations. The
> core
> > > > of
> > > > > > this
> > > > > > > > > > group is
> > > > > > > > > > > > very active as we meet weekly to discuss research
> > > > directions
> > > > > > and
> > > > > > > > > > > > engineering priorities.
> > > > > > > > > > > >
> > > > > > > > > > > > It is important to note that our internal systems at
> > > Yahoo
> > > > > use
> > > > > > > the
> > > > > > > > > > current
> > > > > > > > > > > > public GitHub open source DataSketches library and
> not an
> > > > > > > internal
> > > > > > > > > > version
> > > > > > > > > > > > of the code.
> > > > > > > > > > > >
> > > > > > > > > > > > The close collaboration of scientific research and
> > > > > engineering
> > > > > > > > > > development
> > > > > > > > > > > > experience with actual massive-data processing
> systems
> > > has
> > > > > also
> > > > > > > > > > produced
> > > > > > > > > > > > new research publications in the field of stochastic
> > > > > streaming
> > > > > > > > > > algorithms,
> > > > > > > > > > > > for example:
> > > > > > > > > > > >
> > > > > > > > > > > > * Daniel Anderson, Pryce Bevan, Kevin J. Lang, Edo
> > > Liberty,
> > > > > Lee
> > > > > > > > > > Rhodes, and
> > > > > > > > > > > > Justin Thaler. A high-performance algorithm for
> > > identifying
> > > > > > > frequent
> > > > > > > > > > items
> > > > > > > > > > > > in data streams. In ACM IMC 2017.
> > > > > > > > > > > >
> > > > > > > > > > > > * Anirban Dasgupta, Kevin J. Lang, Lee Rhodes, and
> Justin
> > > > > > > Thaler. A
> > > > > > > > > > > > > > framework for estimating stream expression cardinalities. In EDBT/ICDT
> > > > > > > > > > > > > > Proceedings ‘16, pages 6:1–6:17, 2016.
> > > > > > > > > > > >
> > > > > > > > > > > > * Mina Ghashami, Edo Liberty, Jeff M. Phillips.
> Efficient
> > > > > > > Frequent
> > > > > > > > > > > > Directions Algorithm for Sparse Matrices. In ACM
> SIGKDD
> > > > > > > Proceedings
> > > > > > > > > > ‘16,
> > > > > > > > > > > > pages 845-854, 2016.
> > > > > > > > > > > >
> > > > > > > > > > > > * Zohar S. Karnin, Kevin J. Lang, and Edo Liberty.
> > > Optimal
> > > > > > > quantile
> > > > > > > > > > > > approximation in streams. In IEEE FOCS Proceedings
> ‘16,
> > > > pages
> > > > > > > 71–78,
> > > > > > > > > > 2016.
> > > > > > > > > > > >
> > > > > > > > > > > > * Kevin J Lang. Back to the future: an even more
> nearly
> > > > > optimal
> > > > > > > > > > cardinality
> > > > > > > > > > > > estimation algorithm. arXiv preprint
> > > > > > > > > https://arxiv.org/abs/1708.06839,
> > > > > > > > > > > > 2017.
> > > > > > > > > > > >
> > > > > > > > > > > > * Edo Liberty. Simple and deterministic matrix
> sketching.
> > > > In
> > > > > > ACM
> > > > > > > KDD
> > > > > > > > > > > > > > Proceedings ‘13, pages 581–588, 2013.
> > > > > > > > > > > >
> > > > > > > > > > > > * Edo Liberty, Michael Mitzenmacher, Justin Thaler,
> and
> > > > > > Jonathan
> > > > > > > > > > Ullman.
> > > > > > > > > > > > Space lower bounds for itemset frequency sketches.
> In ACM
> > > > > PODS
> > > > > > > > > > Proceedings
> > > > > > > > > > > > ‘16, pages 441–454, 2016.
> > > > > > > > > > > >
> > > > > > > > > > > > * Michael Mitzenmacher, Thomas Steinke, and Justin
> > > Thaler.
> > > > > > > > > Hierarchical
> > > > > > > > > > > > heavy hitters with the space saving algorithm. In
> SIAM
> > > > ALENEX
> > > > > > > > > > Proceedings
> > > > > > > > > > > > ‘12, pages 160–174, 2012.
> > > > > > > > > > > >
> > > > > > > > > > > > == The Rationale for Sketches ==
> > > > > > > > > > > >
> > > > > > > > > > > > In the analysis of big data there are often problem
> > > queries
> > > > > > that
> > > > > > > > > don’t
> > > > > > > > > > > > scale because they require huge compute resources and
> > > time
> > > > to
> > > > > > > > > generate
> > > > > > > > > > > > exact results. Examples include count distinct,
> > > quantiles,
> > > > > most
> > > > > > > > > > frequent
> > > > > > > > > > > > items, joins, matrix computations, and graph
> analysis.
> > > > > > > > > > > >
> > > > > > > > > > > > If we can loosen the requirement of “exact” results
> from
> > > > our
> > > > > > > queries
> > > > > > > > > > and be
> > > > > > > > > > > > satisfied with approximate results, within some well
> > > > > understood
> > > > > > > > > bounds
> > > > > > > > > > of
> > > > > > > > > > > > error, there is an entire branch of mathematics and
> data
> > > > > > science
> > > > > > > that
> > > > > > > > > > has
> > > > > > > > > > > > evolved around developing algorithms that can produce
> > > > > > approximate
> > > > > > > > > > results
> > > > > > > > > > > > with mathematically well-defined error properties.
> > > > > > > > > > > >
> > > > > > > > > > > > > > Adding the requirements that these algorithms be small (compared to
> > > > > > > > > > > > > > the size of the input data), sublinear (the size of the sketch must
> > > > > > > > > > > > > > grow at a slower rate than the size of the input stream), streaming
> > > > > > > > > > > > > > (they can only touch each data item once), and mergeable (suitable
> > > > > > > > > > > > > > for distributed processing) defines a class of algorithms that can be
> > > > > > > > > > > > > > described as small, stochastic, streaming, sublinear, mergeable
> > > > > > > > > > > > > > algorithms, commonly called sketches (they also have other names, but
> > > > > > > > > > > > > > we will use the term sketches from here on).
> > > > > > > > > > > >
> > > > > > > > > > > > To be truly streaming and be able to process data in
> a
> > > > single
> > > > > > > pass,
> > > > > > > > > > > > sketches must make absolute minimum assumptions
> about the
> > > > > input
> > > > > > > > > stream.
> > > > > > > > > > > > This is critically important, as there is no “second
> > > > chance”
> > > > > to
> > > > > > > > > > process the
> > > > > > > > > > > > data.
> > > > > > > > > > > >
> > > > > > > > > > > > For example, sketches should not make assumptions
> about
> > > the
> > > > > > > order of
> > > > > > > > > > stream
> > > > > > > > > > > > items, the stream length, the dynamic range of
> values, or
> > > > the
> > > > > > > > > > distribution
> > > > > > > > > > > > of item occurrence frequencies. Sketches should be
> > > tolerant
> > > > > of
> > > > > > > NaNs,
> > > > > > > > > > Nulls
> > > > > > > > > > > > and empty objects. About the only thing that the
> sketch
> > > > needs
> > > > > > to
> > > > > > > know
> > > > > > > > > > about
> > > > > > > > > > > > the stream is how to extract items from it and what
> type
> > > > the
> > > > > > > item is,
> > > > > > > > > > e.g.,
> > > > > > > > > > > > is it a numeric value or a string.
> > > > > > > > > > > >
> > > > > > > > > > > > As far as the sketch is concerned, the input stream
> is a
> > > > > > > sequence of
> > > > > > > > > > items
> > > > > > > > > > > > in some unknown random order with unknown random
> values.
> > > > > > > > > > > >
> > > > > > > > > > > > > The sketch is essentially a complex state machine and, combined with the
> > > > > > > > > > > > > random input stream, defines a stochastic process. We
> then
> > > > > apply
> > > > > > > > > > > > probabilistic methods to interpret the states of the
> > > > > stochastic
> > > > > > > > > > process in
> > > > > > > > > > > > order to extract useful information about the input
> > > stream
> > > > > > > itself.
> > > > > > > > > The
> > > > > > > > > > > > resulting information will be approximate, but we
> also
> > > use
> > > > > > > additional
> > > > > > > > > > > > probabilistic methods to extract an estimate of the
> > > likely
> > > > > > > > > probability
> > > > > > > > > > > > distribution of error.
> > > > > > > > > > > >
> > > > > > > > > > > > There is a significant scientific contribution here
> that
> > > is
> > > > > > > defining
> > > > > > > > > > the
> > > > > > > > > > > > state machine, understanding the resulting stochastic
> > > > > process,
> > > > > > > > > > developing
> > > > > > > > > > > > the probabilistic methods, and proving
> mathematically
> > > that
> > > > > it
> > > > > > > all
> > > > > > > > > > works!
> > > > > > > > > > > > This is why the scientific contributors to this
> project
> > > > are a
> > > > > > > > > critical
> > > > > > > > > > and
> > > > > > > > > > > > strategic component to our success.  The development
> > > > > engineers
> > > > > > > > > > translate
> > > > > > > > > > > > the concepts of the proposed state machine and
> > > > probabilistic
> > > > > > > methods
> > > > > > > > > > into
> > > > > > > > > > > > production-quality code. Even more important, they
> work
> > > > > closely
> > > > > > > with
> > > > > > > > > > the
> > > > > > > > > > > > scientists, feeding back system and user
> requirements,
> > > > which
> > > > > > > leads
> > > > > > > > > not
> > > > > > > > > > only
> > > > > > > > > > > > to superior product design, but to new science as
> well.
> > > A
> > > > > > > number of
> > > > > > > > > > > > scientific papers our members have published (see
> above)
> > > > are a
> > > > > > > direct
> > > > > > > > > > result
> > > > > > > > > > > > of this close collaboration.
> > > > > > > > > > > >
> > > > > > > > > > > > Because sketches are small they can be processed
> > > extremely
> > > > > > fast,
> > > > > > > > > often
> > > > > > > > > > many
> > > > > > > > > > > > orders-of-magnitude faster than traditional exact
> > > > > computations.
> > > > > > > For
> > > > > > > > > > > > interactive queries there may not be other viable
> > > > > alternatives,
> > > > > > > and
> > > > > > > > > in
> > > > > > > > > > the
> > > > > > > > > > > > case of real-time analysis, sketches are the only
> known
> > > > > > solution.
> > > > > > > > > > > >
> > > > > > > > > > > > For any system that needs to extract useful
> information
> > > > from
> > > > > > > massive
> > > > > > > > > > > data,
> > > > > > > > > > > > sketches are essential tools that should be tightly
> > > > > integrated
> > > > > > > into
> > > > > > > > > the
> > > > > > > > > > > > system’s analysis capabilities. This technology has
> > > helped
> > > > > > Yahoo
> > > > > > > > > > > > successfully reduce data processing times from days
> to
> > > > hours
> > > > > or
> > > > > > > > > > minutes on
> > > > > > > > > > > > a number of its internal platforms and has enabled
> > > > subsecond
> > > > > > > queries
> > > > > > > > > on
> > > > > > > > > > > > real-time platforms that would have been infeasible
> > > without
> > > > > > > sketches.
> > > > > > > > > > > > >
> > > > > > > > > > > > > == The Rationale for Apache DataSketches ==
> > > > > > > > > > > > >
> > > > > > > > > > > > Other open source implementations of sketch
> algorithms
> > > can
> > > > be
> > > > > > > found
> > > > > > > > > on
> > > > > > > > > > the
> > > > > > > > > > > > Internet. However, we have not yet found any open
> source
> > > > > > > > > > implementations
> > > > > > > > > > > > that are as comprehensive, engineered with the
> quality
> > > > > required
> > > > > > > for
> > > > > > > > > > > > production systems, and with usable and guaranteed
> error
> > > > > > > properties.
> > > > > > > > > > Large
> > > > > > > > > > > > Internet companies, such as Google and Facebook, have
> > > > > published
> > > > > > > > > papers
> > > > > > > > > > on
> > > > > > > > > > > > > sketching; however, their implementations of their
> > > > published
> > > > > > > > > > algorithms are
> > > > > > > > > > > > proprietary and not available as open source.
> > > > > > > > > > > >
> > > > > > > > > > > > The DataSketches library already provides
> integrations
> > > > with a
> > > > > > > number
> > > > > > > > > of
> > > > > > > > > > > > major Apache data processing platforms such as Apache
> > > Hive,
> > > > > > > Apache
> > > > > > > > > Pig,
> > > > > > > > > > > > Apache Spark and Apache Druid, and is also integrated
> > > with
> > > > a
> > > > > > > number
> > > > > > > > > of
> > > > > > > > > > > > other open source data processing platforms such as
> > > Splice
> > > > > > > Machine,
> > > > > > > > > > GCHQ
> > > > > > > > > > > > Gaffer and PostgreSQL.
> > > > > > > > > > > >
> > > > > > > > > > > > We believe that having DataSketches as an Apache
> project
> > > > will
> > > > > > > provide
> > > > > > > > > > an
> > > > > > > > > > > > immediate, worthwhile, and substantial contribution
> to
> > > the
> > > > > open
> > > > > > > > > source
> > > > > > > > > > > > community, will have a better opportunity to provide
> a
> > > > > > meaningful
> > > > > > > > > > > > contribution to both the science and engineering of
> > > > sketching
> > > > > > > > > > algorithms,
> > > > > > > > > > > > and integrate with other Apache projects.  In
> addition,
> > > > this
> > > > > > is a
> > > > > > > > > > > > significant opportunity for Apache to be the "go-to"
> > > > > > destination
> > > > > > > for
> > > > > > > > > > users
> > > > > > > > > > > > that want to leverage this exciting technology.
> > > > > > > > > > > >
> > > > > > > > > > > > == Initial Goals ==
> > > > > > > > > > > >
> > > > > > > > > > > > We are breaking our initial goals into short-term
> (2-6
> > > > > months)
> > > > > > > and
> > > > > > > > > > > > > intermediate to long-term (6 months to 2 years):
> > > > > > > > > > > >
> > > > > > > > > > > > Our short-term goals include:
> > > > > > > > > > > >
> > > > > > > > > > > > * Understanding and adapting to the Apache
> development
> > > > > process
> > > > > > > and
> > > > > > > > > > > > structures.
> > > > > > > > > > > >
> > > > > > > > > > > > > * Start refactoring the codebase and move the code in the various
> > > > > > > > > > > > > DataSketches repositories to an Apache Git repository.
> > > > > > > > > > > >
> > > > > > > > > > > > * Continue development of new features, functions,
> and
> > > > fixes.
> > > > > > > > > > > >
> > > > > > > > > > > > * Specific sub-projects (e.g., C++ and Python) will
> > > > continue
> > > > > to
> > > > > > > be
> > > > > > > > > > > > developed and expanded.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > The intermediate to long-term goals include:
> > > > > > > > > > > >
> > > > > > > > > > > > * Completing the design and implementation of the C++
> > > > > sketches
> > > > > > to
> > > > > > > > > > > > complement what is already available in Java, and the
> > > > Python
> > > > > > > wrappers
> > > > > > > > > > of
> > > > > > > > > > > > those C++ sketches.
> > > > > > > > > > > >
> > > > > > > > > > > > * Expanding the C++ build framework to include
> Windows
> > > and
> > > > > the
> > > > > > > > > popular
> > > > > > > > > > > > Linux variants.
> > > > > > > > > > > >
> > > > > > > > > > > > * Continued engagement with the scientific research
> > > > community
> > > > > > on
> > > > > > > the
> > > > > > > > > > > > development of new algorithms for computationally
> > > difficult
> > > > > > > problems
> > > > > > > > > > that
> > > > > > > > > > > > heretofore have not had a sketching solution.
> > > > > > > > > > > >
> > > > > > > > > > > > == Current Status ==
> > > > > > > > > > > >
> > > > > > > > > > > > The DataSketches GitHub project has been quite
> > > successful.
> > > > > As
> > > > > > of
> > > > > > > > > this
> > > > > > > > > > > > writing (Feb, 2019) the number of downloads measured
> by
> > > the
> > > > > > Nexus
> > > > > > > > > > > > Repository Manager at https://oss.sonatype.org has
> grown
> > > > by
> > > > > > > nearly a
> > > > > > > > > > > > factor
> > > > > > > > > > > > of 10 over the past year to about 55 thousand per
> month.
> > > > The
> > > > > > > > > > > > DataSketches/sketches-core repository has about 560
> stars
> > > > and
> > > > > > 141
> > > > > > > > > > forks,
> > > > > > > > > > > > which is pretty good for a highly specialized
> library.
> > > > > > > > > > > >
> > > > > > > > > > > > === Development Practices ===
> > > > > > > > > > > >
> > > > > > > > > > > > ==== Source Control ====
> > > > > > > > > > > >
> > > > > > > > > > > > All of our developers have extensive experience with
> Git
> > > > > > version
> > > > > > > > > > control
> > > > > > > > > > > > and follow accepted practices for use of Pull
> Requests
> > > > (PRs),
> > > > > > > code
> > > > > > > > > > reviews
> > > > > > > > > > > > and commits to master, for example.
> > > > > > > > > > > >
> > > > > > > > > > > > ==== Testing ====
> > > > > > > > > > > >
> > > > > > > > > > > > > Sketches, by their nature, are probabilistic programs
> and
> > > > > don’t
> > > > > > > > > > necessarily
> > > > > > > > > > > > behave deterministically.  For some of the sketches
> we
> > > > > > > intentionally
> > > > > > > > > > insert
> > > > > > > > > > > > random noise into the code as this gives us the
> > > > mathematical
> > > > > > > > > properties
> > > > > > > > > > > > that we need to guarantee accuracy.  This can make
> the
> > > > > behavior
> > > > > > > of
> > > > > > > > > > these
> > > > > > > > > > > > algorithms quite unintuitive and provides significant
> > > > > > challenges
> > > > > > > to
> > > > > > > > > the
> > > > > > > > > > > > developer who wishes to test these algorithms for
> > > > > correctness.
> > > > > > > As a
> > > > > > > > > > result,
> > > > > > > > > > > > our testing strategy includes two major components:
> unit
> > > > > tests,
> > > > > > > and
> > > > > > > > > > > > characterization tests.
> > > > > > > > > > > >
> > > > > > > > > > > > ===== Unit Testing =====
> > > > > > > > > > > >
> > > > > > > > > > > > Our unit tests are primarily quick tests to make sure
> > > that
> > > > we
> > > > > > > > > exercise
> > > > > > > > > > all
> > > > > > > > > > > > critical paths in the code and that key branches are
> > > > executed
> > > > > > > > > > correctly. It
> > > > > > > > > > > > is important that they execute relatively fast as
> they
> > > are
> > > > > > > generally
> > > > > > > > > > run on
> > > > > > > > > > > > every code build. The sketches-core repository alone
> has
> > > > > about
> > > > > > 22
> > > > > > > > > > thousand
> > > > > > > > > > > > statements, over 1300 unit tests and code coverage of
> > > about
> > > > > > > 98.2% as
> > > > > > > > > > > > measured by Atlassian/Clover.  It is our goal for
> all of
> > > > our
> > > > > > code
> > > > > > > > > > > > repositories that are used in production that they
> have
> > > > code
> > > > > > > coverage
> > > > > > > > > > > > greater than 90%.
> > > > > > > > > > > >
> > > > > > > > > > > > ===== Characterization Testing =====
> > > > > > > > > > > >
> > > > > > > > > > > > In order to test the probabilistic methods that are
> used
> > > to
> > > > > > > interpret
> > > > > > > > > > the
> > > > > > > > > > > > stochastic behaviors of our sketches we have a
> separate
> > > > > > > > > > characterization
> > > > > > > > > > > > repository that is dedicated to this.  To measure
> > > accuracy,
> > > > > for
> > > > > > > > > > example,
> > > > > > > > > > > > requires running thousands of trials at each of many
> > > > > different
> > > > > > > points
> > > > > > > > > > along
> > > > > > > > > > > > the domain axis. Each trial compares its estimated
> > > results
> > > > > > > against a
> > > > > > > > > > known
> > > > > > > > > > > > exact result producing an error for that trial.
> These
> > > > error
> > > > > > > > > > measurements
> > > > > > > > > > > > are then fed into our Quantiles sketch to capture the
> > > > actual
> > > > > > > > > > distribution
> > > > > > > > > > > > of error at that point along the axis. We then select
> > > > > quantile
> > > > > > > > > contours
> > > > > > > > > > > > across all the distributions at points along the
> axis.
> > > > These
> > > > > > > > > contours
> > > > > > > > > > can
> > > > > > > > > > > > then be plotted to reveal the shape of the actual
> error
> > > > > > > distribution.
> > > > > > > > > > These
> > > > > > > > > > > > > distributions are not at all Gaussian; in fact, they
> can
> > > be
> > > > > > quite
> > > > > > > > > > complex.
> > > > > > > > > > > > Nonetheless, these distributions are then checked
> against
> > > > our
> > > > > > > > > > statistical
> > > > > > > > > > > > guarantees inherent to the specific sketch algorithm
> and
> > > > its
> > > > > > > > > > parameters.
> > > > > > > > > > > > There are many examples of these characterization
> error
> > > > > > > distributions
> > > > > > > > > > on
> > > > > > > > > > > > our website. The runtimes of these tests can be very
> long
> > > > and
> > > > > > can
> > > > > > > > > range
> > > > > > > > > > > > from many minutes to hours, and some can run for
> days.
> > > > > > > Currently, we
> > > > > > > > > > have
> > > > > > > > > > > > separate characterization repositories for Java and
> C++ /
> > > > > > Python.
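
The characterization procedure described above can be illustrated roughly as follows. This is a hedged Python sketch, not the project's characterization code: the function names, the toy KMV distinct-count estimator, and all parameters are hypothetical. Random floats stand in for the hashes of distinct items, and exact sorting is used to pick quantiles where the real pipeline feeds the errors into a Quantiles sketch.

```python
import heapq
import random


def kmv_estimate(hashes, k):
    """Estimate a distinct count from the k smallest hash values (toy KMV)."""
    smallest = heapq.nsmallest(k, hashes)
    if len(smallest) < k:
        return float(len(smallest))  # fewer than k items: the count is exact
    return (k - 1) / smallest[-1]    # smallest[-1] is the k-th minimum value


def characterize(stream_sizes, trials, k, quantile_ranks, seed=1):
    """Return {stream size: [relative error at each requested quantile rank]}."""
    rng = random.Random(seed)
    contours = {}
    for n in stream_sizes:                 # points along the domain axis
        errors = []
        for _ in range(trials):            # many independent trials per point
            hashes = [rng.random() for _ in range(n)]
            estimate = kmv_estimate(hashes, k)
            errors.append(estimate / n - 1.0)  # error against the exact result n
        errors.sort()                      # stand-in for a Quantiles sketch
        contours[n] = [
            errors[min(int(q * trials), trials - 1)] for q in quantile_ranks
        ]
    return contours


if __name__ == "__main__":
    result = characterize([1000, 5000], trials=100, k=256,
                          quantile_ranks=[0.05, 0.5, 0.95])
    for n, row in sorted(result.items()):
        print(n, ["%+.3f" % e for e in row])
```

Plotting each quantile rank across the stream sizes gives the error contours, which can then be compared against the theoretical bounds for the chosen k.
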
> > > > > > > > > > > >
> > > > > > > > > > > > It is our goal that we perform this characterization
> > > > analysis
> > > > > > > for all
> > > > > > > > > > of
> > > > > > > > > > > > our sketches.  By definition, the code that runs
> these
> > > > > > > > > characterization
> > > > > > > > > > > > tests is open-source so others can run these tests as
> > > well.
> > > > > We
> > > > > > > do
> > > > > > > > > not
> > > > > > > > > > have
> > > > > > > > > > > > formal releases of this code (because it is not
> > > production
> > > > > > code)
> > > > > > > and
> > > > > > > > > > it is
> > > > > > > > > > > > not published to Maven Central.
> > > > > > > > > > > >
> > > > > > > > > > > > === Meritocracy ===
> > > > > > > > > > > >
> > > > > > > > > > > > DataSketches was initially developed based on
> > > requirements
> > > > > > within
> > > > > > > > > > Yahoo. As
> > > > > > > > > > > > a project on GitHub, DataSketches has received
> > > > contributions
> > > > > > from
> > > > > > > > > > numerous
> > > > > > > > > > > > individual developers from around the world,
> dedicated
> > > > > research
> > > > > > > work
> > > > > > > > > > from
> > > > > > > > > > > > senior scientists at Amazon and Visa, and academic
> > > > > researchers
> > > > > > > from
> > > > > > > > > > > > Georgetown University, Princeton, and MIT.
> > > > > > > > > > > >
> > > > > > > > > > > > As a project under incubation, we are committed to
> > > > expanding
> > > > > > our
> > > > > > > > > > effort to
> > > > > > > > > > > > build an environment which supports a meritocracy.
> We are
> > > > > > > focused on
> > > > > > > > > > > > engaging the community and other related projects for
> > > > support
> > > > > > and
> > > > > > > > > > > > > contributions. Moreover, we are committed to ensuring that
> > > > > > contributors
> > > > > > > and
> > > > > > > > > > > > committers to DataSketches come from a broad mix of
> > > > > > organizations
> > > > > > > > > > through a
> > > > > > > > > > > > merit-based decision process during incubation. We
> > > believe
> > > > > > > strongly
> > > > > > > > > in
> > > > > > > > > > the
> > > > > > > > > > > > DataSketches premise that fulfills the concept of a
> well
> > > > > > > engineered
> > > > > > > > > and
> > > > > > > > > > > > scientifically rigorous library that implements these
> > > > > powerful
> > > > > > > > > > algorithms
> > > > > > > > > > > > and are committed to growing an inclusive community
> of
> > > > > > > DataSketches
> > > > > > > > > > > > contributors and users.
> > > > > > > > > > > >
> > > > > > > > > > > > === Community ===
> > > > > > > > > > > >
> > > > > > > > > > > > > Yahoo has a long history of active engagement in the
> > > Open
> > > > > > Source
> > > > > > > > > > > > community. Major projects include: Vespa.ai, Bullet,
> > > > Moloch,
> > > > > > > > > Panoptes,
> > > > > > > > > > > > Screwdriver.cd, Athenz, HaloDB, Maha, Mendel,
> > > > > > TensorFlowOnSpark,
> > > > > > > > > > gifshot,
> > > > > > > > > > > > fluxible, as well as the creation, contribution and
> > > > > incubation
> > > > > > of
> > > > > > > > > many
> > > > > > > > > > > > Apache projects such as Apache Hadoop, Pig,
> Bookkeeper,
> > > > > Oozie,
> > > > > > > > > > Zookeeper,
> > > > > > > > > > > > Omid, Pulsar, Traffic Server, Storm, Druid, and many
> > > more.
> > > > > > > > > > > >
> > > > > > > > > > > > > Every day, DataSketches is actively used by
> > > organizations
> > > > > and
> > > > > > > > > > > > institutions around the world for batch and stream
> > > > processing
> > > > > > of
> > > > > > > > > data.
> > > > > > > > > > We
> > > > > > > > > > > > believe acceptance will allow us to consolidate
> existing
> > > > > > > > > > > > DataSketches-related work, grow the DataSketches
> > > community,
> > > > > and
> > > > > > > > > deepen
> > > > > > > > > > > > connections between DataSketches and other open
> source
> > > > > > projects.
> > > > > > > > > > > >
> > > > > > > > > > > > === Introduction to the Core Developers &
> Contributors
> > > ===
> > > > > > > > > > > >
> > > > > > > > > > > > The core developers and contributors for
> DataSketches are
> > > > > from
> > > > > > > > > diverse
> > > > > > > > > > > > backgrounds, but primarily are scientists that love
> > > > > engineering
> > > > > > > and
> > > > > > > > > > > > engineers that love science. A large part of the
> value we
> > > > > bring
> > > > > > > comes
> > > > > > > > > > from
> > > > > > > > > > > > this synthesis.  These individuals have already
> > > contributed
> > > > > > > > > > substantially
> > > > > > > > > > > > to the code, algorithms, and/or mathematical proofs
> that
> > > > form
> > > > > > the
> > > > > > > > > > basis of
> > > > > > > > > > > > the library.
> > > > > > > > > > > >
> > > > > > > > > > > > > This core group also forms the Initial Committers with
> > > write
> > > > > > > > > > permissions to
> > > > > > > > > > > > > the repository. Those marked with (*) meet weekly to
> plan
> > > > the
> > > > > > > > > research
> > > > > > > > > > and
> > > > > > > > > > > > engineering direction of the project.
> > > > > > > > > > > >
> > > > > > > > > > > > ==== Scientists That Love Engineering ====
> > > > > > > > > > > >
> > > > > > > > > > > > * Eshcar Hillel: Senior Research Scientist, Yahoo
> Labs,
> > > > > Israel.
> > > > > > > > > > Interests:
> > > > > > > > > > > > distributed systems, scalable systems and platforms
> for
> > > big
> > > > > > data
> > > > > > > > > > > > processing, concurrent algorithms and data
> structures.
> > > > > > > > > > > >
> > > > > > > > > > > > * Kevin Lang: (*) Distinguished Research Scientist,
> Yahoo
> > > > > Labs,
> > > > > > > > > > Sunnyvale,
> > > > > > > > > > > > California. Interests: algorithms, theoretical and
> > > applied
> > > > > > > > > mathematics,
> > > > > > > > > > > > encoding and compression theory, theoretical and
> applied
> > > > > > > performance
> > > > > > > > > > > > optimization.
> > > > > > > > > > > >
> > > > > > > > > > > > * Edo Liberty: (*) Director of Research, Head of
> Amazon
> > > AI
> > > > > > Labs,
> > > > > > > Palo
> > > > > > > > > > Alto,
> > > > > > > > > > > > California. Manages the algorithms group at Amazon
> AI. We
> > > > > build
> > > > > > > > > > scalable
> > > > > > > > > > > > machine learning systems and algorithms which are
> used
> > > both
> > > > > > > > > internally
> > > > > > > > > > and
> > > > > > > > > > > > externally by customers of SageMaker, AWS's flagship
> > > > machine
> > > > > > > learning
> > > > > > > > > > > > platform.
> > > > > > > > > > > >
> > > > > > > > > > > > * Jon Malkin: (*) Senior Scientist, Yahoo Labs,
> > > Sunnyvale.
> > > > > > > Interests:
> > > > > > > > > > > > Computational advertising, machine learning, speech
> > > > > > recognition,
> > > > > > > > > > > > data-driven analysis, large scale experimentation,
> big
> > > > data,
> > > > > > > > > > stream/complex
> > > > > > > > > > > > > event processing.
> > > > > > > > > > > >
> > > > > > > > > > > > * Justin Thaler: (*) Assistant Professor, Department
> of
> > > > > > Computer
> > > > > > > > > > Science,
> > > > > > > > > > > > Georgetown University, Washington D.C. Interests:
> > > > algorithms
> > > > > > and
> > > > > > > > > > > > computational complexity, complexity theory, quantum
> > > > > > algorithms,
> > > > > > > > > > private
> > > > > > > > > > > > data analysis, and learning theory, developing
> efficient
> > > > > > > streaming
> > > > > > > > > and
> > > > > > > > > > > > > sketching algorithms.
> > > > > > > > > > > >
> > > > > > > > > > > > ==== Engineers That Love Science ====
> > > > > > > > > > > >
> > > > > > > > > > > > * Roman Leventov: Senior Software Engineer,
> Metamarkets
> > > /
> > > > > > Snap.
> > > > > > > > > > Interests:
> > > > > > > > > > > > design and implementation of data storing and data
> > > > processing
> > > > > > > > > > (distributed)
> > > > > > > > > > > > systems, performance optimization, CPU performance,
> > > > > mechanical
> > > > > > > > > > sympathy,
> > > > > > > > > > > > JVM performance, API design, databases, (concurrent)
> data
> > > > > > > structures,
> > > > > > > > > > > > memory management, garbage collection algorithms,
> > > language
> > > > > > > design and
> > > > > > > > > > > > runtimes (their tradeoffs), distributed systems
> (cloud)
> > > > > > > efficiency,
> > > > > > > > > > Linux,
> > > > > > > > > > > > code quality, code transformation, pure functional
> > > > > programming
> > > > > > > > > models,
> > > > > > > > > > > > Haskell.
> > > > > > > > > > > >
> > > > > > > > > > > > * Lee Rhodes: (*) Distinguished Architect, lead
> developer
> > > > and
> > > > > > > founder
> > > > > > > > > > of
> > > > > > > > > > > > the DataSketches project, Yahoo, Sunnyvale,
> California.
> > > > > > > Interests:
> > > > > > > > > > > > streaming algorithms, mathematics, computer science,
> high
> > > > > > > quality and
> > > > > > > > > > high
> > > > > > > > > > > > performance code for the analysis of massive data,
> > > bridging
> > > > > the
> > > > > > > > > divide
> > > > > > > > > > > > between theory and practice.
> > > > > > > > > > > >
> > > > > > > > > > > > * Alexander Saydakov: (*) Senior Software Engineer,
> > > Yahoo,
> > > > > > > Sunnyvale,
> > > > > > > > > > > > California. Interests: applied mathematics, computer
> > > > science,
> > > > > > big
> > > > > > > > > data,
> > > > > > > > > > > > distributed systems.
> > > > > > > > > > > >
> > > > > > > > > > > > === Introduction to Additional Interested
> Contributors
> > > ===
> > > > > > > > > > > >
> > > > > > > > > > > > These folks have been intermittently involved and
> > > > > contributed,
> > > > > > > but
> > > > > > > > > are
> > > > > > > > > > > > strong supporters of this project.
> > > > > > > > > > > >
> > > > > > > > > > > > * Frank Grimes: GitHub ID: frankgrimes97
> > > > > > > > > > > >
> > > > > > > > > > > > * Mina Ghashami: [mina.ghashami at gmail dot com]
> Ph.D.
> > > > > > Computer
> > > > > > > > > > Science,
> > > > > > > > > > > > Univ of Utah. Interests: Machine Learning, Data
> Mining,
> > > > > matrix
> > > > > > > > > > > > approximation, streaming algorithms, randomized
> linear
> > > > > algebra.
> > > > > > > > > > > >
> > > > > > > > > > > > * Christopher Musco: [christopher.musco at gmail dot
> com]
> > > > > Ph.D.
> > > > > > > > > > Computer
> > > > > > > > > > > > Science, Research Instructor, Princeton University.
> > > > > Interests:
> > > > > > > > > > algorithmic
> > > > > > > > > > > > foundations of data science and machine learning,
> > > efficient
> > > > > > > methods
> > > > > > > > > for
> > > > > > > > > > > > processing and understanding large datasets, often
> > > working
> > > > at
> > > > > > the
> > > > > > > > > > > > intersection of theoretical computer science,
> numerical
> > > > > linear
> > > > > > > > > > algebra, and
> > > > > > > > > > > > optimization.
> > > > > > > > > > > >
> > > > > > > > > > > > * Graham Cormode: [g.cormode at warwick.ac dot uk]
> Ph.D.
> > > > > > > Computer
> > > > > > > > > > Science,
> > > > > > > > > > > > Professor, Warwick University, Warwick, England.
> > > Interests:
> > > > > all
> > > > > > > > > > aspects of
> > > > > > > > > > > > the "data lifecycle", from data collection and
> cleaning,
> > > > > > through
> > > > > > > > > > mining and
> > > > > > > > > > > > analytics. (Professor Cormode is one of the world’s
> > > leading
> > > > > > > > > scientists
> > > > > > > > > > in
> > > > > > > > > > > > sketching algorithms)
> > > > > > > > > > > >
> > > > > > > > > > > > === Alignment ===
> > > > > > > > > > > >
> > > > > > > > > > > > The DataSketches library already provides
> integrations
> > > and
> > > > > > > example
> > > > > > > > > > code for
> > > > > > > > > > > > Apache Hive, Apache Pig, Apache Spark and is deeply
> > > > > integrated
> > > > > > > into
> > > > > > > > > > Apache
> > > > > > > > > > > > Druid.
> > > > > > > > > > > >
> > > > > > > > > > > > == Known Risks ==
> > > > > > > > > > > >
> > > > > > > > > > > > The following subsections are specific risks that
> have
> > > been
> > > > > > > > > identified
> > > > > > > > > > by
> > > > > > > > > > > > the ASF that need to be addressed.
> > > > > > > > > > > >
> > > > > > > > > > > > === Risk: Orphaned Products ===
> > > > > > > > > > > >
> > > > > > > > > > > > The DataSketches library is presently used by a
> number of
> > > > > > > > > > organizations,
> > > > > > > > > > > > from small startups to Fortune 100 companies, to
> > > construct
> > > > > > > production
> > > > > > > > > > > > pipelines that must process and analyze massive data.
> > > Yahoo
> > > > > > has a
> > > > > > > > > > long-term
> > > > > > > > > > > > commitment to continue to advance the DataSketches
> > > library;
> > > > > > > moreover,
> > > > > > > > > > > > DataSketches is seeing increasing interest,
> development,
> > > > and
> > > > > > > adoption
> > > > > > > > > > from
> > > > > > > > > > > > many diverse organizations from around the world.
> Due to
> > > > its
> > > > > > > growing
> > > > > > > > > > > > adoption, we feel it is quite unlikely that this
> project
> > > > > would
> > > > > > > become
> > > > > > > > > > > > orphaned.
> > > > > > > > > > > >
> > > > > > > > > > > > === Risk: Inexperience with Open Source ===
> > > > > > > > > > > >
> > > > > > > > > > > > Yahoo believes strongly in open source and the
> exchange
> > > of
> > > > > > > > > information
> > > > > > > > > > to
> > > > > > > > > > > > advance new ideas and work. Examples of this
> commitment
> > > are
> > > > > > > active
> > > > > > > > > open
> > > > > > > > > > > > source projects such as those mentioned above. With
> > > > > > > DataSketches, we
> > > > > > > > > > have
> > > > > > > > > > > > been increasingly open and forward-looking; we have
> > > > > published a
> > > > > > > > > number
> > > > > > > > > > of
> > > > > > > > > > > > papers about breakthrough developments in the
> science of
> > > > > > > streaming
> > > > > > > > > > > > algorithms (mentioned above) that also reference the
> > > > > > DataSketches
> > > > > > > > > > library.
> > > > > > > > > > > > Our submission to the Apache Software Foundation is a
> > > > logical
> > > > > > > > > > extension of
> > > > > > > > > > > > our commitment to open source software.
> > > > > > > > > > > >
> > > > > > > > > > > > Key committers at Yahoo with strong open source
> > > backgrounds
> > > > > > > include
> > > > > > > > > > Aaron
> > > > > > > > > > > > Gresch, Alan Carroll, Alessandro Bellina, Anastasia
> > > > > Braginsky,
> > > > > > > > > Andrews
> > > > > > > > > > > > Sahaya Albert, Arun S A G, Atul Mohan, Brad McMillen,
> > > Bryan
> > > > > > Call,
> > > > > > > > > Daryn
> > > > > > > > > > > > Sharp, Dav Glass, David Carlin, Derek Dagit, Eric
> Payne,
> > > > > Eshcar
> > > > > > > > > Hillel,
> > > > > > > > > > > > Ethan Li, Fei Deng, Francis Christopher Liu,
> Francisco
> > > > > > > > > Perez-Sorrosal,
> > > > > > > > > > Gil
> > > > > > > > > > > > > Yehuda, Govind Menon, Hang Yang, Jacob Estelle, Jai
> > > Asher,
> > > > > > James
> > > > > > > > > > Penick,
> > > > > > > > > > > > Jason Kenny, Jay Pipes, Jim Rollenhagen, Joe
> Francis, Jon
> > > > > > Eagles,
> > > > > > > > > > Kihwal
> > > > > > > > > > > > Lee, Kishorkumar Patil, Koji Noguchi, Kuhu Shukla,
> > > Michael
> > > > > > > Trelinski,
> > > > > > > > > > > > Mithun Radhakrishnan, Nathan Roberts, Ohad Shacham,
> Olga
> > > L.
> > > > > > > > > Natkovich,
> > > > > > > > > > > > Parth Kamlesh Gandhi, Rajan Dhabalia, Rohini
> Palaniswamy,
> > > > > Ruby
> > > > > > > Loo,
> > > > > > > > > > Ryan
> > > > > > > > > > > > Bridges, Sanket Chintapalli, Satish Subhashrao
> Saley, Shu
> > > > Kit
> > > > > > > Chan,
> > > > > > > > > Sri
> > > > > > > > > > > > Harsha Mekala, Susan Hinrichs, Yonatan Gottesman, and
> > > many
> > > > > > more.
> > > > > > > > > > > >
> > > > > > > > > > > > > All of our core developers are committed to learning about the Apache
> > > > > > > > > > > > > process and to giving back to the community.
> > > > > > > > > > > >
> > > > > > > > > > > > === Risk: Homogeneous Developers ===
> > > > > > > > > > > >
> > > > > > > > > > > > The majority of committers in this proposal belong to
> > > Yahoo
> > > > > due
> > > > > > > to
> > > > > > > > > the
> > > > > > > > > > fact
> > > > > > > > > > > > that DataSketches has emerged from an internal Yahoo
> > > > project.
> > > > > > > This
> > > > > > > > > > proposal
> > > > > > > > > > > > also includes developers and contributors from other
> > > > > companies,
> > > > > > > and
> > > > > > > > > > who are
> > > > > > > > > > > > actively involved with other Apache projects, such as
> > > > Druid.
> > > > > > We
> > > > > > > > > > expect our
> > > > > > > > > > > > entry into incubation will allow us to expand the
> number
> > > of
> > > > > > > > > > individuals and
> > > > > > > > > > > > organizations participating in DataSketches
> development.
> > > > > > > > > > > >
> > > > > > > > > > > > === Risk: Reliance on Salaried Developers ===
> > > > > > > > > > > >
> > > > > > > > > > > > > Because the DataSketches library originated within
> > > > > > > > > > > > > Yahoo, it has been developed primarily by salaried
> > > > > > > > > > > > > Yahoo developers, and we expect that to continue to
> > > > > > > > > > > > > be the case in the near term. However, since we
> > > > > > > > > > > > > placed this library into open source we have had a
> > > > > > > > > > > > > number of significant contributions from engineers
> > > > > > > > > > > > > and scientists outside of Yahoo. We expect our
> > > > > > > > > > > > > reliance on Yahoo salaried developers will decrease
> > > > > > > > > > > > > over time. Nonetheless, Yahoo is committed to
> > > > > > > > > > > > > continuing its strong support of this important
> > > > > > > > > > > > > project.
> > > > > > > > > > > >
> > > > > > > > > > > > > === Risk: Lack of Relationship to other Apache Products ===
> > > > > > > > > > > >
> > > > > > > > > > > > > DataSketches already directly interoperates with or
> > > > > > > > > > > > > utilizes several existing Apache projects.
> > > > > > > > > > > >
> > > > > > > > > > > > * Build
> > > > > > > > > > > >    * Apache Maven
> > > > > > > > > > > >
> > > > > > > > > > > > > * Integrations and adaptors for the following
> > > > > > > > > > > > >   projects naturally have them as dependencies
> > > > > > > > > > > >    * Apache Hive
> > > > > > > > > > > >    * Apache Pig
> > > > > > > > > > > >    * Apache Druid
> > > > > > > > > > > >    * Apache Spark
> > > > > > > > > > > >
> > > > > > > > > > > > > * Additional dependencies for the above integrations
> > > > > > > > > > > > >   and adaptors include
> > > > > > > > > > > >    * Apache Hadoop
> > > > > > > > > > > >    * Apache Commons (Math)
> > > > > > > > > > > >
> > > > > > > > > > > > > There is no other Apache project that we are aware
> > > > > > > > > > > > > of that duplicates the functionality of the
> > > > > > > > > > > > > DataSketches library.
> > > > > > > > > > > >
> > > > > > > > > > > > > === Risk: An Excessive Fascination with the Apache Brand ===
> > > > > > > > > > > >
> > > > > > > > > > > > > With this proposal we are not seeking attention or
> > > > > > > > > > > > > publicity. Rather, we firmly believe in the
> > > > > > > > > > > > > DataSketches library and concept and the ability to
> > > > > > > > > > > > > make the DataSketches library a powerful, yet
> > > > > > > > > > > > > simple-to-use toolkit for data processing. While the
> > > > > > > > > > > > > DataSketches library has been open source, we
> > > > > > > > > > > > > believe putting code on GitHub can only go so far.
> > > > > > > > > > > > > We see the Apache community, processes, and mission
> > > > > > > > > > > > > as critical for ensuring the DataSketches library is
> > > > > > > > > > > > > truly community-driven, positively impactful, and
> > > > > > > > > > > > > innovative open source software. While Yahoo has
> > > > > > > > > > > > > taken a number of steps to advance its various open
> > > > > > > > > > > > > source projects, we believe the DataSketches library
> > > > > > > > > > > > > project is a great fit for the Apache Software
> > > > > > > > > > > > > Foundation due to its focus on data processing and
> > > > > > > > > > > > > its relationships to existing ASF projects.
> > > > > > > > > > > >
> > > > > > > > > > > > === Risk: Cryptography ===
> > > > > > > > > > > >
> > > > > > > > > > > > > DataSketches does not contain any cryptographic code
> > > > > > > > > > > > > and is not a cryptographic product.
> > > > > > > > > > > >
> > > > > > > > > > > > == Documentation ==
> > > > > > > > > > > >
> > > > > > > > > > > > > The following documentation is relevant to this
> > > > > > > > > > > > > proposal. Relevant portions of the documentation
> > > > > > > > > > > > > will be contributed to the Apache DataSketches
> > > > > > > > > > > > > project.
> > > > > > > > > > > >
> > > > > > > > > > > > > * DataSketches website: https://datasketches.github.io
> > > > > > > > > > > > >
> > > > > > > > > > > > > * DataSketches website repository:
> > > > > > > > > > > > >   https://github.com/DataSketches/DataSketches.github.io
> > > > > > > > > > > >
> > > > > > > > > > > > > We will need an Apache website for this
> > > > > > > > > > > > > documentation, similar to
> > > > > > > > > > > >
> > > > > > > > > > > > * https://datasketches.apache.org
> > > > > > > > > > > >
> > > > > > > > > > > > == Initial Source ==
> > > > > > > > > > > >
> > > > > > > > > > > > > The initial source for DataSketches which we will
> > > > > > > > > > > > > submit to the Apache Foundation will include a
> > > > > > > > > > > > > number of repositories which are currently hosted
> > > > > > > > > > > > > under the GitHub.com/datasketches organization:
> > > > > > > > > > > >
> > > > > > > > > > > > All github.com/datasketches repositories including:
> > > > > > > > > > > >
> > > > > > > > > > > > * Java
> > > > > > > > > > > > >    * sketches-core: This repository has the core
> > > > > > > > > > > > >      sketching classes, which are leveraged by some
> > > > > > > > > > > > >      of the other repositories. This repository has
> > > > > > > > > > > > >      no external dependencies outside of the
> > > > > > > > > > > > >      DataSketches/memory repository, Java and TestNG
> > > > > > > > > > > > >      for unit tests. This code is versioned and the
> > > > > > > > > > > > >      latest release can be obtained from Maven
> > > > > > > > > > > > >      Central.
> > > > > > > > > > > > >    * memory: Low level, high-performance memory
> > > > > > > > > > > > >      data-structure management primarily for
> > > > > > > > > > > > >      off-heap.
> > > > > > > > > > > > >    * sketches-android: This is a new repository
> > > > > > > > > > > > >      dedicated to sketches designed to be run in a
> > > > > > > > > > > > >      mobile client, such as a cell phone. It is
> > > > > > > > > > > > >      still in development and should be considered
> > > > > > > > > > > > >      experimental.
> > > > > > > > > > > > >    * sketches-hive: This repository contains Hive
> > > > > > > > > > > > >      UDFs and UDAFs for use within Hadoop grid
> > > > > > > > > > > > >      environments. This code has dependencies on
> > > > > > > > > > > > >      sketches-core as well as Hadoop and Hive. Users
> > > > > > > > > > > > >      of this code are advised to use Maven to bring
> > > > > > > > > > > > >      in all the required dependencies. This code is
> > > > > > > > > > > > >      versioned and the latest release can be
> > > > > > > > > > > > >      obtained from Maven Central.
> > > > > > > > > > > > >    * sketches-pig: This repository contains Pig User
> > > > > > > > > > > > >      Defined Functions (UDF) for use within Hadoop
> > > > > > > > > > > > >      grid environments. This code has dependencies
> > > > > > > > > > > > >      on sketches-core as well as Hadoop and Pig.
> > > > > > > > > > > > >      Users of this code are advised to use Maven to
> > > > > > > > > > > > >      bring in all the required dependencies. This
> > > > > > > > > > > > >      code is versioned and the latest release can be
> > > > > > > > > > > > >      obtained from Maven Central.
> > > > > > > > > > > > >    * sketches-vector: This is a new repository
> > > > > > > > > > > > >      dedicated to sketches for vector and matrix
> > > > > > > > > > > > >      operations. It is still somewhat experimental.
> > > > > > > > > > > > >    * characterization: This relatively new
> > > > > > > > > > > > >      repository is for code that we use to
> > > > > > > > > > > > >      characterize the accuracy and speed performance
> > > > > > > > > > > > >      of the sketches in the library and is
> > > > > > > > > > > > >      constantly being updated. Examples of the job
> > > > > > > > > > > > >      command files used for various tests can be
> > > > > > > > > > > > >      found in the src/main/resources directory. Some
> > > > > > > > > > > > >      of these tests can run for hours depending on
> > > > > > > > > > > > >      their configuration.
> > > > > > > > > > > > >    * experimental: This repository is an
> > > > > > > > > > > > >      experimental staging area for code that will
> > > > > > > > > > > > >      eventually end up in another repository. This
> > > > > > > > > > > > >      code is not versioned and not registered with
> > > > > > > > > > > > >      Maven Central.
> > > > > > > > > > > > >    * sketches-misc: Demos and other code not related
> > > > > > > > > > > > >      to production deployment
> > > > > > > > > > > >
> > > > > > > > > > > > * C++ and Python
> > > > > > > > > > > > >    * sketches-core-cpp: This is the C++/Python
> > > > > > > > > > > > >      companion to the Java sketches-core. These
> > > > > > > > > > > > >      implementations are binary compatible with
> > > > > > > > > > > > >      their counterparts in Java. In other words, a
> > > > > > > > > > > > >      sketch created and stored in C++ can be opened
> > > > > > > > > > > > >      and read in Java and vice versa. This site also
> > > > > > > > > > > > >      has our Python adaptors that basically wrap the
> > > > > > > > > > > > >      C++ implementations, making the
> > > > > > > > > > > > >      high-performance C++ implementations available
> > > > > > > > > > > > >      from Python.
> > > > > > > > > > > > >    * sketches-postgres: This site provides the
> > > > > > > > > > > > >      Postgres-specific adaptors that wrap the C++
> > > > > > > > > > > > >      implementations, making them available to
> > > > > > > > > > > > >      Postgres database users.
> > > > > > > > > > > > >    * characterization-cpp: This is the C++/Python
> > > > > > > > > > > > >      companion to the Java characterization
> > > > > > > > > > > > >      repository.
> > > > > > > > > > > > >    * experimental-cpp: This repository is an
> > > > > > > > > > > > >      experimental staging area for C++ code that
> > > > > > > > > > > > >      will eventually end up in another repository.
> > > > > > > > > > > >
> > > > > > > > > > > > * Command-Line Tools
> > > > > > > > > > > >    * sketches-cmd
> > > > > > > > > > > >    * homebrew-sketches
> > > > > > > > > > > >    * homebrew-sketches-cmd
> > > > > > > > > > > >
> > > > > > > > > > > > > These projects have always been Apache 2.0 licensed.
> > > > > > > > > > > > > We intend to bundle all of these repositories since
> > > > > > > > > > > > > they are all complementary and should be maintained
> > > > > > > > > > > > > in one project. Prior to our submission, we will
> > > > > > > > > > > > > combine all of these projects into a new git
> > > > > > > > > > > > > repository.
> > > > > > > > > > > >
> > > > > > > > > > > > > == Source and Intellectual Property Submission Plan ==
> > > > > > > > > > > > >
> > > > > > > > > > > > > Contributors to the DataSketches project have also
> > > > > > > > > > > > > signed the Yahoo Individual Contributor License
> > > > > > > > > > > > > Agreement (https://yahoocla.herokuapp.com/) in order
> > > > > > > > > > > > > to contribute to the project.
> > > > > > > > > > > >
> > > > > > > > > > > > > With respect to trademark rights, Yahoo does not
> > > > > > > > > > > > > hold a trademark on the phrase “DataSketches.” Based
> > > > > > > > > > > > > on feedback and guidance we receive during the
> > > > > > > > > > > > > incubation process, we are open to renaming the
> > > > > > > > > > > > > project if necessary for trademark or other
> > > > > > > > > > > > > concerns, but we would prefer not to have to do
> > > > > > > > > > > > > that.
> > > > > > > > > > > >
> > > > > > > > > > > > == External Dependencies ==
> > > > > > > > > > > >
> > > > > > > > > > > > > All external dependencies are licensed under an
> > > > > > > > > > > > > Apache 2.0 or Apache-compatible license. As we grow
> > > > > > > > > > > > > the DataSketches community we will configure our
> > > > > > > > > > > > > build process to require and validate that all
> > > > > > > > > > > > > contributions and dependencies are licensed under
> > > > > > > > > > > > > the Apache 2.0 license or an Apache-compatible
> > > > > > > > > > > > > license.
> > > > > > > > > > > >
> > > > > > > > > > > > == Required Resources ==
> > > > > > > > > > > >
> > > > > > > > > > > > === Mailing Lists ===
> > > > > > > > > > > >
> > > > > > > > > > > > > We currently use a mix of mailing lists. We will
> > > > > > > > > > > > > migrate our existing mailing lists to the following:
> > > > > > > > > > > >
> > > > > > > > > > > > * dev@datasketches.incubator.apache.org
> > > > > > > > > > > >
> > > > > > > > > > > > * user@datasketches.incubator.apache.org
> > > > > > > > > > > >
> > > > > > > > > > > > * private@datasketches.incubator.apache.org
> > > > > > > > > > > >
> > > > > > > > > > > > * commits@datasketches.incubator.apache.org
> > > > > > > > > > > >
> > > > > > > > > > > > === Source Control ===
> > > > > > > > > > > >
> > > > > > > > > > > > > The DataSketches team currently uses Git and would
> > > > > > > > > > > > > like to continue to do so. We request a Git
> > > > > > > > > > > > > repository for DataSketches with mirroring to GitHub
> > > > > > > > > > > > > enabled, similar to the following:
> > > > > > > > > > > > >
> > > > > > > > > > > > > * https://github.com/apache/incubator-datasketches.git
> > > > > > > > > > > >
> > > > > > > > > > > > === Issue Tracking ===
> > > > > > > > > > > >
> > > > > > > > > > > > > We request the creation of an Apache-hosted JIRA.
> > > > > > > > > > > > > The DataSketches project is currently using the
> > > > > > > > > > > > > public GitHub issue tracker and the public Google
> > > > > > > > > > > > > Groups forum/sketches-user for issue tracking and
> > > > > > > > > > > > > discussions. We will migrate and combine the content
> > > > > > > > > > > > > of these two sources into the Apache JIRA.
> > > > > > > > > > > >
> > > > > > > > > > > > Proposed Jira ID: DATASKETCHES
> > > > > > > > > > > >
> > > > > > > > > > > > == Initial Committers ==
> > > > > > > > > > > >
> > > > > > > > > > > > > The following individuals have been extremely active
> > > > > > > > > > > > > in our community and should have write (commit)
> > > > > > > > > > > > > permissions to the repository.
> > > > > > > > > > > >
> > > > > > > > > > > > > * Eshcar Hillel [eshcar at verizonmedia dot com]
> > > > > > > > > > > > >
> > > > > > > > > > > > > * Kevin Lang [langk at verizonmedia dot com]
> > > > > > > > > > > > >
> > > > > > > > > > > > > * Roman Leventov [roman.leventov at c.metamarkets dot com]
> > > > > > > > > > > > >
> > > > > > > > > > > > > * Edo Liberty [libertye at amazon dot com]
> > > > > > > > > > > > >
> > > > > > > > > > > > > * Jon Malkin [jmalkin at verizonmedia dot com]
> > > > > > > > > > > > >
> > > > > > > > > > > > > * Lee Rhodes [lrhodes at verizonmedia dot com] &
> > > > > > > > > > > > >   [leerho at gmail dot com]
> > > > > > > > > > > > >
> > > > > > > > > > > > > * Alexander Saydakov [saydakov at verizonmedia dot com]
> > > > > > > > > > > > >
> > > > > > > > > > > > > * Justin Thaler [justin.thaler at georgetown dot edu]
> > > > > > > > > > > >
> > > > > > > > > > > > == Affiliations ==
> > > > > > > > > > > >
> > > > > > > > > > > > > The initial committers are from four organizations:
> > > > > > > > > > > > > Yahoo, Amazon, Georgetown University, and
> > > > > > > > > > > > > Metamarkets/Snap.
> > > > > > > > > > > >
> > > > > > > > > > > > === Champion ===
> > > > > > > > > > > > (Recommended to me: )
> > > > > > > > > > > >
> > > > > > > > > > > > > Liang Chen, Vice President of Apache CarbonData,
> > > > > > > > > > > > > [chenliang613 at apache dot org]
> > > > > > > > > > > > > Jean-Baptiste Onofré, [jb at nanthrax dot net]
> > > > > > > > > > > >
> > > > > > > > > > > > === Nominated Mentors ===
> > > > > > > > > > > > (Recommended to me: )
> > > > > > > > > > > >
> > > > > > > > > > > > > Liang Chen, Vice President of Apache CarbonData,
> > > > > > > > > > > > > [chenliang613 at apache dot org]
> > > > > > > > > > > > > Jean-Baptiste Onofré, jb at nanthrax dot net
> > > > > > > > > > > > > Gil Yehuda, gyehuda at verizonmedia dot com
> > > > > > > > > > > >
> > > > > > > > > > > > === Sponsoring Entity ===
> > > > > > > > > > > >
> > > > > > > > > > > > > * The Apache Incubator    **** This is our 1st choice ****
> > > > > > > > > > > > >
> > > > > > > > > > > > > * Apache Druid. The incubating Apache Druid project
> > > > > > > > > > > > >   might also be a logical sponsor. However,
> > > > > > > > > > > > >   DataSketches has applications in many areas of
> > > > > > > > > > > > >   computing outside of Druid so our preference and
> > > > > > > > > > > > >   recommendation is that DataSketches would
> > > > > > > > > > > > >   ultimately be a top-level Apache project.
> > > > > > > > > > > >
> > > > > > > > > > > > ________________
> > > > > > > > > > > > > [1] In 2017 Verizon acquired Yahoo and merged it
> > > > > > > > > > > > > with previously acquired AOL. The merged entity was
> > > > > > > > > > > > > originally called Oath, Inc., but has recently been
> > > > > > > > > > > > > renamed Verizon Media, Inc., a wholly-owned
> > > > > > > > > > > > > subsidiary of Verizon, Inc.  Since Yahoo is the more
> > > > > > > > > > > > > recognized name, references in this document to
> > > > > > > > > > > > > Yahoo are also references to Verizon Media, Inc.
> > > > > > > > > > > >
> > > > > > > > > > > > > On Fri, Feb 22, 2019 at 9:35 PM Kenneth Knowles <kenn@apache.org> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > > The subject line has me interested already. Follow
> > > > > > > > > > > > > > examples like this maybe?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 1. https://lists.apache.org/thread.html/a5db74cc9e5ae89b3bfa5f4b07bfcc18dae84b7098232fb897cd47b7@%3Cgeneral.incubator.apache.org%3E
> > > > > > > > > > > > > > 2. https://lists.apache.org/thread.html/5a7f6a218b11a1cac61fbd53f4c995fd7716f8ad3751cf9f171ebd57@%3Cgeneral.incubator.apache.org%3E
> > > > > > > > > > > > >
> > > > > > > > > > > > > Kenn
> > > > > > > > > > > > >
> > > > > > > > > > > > > > On Fri, Feb 22, 2019 at 8:05 PM leerho <leerho@gmail.com> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > I'll try again ... :)
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Fri, Feb 22, 2019 at 8:00 PM Ted Dunning <ted.dunning@gmail.com> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >> It didn't make it again
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > > >> On Fri, Feb 22, 2019, 8:35 PM leerho <leerho@gmail.com> wrote:
> > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > >> > I'm not sure the attached document made it through.
> > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > >> > On Fri, Feb 22, 2019 at 7:28 PM leerho <leerho@gmail.com> wrote:
> > > > > > > > > > > > > >> >
> > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > >> >
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > >
> > > > > >
> ---------------------------------------------------------------------
> > > > > > > > > > > > > > To unsubscribe, e-mail:
> > > > > > > general-unsubscribe@incubator.apache.org
> > > > > > > > > > > > > > For additional commands, e-mail:
> > > > > > > > > general-help@incubator.apache.org
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > --
> > > > > > > > From my cell phone.
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > > --
> > > From my cell phone.
> > >
> >
>
>
>

DataSketches Proposal - Google Docs Link

Posted by le...@gmail.com, le...@gmail.com.

As primary author, can I be given the ability to directly edit? 
On 2019/02/26 05:37:22, Kenneth Knowles <ke...@apache.org> wrote: 
> It isn't too much work, so I've done it:
> https://s.apache.org/datasketches-proposal-draft
> 
> Kenn
> 
> On Mon, Feb 25, 2019 at 9:31 PM leerho <le...@gmail.com> wrote:
> 
> > Yes, I thought of that.  But it’s not like I’m being overwhelmed with
> > requests to comment ... so far it has been only 3 or 4, and the requested
> > changes have been minor.  I’m assuming that if there are no more
> > substantive changes after this week that the document would be moved to the
> > wiki archive, where, I presume, changes could still be made.
> >
> > I want to do the right thing here, so if you feel that the document would
> > get much better feedback on an unrestricted gDoc site, I will set it up.
> >
> >
> >
> > On Mon, Feb 25, 2019 at 8:32 PM Jim Apple <jb...@cloudera.com.invalid>
> > wrote:
> >
> > > You could use a Google account that is not under Yahoo’s control, then
> > > let anyone in the world add a comment, maybe.
> > >
> > > On Mon, Feb 25, 2019 at 3:26 PM leerho <le...@gmail.com> wrote:
> > >
> > > > Kenn,
> > > > Yahoo does not allow me to create a shared link outside our company,
> > > > except to individual email addresses.  So attempting to share it to the
> > > > email general@incubator.apache.org may not work.  Nonetheless, several
> > > > individuals were able to request access using their individual email
> > > > accounts and I was able to add them.  I will try to add you using
> > > > kenn@apache.org, but if that doesn't work, I may need a gmail or
> > > > equivalent account for you.
> > > >
> > > > Lee.
> > > >
> > > >
> > > > On Mon, Feb 25, 2019 at 2:59 PM Kenneth Knowles <ke...@apache.org> wrote:
> > > >
> > > > > I could not access that document. I suggest you need to turn on link
> > > > > sharing.
> > > > >
> > > > > Kenn
> > > > >
> > > > > On Mon, Feb 25, 2019 at 12:00 PM leerho@gmail.com <le...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Try this link:
> > > > > > https://docs.google.com/document/d/19JKevzFQNcaLA51LFLUlP1hzdFDW7oDJrJO8N6weDv8/edit?usp=sharing
> > > > > >
> > > > > >
> > > > > > On 2019/02/25 05:55:50, leerho <le...@gmail.com> wrote:
> > > > > > > Yes I will try that tomorrow.
> > > > > > >
> > > > > > > On Sun, Feb 24, 2019 at 7:34 PM Kenneth Knowles <kenn@apache.org> wrote:
> > > > > > >
> > > > > > > > Can you share the Google doc with the proposal? Per Ted's advice,
> > > > > > > > we can iterate quickly there and move it to the wiki when it
> > > > > > > > becomes a bit more stable.
> > > > > > > >
> > > > > > > > Kenn
> > > > > > > >
> > > > > > > > On Fri, Feb 22, 2019 at 10:21 PM leerho@gmail.com <leerho@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Thanks for the offer.  I am a neophyte at this process and this
> > > > > > > > > email app!  I could use a lot of help getting this off the
> > > > > > > > > ground!  Also, I'm not sure that Mr. Chen and Mr. Onofré have
> > > > > > > > > fully accepted taking this on :)
> > > > > > > > >
> > > > > > > > > Lee.
> > > > > > > > >
> > > > > > > > > On 2019/02/23 06:03:58, Kenneth Knowles <ke...@apache.org> wrote:
> > > > > > > > > > Nice.
> > > > > > > > > >
> > > > > > > > > > I would very much like to help mentor this project, though you
> > > > > > > > > > already have a couple good ones.
> > > > > > > > > >
> > > > > > > > > > I concur with incubator as sponsoring entity.
> > > > > > > > > >
> > > > > > > > > > Kenn (VP Apache Beam)
> > > > > > > > > >
> > > > > > > > > > On Fri, Feb 22, 2019 at 9:45 PM leerho <le...@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > I didn't realize that this mail list does not accept PDF
> > > > > > > > > > > files, apparently only text.  So let me try one more time
> > > > > > > > > > > ... :)  Please let me know if this works!
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > = Apache DataSketches Proposal[1] =
> > > > > > > > > > >
> > > > > > > > > > > == Abstract ==
> > > > > > > > > > >
> > > > > > > > > > > DataSketches.GitHub.io is an open source, high-performance
> > > > > > > > > > > library of stochastic streaming algorithms commonly called
> > > > > > > > > > > "sketches" in the data sciences. Sketches are small,
> > > > > > > > > > > stateful programs that process massive data as a stream
> > > > > > > > > > > and can provide approximate answers, with mathematical
> > > > > > > > > > > guarantees, to computationally difficult queries
> > > > > > > > > > > orders-of-magnitude faster than traditional, exact
> > > > > > > > > > > methods.
> > > > > > > > > > >
> > > > > > > > > > > This proposal is to move DataSketches to the Apache
> > > > > > > > > > > Software Foundation (ASF), transferring ownership of its
> > > > > > > > > > > copyright intellectual property to the ASF.  Thereafter,
> > > > > > > > > > > DataSketches would be officially known as Apache
> > > > > > > > > > > DataSketches and its evolution and governance would come
> > > > > > > > > > > under the rules and guidance of the ASF.
> > > > > > > > > > >
> > > > > > > > > > > == Introduction ==
> > > > > > > > > > >
> > > > > > > > > > > The DataSketches library contains carefully crafted
> > > > > > > > > > > implementations of sketch algorithms that meet rigorous
> > > > > > > > > > > standards of quality and performance and provide
> > > > > > > > > > > capabilities required for large-scale production systems
> > > > > > > > > > > that must process and analyze massive data. The
> > > > > > > > > > > DataSketches core repository is written in Java with a
> > > > > > > > > > > parallel core repository written in C++ that includes
> > > > > > > > > > > Python wrappers. The DataSketches library also includes
> > > > > > > > > > > special repositories for extending the core library for
> > > > > > > > > > > Apache Hive and Apache Pig. The sketches developed in the
> > > > > > > > > > > different languages share a common binary storage format
> > > > > > > > > > > so that sketches created and stored in Java, for example,
> > > > > > > > > > > can be fully used in C++, and vice versa.  Because the
> > > > > > > > > > > stored sketch "images" are just a "blob" of bytes (similar
> > > > > > > > > > > to picture images), they can be shared across many
> > > > > > > > > > > different systems, languages and platforms.
> > > > > > > > > > >
> > > > > > > > > > > The DataSketches documentation website,
> > > > > > > > > > > https://datasketches.github.io, includes general
> > > > > > > > > > > tutorials, a comprehensive research section with
> > > > > > > > > > > references to relevant academic papers, extensive examples
> > > > > > > > > > > for using the core library directly as well as examples
> > > > > > > > > > > for accessing the library in Hive, Pig, and Apache Spark.
> > > > > > > > > > >
> > > > > > > > > > > The DataSketches library also includes a characterization
> > > > > > > > > > > repository for long running test programs that are used
> > > > > > > > > > > for studying accuracy and performance of these sketches
> > > > > > > > > > > over wide ranges of input variables. The data produced by
> > > > > > > > > > > these programs is used for generating the many performance
> > > > > > > > > > > plots contained in the documentation website and for
> > > > > > > > > > > academic publications.
> > > > > > > > > > >
> > > > > > > > > > > The code repositories used for production are versioned
> > > > > > > > > > > and published to Maven Central at periodic intervals as
> > > > > > > > > > > the library evolves.
> > > > > > > > > > >
> > > > > > > > > > > The DataSketches library also includes several
> > > > > > > > > > > experimental repositories for use-cases outside the
> > > > > > > > > > > large-scale systems environments, such as sketches for
> > > > > > > > > > > mobile, IoT devices (Android), command-line access of the
> > > > > > > > > > > sketch library, and an experimental repository for
> > > > > > > > > > > vector-based sketches that performs approximate Singular
> > > > > > > > > > > Value Decomposition (SVD) analysis that could potentially
> > > > > > > > > > > be used in Machine Learning (ML) applications.
> > > > > > > > > > >
> > > > > > > > > > > == Background ==
> > > > > > > > > > >
> > > > > > > > > > > The DataSketches library was started in 2012 as an
> > > > > > > > > > > internal Yahoo project to dramatically reduce time and
> > > > > > > > > > > resources required for distinct (unique) counting.  An
> > > > > > > > > > > extensive search on the Internet at the time yielded a
> > > > > > > > > > > number of theoretical papers on stochastic streaming
> > > > > > > > > > > algorithms with pseudocode examples, but we did not find
> > > > > > > > > > > any usable open-source code of the quality we felt we
> > > > > > > > > > > needed for our internal production systems.  So we started
> > > > > > > > > > > a small project (one person) to develop our own sketches
> > > > > > > > > > > working directly from published theoretical papers.
> > > > > > > > > > >
> > > > > > > > > > > The DataSketches library was designed from the start with
> > > the
> > > > > > > > > objective of
> > > > > > > > > > > making these algorithms, usually only described in
> > > > theoretical
> > > > > > > > papers,
> > > > > > > > > > > easily accessible to systems developers for use in our
> > > > internal
> > > > > > > > > production
> > > > > > > > > > > systems. By necessity, the code had to be of the highest
> > > > > quality
> > > > > > and
> > > > > > > > > > > thoroughly tested. The wide variety of our internal
> > > > production
> > > > > > > > systems
> > > > > > > > > > > drove the requirement that the sketch implementations had
> > > to
> > > > > > have an
> > > > > > > > > > > absolute minimum of external, run-time dependencies in
> > > order
> > > > to
> > > > > > > > > simplify
> > > > > > > > > > > integration and troubleshooting.
> > > > > > > > > > >
> > > > > > > > > > > > Our internal experiments demonstrated a dramatic positive impact on the
> > > > > > > > > > > > performance of our systems.  As a result, the DataSketches library quickly
> > > > > > > > > > > > evolved to include different types of sketches for different types of
> > > > > > > > > > > > queries, such as frequent-items (a.k.a. heavy-hitters) algorithms,
> > > > > > > > > > > > quantile/histogram algorithms, and weighted and unweighted sampling
> > > > > > > > > > > > algorithms.
> > > > > > > > > > >
> > > > > > > > > > > > We quickly discovered that developing these sketch algorithms to be truly
> > > > > > > > > > > > robust in production environments is quite difficult and requires a deep
> > > > > > > > > > > > understanding of the underlying mathematics and statistics as well as
> > > > > > > > > > > > extensive experience in developing high quality code for 24/7 production
> > > > > > > > > > > > systems. This is a difficult combination of skills for any one organization
> > > > > > > > > > > > to collect and maintain over time. It became clear that this technology
> > > > > > > > > > > > needed a community larger than Yahoo to evolve.  In November 2015, this
> > > > > > > > > > > > factor, along with Yahoo’s strong experience with and support of open
> > > > > > > > > > > > source, led to the decision to open source this technology under an Apache
> > > > > > > > > > > > 2.0 license on GitHub. Since that time our community has expanded
> > > > > > > > > > > > considerably, and the key contributors to this effort include leading
> > > > > > > > > > > > research scientists from a number of universities as well as practitioners
> > > > > > > > > > > > and researchers from a number of major corporations. The core of this
> > > > > > > > > > > > group is very active: we meet weekly to discuss research directions and
> > > > > > > > > > > > engineering priorities.
> > > > > > > > > > >
> > > > > > > > > > > It is important to note that our internal systems at
> > Yahoo
> > > > use
> > > > > > the
> > > > > > > > > current
> > > > > > > > > > > public GitHub open source DataSketches library and not an
> > > > > > internal
> > > > > > > > > version
> > > > > > > > > > > of the code.
> > > > > > > > > > >
> > > > > > > > > > > The close collaboration of scientific research and
> > > > engineering
> > > > > > > > > development
> > > > > > > > > > > experience with actual massive-data processing systems
> > has
> > > > also
> > > > > > > > > produced
> > > > > > > > > > > new research publications in the field of stochastic
> > > > streaming
> > > > > > > > > algorithms,
> > > > > > > > > > > for example:
> > > > > > > > > > >
> > > > > > > > > > > > * Daniel Anderson, Pryce Bevan, Kevin J. Lang, Edo Liberty, Lee Rhodes, and
> > > > > > > > > > > > Justin Thaler. A high-performance algorithm for identifying frequent items
> > > > > > > > > > > > in data streams. In ACM IMC 2017.
> > > > > > > > > > > >
> > > > > > > > > > > > * Anirban Dasgupta, Kevin J. Lang, Lee Rhodes, and Justin Thaler. A
> > > > > > > > > > > > framework for estimating stream expression cardinalities. In EDBT/ICDT
> > > > > > > > > > > > Proceedings ‘16, pages 6:1–6:17, 2016.
> > > > > > > > > > > >
> > > > > > > > > > > > * Mina Ghashami, Edo Liberty, and Jeff M. Phillips. Efficient Frequent
> > > > > > > > > > > > Directions Algorithm for Sparse Matrices. In ACM SIGKDD Proceedings ‘16,
> > > > > > > > > > > > pages 845–854, 2016.
> > > > > > > > > > > >
> > > > > > > > > > > > * Zohar S. Karnin, Kevin J. Lang, and Edo Liberty. Optimal quantile
> > > > > > > > > > > > approximation in streams. In IEEE FOCS Proceedings ‘16, pages 71–78, 2016.
> > > > > > > > > > > >
> > > > > > > > > > > > * Kevin J. Lang. Back to the future: an even more nearly optimal
> > > > > > > > > > > > cardinality estimation algorithm. arXiv preprint
> > > > > > > > > > > > https://arxiv.org/abs/1708.06839, 2017.
> > > > > > > > > > > >
> > > > > > > > > > > > * Edo Liberty. Simple and deterministic matrix sketching. In ACM KDD
> > > > > > > > > > > > Proceedings ‘13, pages 581–588, 2013.
> > > > > > > > > > > >
> > > > > > > > > > > > * Edo Liberty, Michael Mitzenmacher, Justin Thaler, and Jonathan Ullman.
> > > > > > > > > > > > Space lower bounds for itemset frequency sketches. In ACM PODS Proceedings
> > > > > > > > > > > > ‘16, pages 441–454, 2016.
> > > > > > > > > > > >
> > > > > > > > > > > > * Michael Mitzenmacher, Thomas Steinke, and Justin Thaler. Hierarchical
> > > > > > > > > > > > heavy hitters with the space saving algorithm. In SIAM ALENEX Proceedings
> > > > > > > > > > > > ‘12, pages 160–174, 2012.
> > > > > > > > > > >
> > > > > > > > > > > == The Rationale for Sketches ==
> > > > > > > > > > >
> > > > > > > > > > > In the analysis of big data there are often problem
> > queries
> > > > > that
> > > > > > > > don’t
> > > > > > > > > > > scale because they require huge compute resources and
> > time
> > > to
> > > > > > > > generate
> > > > > > > > > > > exact results. Examples include count distinct,
> > quantiles,
> > > > most
> > > > > > > > > frequent
> > > > > > > > > > > items, joins, matrix computations, and graph analysis.
> > > > > > > > > > >
> > > > > > > > > > > If we can loosen the requirement of “exact” results from
> > > our
> > > > > > queries
> > > > > > > > > and be
> > > > > > > > > > > satisfied with approximate results, within some well
> > > > understood
> > > > > > > > bounds
> > > > > > > > > of
> > > > > > > > > > > error, there is an entire branch of mathematics and data
> > > > > science
> > > > > > that
> > > > > > > > > has
> > > > > > > > > > > evolved around developing algorithms that can produce
> > > > > approximate
> > > > > > > > > results
> > > > > > > > > > > with mathematically well-defined error properties.
> > > > > > > > > > >
> > > > > > > > > > > > Adding the requirements that these algorithms be small (compared to the
> > > > > > > > > > > > size of the input data), sublinear (the size of the sketch must grow at a
> > > > > > > > > > > > slower rate than the size of the input stream), streaming (they can only
> > > > > > > > > > > > touch each data item once), and mergeable (suitable for distributed
> > > > > > > > > > > > processing) defines a class of algorithms that can be described as small,
> > > > > > > > > > > > stochastic, streaming, sublinear, mergeable algorithms, commonly called
> > > > > > > > > > > > sketches (they also have other names, but we will use the term sketches
> > > > > > > > > > > > from here on).
> > > > > > > > > > >
> > > > > > > > > > > To be truly streaming and be able to process data in a
> > > single
> > > > > > pass,
> > > > > > > > > > > sketches must make absolute minimum assumptions about the
> > > > input
> > > > > > > > stream.
> > > > > > > > > > > This is critically important, as there is no “second
> > > chance”
> > > > to
> > > > > > > > > process the
> > > > > > > > > > > data.
> > > > > > > > > > >
> > > > > > > > > > > For example, sketches should not make assumptions about
> > the
> > > > > > order of
> > > > > > > > > stream
> > > > > > > > > > > items, the stream length, the dynamic range of values, or
> > > the
> > > > > > > > > distribution
> > > > > > > > > > > of item occurrence frequencies. Sketches should be
> > tolerant
> > > > of
> > > > > > NaNs,
> > > > > > > > > Nulls
> > > > > > > > > > > and empty objects. About the only thing that the sketch
> > > needs
> > > > > to
> > > > > > know
> > > > > > > > > about
> > > > > > > > > > > the stream is how to extract items from it and what type
> > > the
> > > > > > item is,
> > > > > > > > > e.g.,
> > > > > > > > > > > is it a numeric value or a string.
> > > > > > > > > > >
> > > > > > > > > > > As far as the sketch is concerned, the input stream is a
> > > > > > sequence of
> > > > > > > > > items
> > > > > > > > > > > in some unknown random order with unknown random values.
> > > > > > > > > > >
> > > > > > > > > > > > The sketch is essentially a complex state machine and, combined with the
> > > > > > > > > > > > random input stream, defines a stochastic process. We then apply
> > > > > > > > > > > > probabilistic methods to interpret the states of the stochastic process in
> > > > > > > > > > > > order to extract useful information about the input stream itself. The
> > > > > > > > > > > > resulting information will be approximate, but we also use additional
> > > > > > > > > > > > probabilistic methods to extract an estimate of the likely probability
> > > > > > > > > > > > distribution of error.
> > > > > > > > > > >
> > > > > > > > > > > > There is a significant scientific contribution here: defining the state
> > > > > > > > > > > > machine, understanding the resulting stochastic process, developing the
> > > > > > > > > > > > probabilistic methods, and proving mathematically that it all works!
> > > > > > > > > > > > This is why the scientific contributors to this project are a critical and
> > > > > > > > > > > > strategic component of our success.  The development engineers translate
> > > > > > > > > > > > the concepts of the proposed state machine and probabilistic methods into
> > > > > > > > > > > > production-quality code. Even more important, they work closely with the
> > > > > > > > > > > > scientists, feeding back system and user requirements, which leads not only
> > > > > > > > > > > > to superior product design, but to new science as well.  A number of the
> > > > > > > > > > > > scientific papers our members have published (see above) are a direct
> > > > > > > > > > > > result of this close collaboration.
> > > > > > > > > > >
> > > > > > > > > > > Because sketches are small they can be processed
> > extremely
> > > > > fast,
> > > > > > > > often
> > > > > > > > > many
> > > > > > > > > > > orders-of-magnitude faster than traditional exact
> > > > computations.
> > > > > > For
> > > > > > > > > > > interactive queries there may not be other viable
> > > > alternatives,
> > > > > > and
> > > > > > > > in
> > > > > > > > > the
> > > > > > > > > > > case of real-time analysis, sketches are the only known
> > > > > solution.
> > > > > > > > > > >
> > > > > > > > > > > > For any system that needs to extract useful information from massive data,
> > > > > > > > > > > > sketches are essential tools that should be tightly integrated into the
> > > > > > > > > > > > system’s analysis capabilities. This technology has helped Yahoo
> > > > > > > > > > > > successfully reduce data processing times from days to hours or minutes on
> > > > > > > > > > > > a number of its internal platforms and has enabled subsecond queries on
> > > > > > > > > > > > real-time platforms that would have been infeasible without sketches.
> > > > > > > > > > > >
> > > > > > > > > > > > == The Rationale for Apache DataSketches ==
> > > > > > > > > > > >
> > > > > > > > > > > > Other open source implementations of sketch algorithms can be found on the
> > > > > > > > > > > > Internet. However, we have not yet found any open source implementations
> > > > > > > > > > > > that are as comprehensive, that are engineered with the quality required
> > > > > > > > > > > > for production systems, and that have usable and guaranteed error
> > > > > > > > > > > > properties. Large Internet companies, such as Google and Facebook, have
> > > > > > > > > > > > published papers on sketching; however, their implementations of their
> > > > > > > > > > > > published algorithms are proprietary and not available as open source.
> > > > > > > > > > >
> > > > > > > > > > > The DataSketches library already provides integrations
> > > with a
> > > > > > number
> > > > > > > > of
> > > > > > > > > > > major Apache data processing platforms such as Apache
> > Hive,
> > > > > > Apache
> > > > > > > > Pig,
> > > > > > > > > > > Apache Spark and Apache Druid, and is also integrated
> > with
> > > a
> > > > > > number
> > > > > > > > of
> > > > > > > > > > > other open source data processing platforms such as
> > Splice
> > > > > > Machine,
> > > > > > > > > GCHQ
> > > > > > > > > > > Gaffer and PostgreSQL.
> > > > > > > > > > >
> > > > > > > > > > > > We believe that DataSketches, as an Apache project, will provide an
> > > > > > > > > > > > immediate, worthwhile, and substantial contribution to the open source
> > > > > > > > > > > > community, and will have a better opportunity to provide a meaningful
> > > > > > > > > > > > contribution to both the science and engineering of sketching algorithms
> > > > > > > > > > > > and to integrate with other Apache projects.  In addition, this is a
> > > > > > > > > > > > significant opportunity for Apache to be the "go-to" destination for users
> > > > > > > > > > > > that want to leverage this exciting technology.
> > > > > > > > > > >
> > > > > > > > > > > == Initial Goals ==
> > > > > > > > > > >
> > > > > > > > > > > > We are breaking our initial goals into short-term (2-6 months) and
> > > > > > > > > > > > intermediate to long-term (6 months to 2 years).
> > > > > > > > > > >
> > > > > > > > > > > Our short-term goals include:
> > > > > > > > > > >
> > > > > > > > > > > * Understanding and adapting to the Apache development
> > > > process
> > > > > > and
> > > > > > > > > > > structures.
> > > > > > > > > > >
> > > > > > > > > > > > * Refactoring the codebase and moving the code of the various DataSketches
> > > > > > > > > > > > repositories to the Apache Git repository.
> > > > > > > > > > > >
> > > > > > > > > > > > * Continuing development of new features, functions, and fixes.
> > > > > > > > > > > >
> > > > > > > > > > > > * Continuing to develop and expand specific sub-projects (e.g., C++ and
> > > > > > > > > > > > Python).
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > The intermediate to long-term goals include:
> > > > > > > > > > >
> > > > > > > > > > > * Completing the design and implementation of the C++
> > > > sketches
> > > > > to
> > > > > > > > > > > complement what is already available in Java, and the
> > > Python
> > > > > > wrappers
> > > > > > > > > of
> > > > > > > > > > > those C++ sketches.
> > > > > > > > > > >
> > > > > > > > > > > * Expanding the C++ build framework to include Windows
> > and
> > > > the
> > > > > > > > popular
> > > > > > > > > > > Linux variants.
> > > > > > > > > > >
> > > > > > > > > > > * Continued engagement with the scientific research
> > > community
> > > > > on
> > > > > > the
> > > > > > > > > > > development of new algorithms for computationally
> > difficult
> > > > > > problems
> > > > > > > > > that
> > > > > > > > > > > heretofore have not had a sketching solution.
> > > > > > > > > > >
> > > > > > > > > > > == Current Status ==
> > > > > > > > > > >
> > > > > > > > > > > > The DataSketches GitHub project has been quite successful.  As of this
> > > > > > > > > > > > writing (February 2019) the number of downloads measured by the Nexus
> > > > > > > > > > > > Repository Manager at https://oss.sonatype.org has grown by nearly a factor
> > > > > > > > > > > > of 10 over the past year to about 55 thousand per month.  The
> > > > > > > > > > > > DataSketches/sketches-core repository has about 560 stars and 141 forks,
> > > > > > > > > > > > which is pretty good for a highly specialized library.
> > > > > > > > > > >
> > > > > > > > > > > === Development Practices ===
> > > > > > > > > > >
> > > > > > > > > > > ==== Source Control ====
> > > > > > > > > > >
> > > > > > > > > > > > All of our developers have extensive experience with Git version control
> > > > > > > > > > > > and follow accepted practices such as the use of Pull Requests (PRs), code
> > > > > > > > > > > > reviews, and commits to master.
> > > > > > > > > > >
> > > > > > > > > > > ==== Testing ====
> > > > > > > > > > >
> > > > > > > > > > > > Sketches, by their nature, are probabilistic programs and don’t necessarily
> > > > > > > > > > > > behave deterministically.  For some of the sketches we intentionally insert
> > > > > > > > > > > > random noise into the code, as this gives us the mathematical properties
> > > > > > > > > > > > that we need to guarantee accuracy.  This can make the behavior of these
> > > > > > > > > > > > algorithms quite unintuitive and presents significant challenges to the
> > > > > > > > > > > > developer who wishes to test these algorithms for correctness.  As a
> > > > > > > > > > > > result, our testing strategy includes two major components: unit tests and
> > > > > > > > > > > > characterization tests.
> > > > > > > > > > >
> > > > > > > > > > > ===== Unit Testing =====
> > > > > > > > > > >
> > > > > > > > > > > Our unit tests are primarily quick tests to make sure
> > that
> > > we
> > > > > > > > exercise
> > > > > > > > > all
> > > > > > > > > > > critical paths in the code and that key branches are
> > > executed
> > > > > > > > > correctly. It
> > > > > > > > > > > is important that they execute relatively fast as they
> > are
> > > > > > generally
> > > > > > > > > run on
> > > > > > > > > > > every code build. The sketches-core repository alone has
> > > > about
> > > > > 22
> > > > > > > > > thousand
> > > > > > > > > > > statements, over 1300 unit tests and code coverage of
> > about
> > > > > > 98.2% as
> > > > > > > > > > > measured by Atlassian/Clover.  It is our goal for all of
> > > our
> > > > > code
> > > > > > > > > > > repositories that are used in production that they have
> > > code
> > > > > > coverage
> > > > > > > > > > > greater than 90%.
> > > > > > > > > > >
> > > > > > > > > > > ===== Characterization Testing =====
> > > > > > > > > > >
> > > > > > > > > > > > In order to test the probabilistic methods that are used to interpret the
> > > > > > > > > > > > stochastic behaviors of our sketches, we have a separate characterization
> > > > > > > > > > > > repository that is dedicated to this.  Measuring accuracy, for example,
> > > > > > > > > > > > requires running thousands of trials at each of many different points along
> > > > > > > > > > > > the domain axis. Each trial compares its estimated results against a known
> > > > > > > > > > > > exact result, producing an error for that trial.  These error measurements
> > > > > > > > > > > > are then fed into our Quantiles sketch to capture the actual distribution
> > > > > > > > > > > > of error at that point along the axis. We then select quantile contours
> > > > > > > > > > > > across all the distributions at points along the axis.  These contours can
> > > > > > > > > > > > then be plotted to reveal the shape of the actual error distribution.
> > > > > > > > > > > > These distributions are not at all Gaussian; in fact, they can be quite
> > > > > > > > > > > > complex. Nonetheless, these distributions are then checked against the
> > > > > > > > > > > > statistical guarantees inherent to the specific sketch algorithm and its
> > > > > > > > > > > > parameters. There are many examples of these characterization error
> > > > > > > > > > > > distributions on our website. The runtimes of these tests can be very long,
> > > > > > > > > > > > ranging from many minutes to hours, and some can run for days.  Currently,
> > > > > > > > > > > > we have separate characterization repositories for Java and C++ / Python.
> > > > > > > > > > >
> > > > > > > > > > > It is our goal that we perform this characterization
> > > analysis
> > > > > > for all
> > > > > > > > > of
> > > > > > > > > > > our sketches.  By definition, the code that runs these
> > > > > > > > characterization
> > > > > > > > > > > tests is open-source so others can run these tests as
> > well.
> > > > We
> > > > > > do
> > > > > > > > not
> > > > > > > > > have
> > > > > > > > > > > formal releases of this code (because it is not
> > production
> > > > > code)
> > > > > > and
> > > > > > > > > it is
> > > > > > > > > > > not published to Maven Central.
> > > > > > > > > > >
> > > > > > > > > > > === Meritocracy ===
> > > > > > > > > > >
> > > > > > > > > > > DataSketches was initially developed based on
> > requirements
> > > > > within
> > > > > > > > > Yahoo. As
> > > > > > > > > > > a project on GitHub, DataSketches has received
> > > contributions
> > > > > from
> > > > > > > > > numerous
> > > > > > > > > > > individual developers from around the world, dedicated
> > > > research
> > > > > > work
> > > > > > > > > from
> > > > > > > > > > > senior scientists at Amazon and Visa, and academic
> > > > researchers
> > > > > > from
> > > > > > > > > > > Georgetown University, Princeton, and MIT.
> > > > > > > > > > >
> > > > > > > > > > > > As a project under incubation, we are committed to expanding our effort to
> > > > > > > > > > > > build an environment which supports a meritocracy. We are focused on
> > > > > > > > > > > > engaging the community and other related projects for support and
> > > > > > > > > > > > contributions. Moreover, we are committed to ensuring that contributors and
> > > > > > > > > > > > committers to DataSketches come from a broad mix of organizations through a
> > > > > > > > > > > > merit-based decision process during incubation. We believe strongly in the
> > > > > > > > > > > > DataSketches premise of a well-engineered and scientifically rigorous
> > > > > > > > > > > > library that implements these powerful algorithms, and we are committed to
> > > > > > > > > > > > growing an inclusive community of DataSketches contributors and users.
> > > > > > > > > > >
> > > > > > > > > > > === Community ===
> > > > > > > > > > >
> > > > > > > > > > > Yahoo has a long history and active engagement in the
> > Open
> > > > > Source
> > > > > > > > > > > community. Major projects include: Vespa.ai, Bullet,
> > > Moloch,
> > > > > > > > Panoptes,
> > > > > > > > > > > Screwdriver.cd, Athenz, HaloDB, Maha, Mendel,
> > > > > TensorFlowOnSpark,
> > > > > > > > > gifshot,
> > > > > > > > > > > fluxible, as well as the creation, contribution and
> > > > incubation
> > > > > of
> > > > > > > > many
> > > > > > > > > > > Apache projects such as Apache Hadoop, Pig, Bookkeeper,
> > > > Oozie,
> > > > > > > > > Zookeeper,
> > > > > > > > > > > Omid, Pulsar, Traffic Server, Storm, Druid, and many
> > more.
> > > > > > > > > > >
> > > > > > > > > > > > Every day, DataSketches is actively used by organizations and institutions
> > > > > > > > > > > > around the world for batch and stream processing of data. We believe
> > > > > > > > > > > > acceptance will allow us to consolidate existing DataSketches-related
> > > > > > > > > > > > work, grow the DataSketches community, and deepen connections between
> > > > > > > > > > > > DataSketches and other open source projects.
> > > > > > > > > > >
> > > > > > > > > > > === Introduction to the Core Developers & Contributors
> > ===
> > > > > > > > > > >
> > > > > > > > > > > The core developers and contributors for DataSketches are
> > > > from
> > > > > > > > diverse
> > > > > > > > > > > backgrounds, but primarily are scientists that love
> > > > engineering
> > > > > > and
> > > > > > > > > > > engineers that love science. A large part of the value we
> > > > bring
> > > > > > comes
> > > > > > > > > from
> > > > > > > > > > > this synthesis.  These individuals have already
> > contributed
> > > > > > > > > substantially
> > > > > > > > > > > to the code, algorithms, and/or mathematical proofs that
> > > form
> > > > > the
> > > > > > > > > basis of
> > > > > > > > > > > the library.
> > > > > > > > > > >
> > > > > > > > > > > > This core group also forms the Initial Committers with write permissions to
> > > > > > > > > > > > the repository. Those marked with (*) meet weekly to plan the research and
> > > > > > > > > > > > engineering direction of the project.
> > > > > > > > > > >
> > > > > > > > > > > ==== Scientists That Love Engineering ====
> > > > > > > > > > >
> > > > > > > > > > > > * Eshcar Hillel: Senior Research Scientist, Yahoo Labs, Israel. Interests:
> > > > > > > > > > > > distributed systems, scalable systems and platforms for big data
> > > > > > > > > > > > processing, concurrent algorithms and data structures.
> > > > > > > > > > >
> > > > > > > > > > > * Kevin Lang: (*) Distinguished Research Scientist, Yahoo
> > > > Labs,
> > > > > > > > > Sunnyvale,
> > > > > > > > > > > California. Interests: algorithms, theoretical and
> > applied
> > > > > > > > mathematics,
> > > > > > > > > > > encoding and compression theory, theoretical and applied
> > > > > > performance
> > > > > > > > > > > optimization.
> > > > > > > > > > >
> > > > > > > > > > > * Edo Liberty: (*) Director of Research, Head of Amazon
> > AI
> > > > > Labs,
> > > > > > Palo
> > > > > > > > > Alto,
> > > > > > > > > > > California. Manages the algorithms group at Amazon AI. We
> > > > build
> > > > > > > > > scalable
> > > > > > > > > > > machine learning systems and algorithms which are used
> > both
> > > > > > > > internally
> > > > > > > > > and
> > > > > > > > > > > externally by customers of SageMaker, AWS's flagship
> > > machine
> > > > > > learning
> > > > > > > > > > > platform.
> > > > > > > > > > >
> > > > > > > > > > > > * Jon Malkin: (*) Senior Scientist, Yahoo Labs, Sunnyvale. Interests:
> > > > > > > > > > > > computational advertising, machine learning, speech recognition,
> > > > > > > > > > > > data-driven analysis, large scale experimentation, big data,
> > > > > > > > > > > > stream/complex event processing.
> > > > > > > > > > >
> > > > > > > > > > > > * Justin Thaler: (*) Assistant Professor, Department of Computer Science,
> > > > > > > > > > > > Georgetown University, Washington D.C. Interests: algorithms and
> > > > > > > > > > > > computational complexity, complexity theory, quantum algorithms, private
> > > > > > > > > > > > data analysis, learning theory, and developing efficient streaming and
> > > > > > > > > > > > sketching algorithms.
> > > > > > > > > > >
> > > > > > > > > > > ==== Engineers That Love Science ====
> > > > > > > > > > >
> > > > > > > > > > > * Roman Leventov: Senior Software Engineer,  Metamarkets
> > /
> > > > > Snap.
> > > > > > > > > Interests:
> > > > > > > > > > > design and implementation of data storing and data
> > > processing
> > > > > > > > > (distributed)
> > > > > > > > > > > systems, performance optimization, CPU performance,
> > > > mechanical
> > > > > > > > > sympathy,
> > > > > > > > > > > JVM performance, API design, databases, (concurrent) data
> > > > > > structures,
> > > > > > > > > > > memory management, garbage collection algorithms,
> > language
> > > > > > design and
> > > > > > > > > > > runtimes (their tradeoffs), distributed systems (cloud)
> > > > > > efficiency,
> > > > > > > > > Linux,
> > > > > > > > > > > code quality, code transformation, pure functional
> > > > programming
> > > > > > > > models,
> > > > > > > > > > > Haskell.
> > > > > > > > > > >
> > > > > > > > > > > * Lee Rhodes: (*) Distinguished Architect, lead developer
> > > and
> > > > > > founder
> > > > > > > > > of
> > > > > > > > > > > the DataSketches project, Yahoo, Sunnyvale, California.
> > > > > > Interests:
> > > > > > > > > > > streaming algorithms, mathematics, computer science, high
> > > > > > quality and
> > > > > > > > > high
> > > > > > > > > > > performance code for the analysis of massive data,
> > bridging
> > > > the
> > > > > > > > divide
> > > > > > > > > > > between theory and practice.
> > > > > > > > > > >
> > > > > > > > > > > * Alexander Saydakov: (*) Senior Software Engineer,
> > Yahoo,
> > > > > > Sunnyvale,
> > > > > > > > > > > California. Interests: applied mathematics, computer
> > > science,
> > > > > big
> > > > > > > > data,
> > > > > > > > > > > distributed systems.
> > > > > > > > > > >
> > > > > > > > > > > > === Introduction to Additional Interested Contributors ===
> > > > > > > > > > >
> > > > > > > > > > > > These folks have been involved and have contributed only intermittently,
> > > > > > > > > > > > but are strong supporters of this project.
> > > > > > > > > > >
> > > > > > > > > > > * Frank Grimes: GitHub ID: frankgrimes97
> > > > > > > > > > >
> > > > > > > > > > > * Mina Ghashami: [mina.ghashami at gmail dot com] Ph.D.
> > > > > Computer
> > > > > > > > > Science,
> > > > > > > > > > > Univ of Utah. Interests: Machine Learning, Data Mining,
> > > > matrix
> > > > > > > > > > > approximation, streaming algorithms, randomized linear
> > > > algebra.
> > > > > > > > > > >
> > > > > > > > > > > * Christopher Musco: [christopher.musco at gmail dot com]
> > > > Ph.D.
> > > > > > > > > Computer
> > > > > > > > > > > Science, Research Instructor, Princeton University.
> > > > Interests:
> > > > > > > > > algorithmic
> > > > > > > > > > > foundations of data science and machine learning,
> > efficient
> > > > > > methods
> > > > > > > > for
> > > > > > > > > > > processing and understanding large datasets, often
> > working
> > > at
> > > > > the
> > > > > > > > > > > intersection of theoretical computer science, numerical
> > > > linear
> > > > > > > > > algebra, and
> > > > > > > > > > > optimization.
> > > > > > > > > > >
> > > > > > > > > > > * Graham Cormode: [g.cormode at warwick.ac dot uk] Ph.D.
> > > > > > Computer
> > > > > > > > > Science,
> > > > > > > > > > > Professor, Warwick University, Warwick, England.
> > Interests:
> > > > all
> > > > > > > > > aspects of
> > > > > > > > > > > the "data lifecycle", from data collection and cleaning,
> > > > > through
> > > > > > > > > mining and
> > > > > > > > > > > analytics. (Professor Cormode is one of the world’s
> > leading
> > > > > > > > scientists
> > > > > > > > > in
> > > > > > > > > > > sketching algorithms)
> > > > > > > > > > >
> > > > > > > > > > > === Alignment ===
> > > > > > > > > > >
> > > > > > > > > > > The DataSketches library already provides integrations
> > and
> > > > > > example
> > > > > > > > > code for
> > > > > > > > > > > Apache Hive, Apache Pig, Apache Spark and is deeply
> > > > integrated
> > > > > > into
> > > > > > > > > Apache
> > > > > > > > > > > Druid.
> > > > > > > > > > >
> > > > > > > > > > > == Known Risks ==
> > > > > > > > > > >
> > > > > > > > > > > The following subsections are specific risks that have
> > been
> > > > > > > > identified
> > > > > > > > > by
> > > > > > > > > > > the ASF that need to be addressed.
> > > > > > > > > > >
> > > > > > > > > > > === Risk: Orphaned Products ===
> > > > > > > > > > >
> > > > > > > > > > > The DataSketches library is presently used by a number of
> > > > > > > > > organizations,
> > > > > > > > > > > from small startups to Fortune 100 companies, to
> > construct
> > > > > > production
> > > > > > > > > > > pipelines that must process and analyze massive data.
> > Yahoo
> > > > > has a
> > > > > > > > > long-term
> > > > > > > > > > > commitment to continue to advance the DataSketches
> > library;
> > > > > > moreover,
> > > > > > > > > > > DataSketches is seeing increasing interest, development,
> > > and
> > > > > > adoption
> > > > > > > > > from
> > > > > > > > > > > many diverse organizations from around the world. Due to
> > > its
> > > > > > growing
> > > > > > > > > > > adoption, we feel it is quite unlikely that this project
> > > > would
> > > > > > become
> > > > > > > > > > > orphaned.
> > > > > > > > > > >
> > > > > > > > > > > === Risk: Inexperience with Open Source ===
> > > > > > > > > > >
> > > > > > > > > > > Yahoo believes strongly in open source and the exchange
> > of
> > > > > > > > information
> > > > > > > > > to
> > > > > > > > > > > advance new ideas and work. Examples of this commitment
> > are
> > > > > > active
> > > > > > > > open
> > > > > > > > > > > source projects such as those mentioned above. With
> > > > > > DataSketches, we
> > > > > > > > > have
> > > > > > > > > > > been increasingly open and forward-looking; we have
> > > > published a
> > > > > > > > number
> > > > > > > > > of
> > > > > > > > > > > papers about breakthrough developments in the science of
> > > > > > streaming
> > > > > > > > > > > algorithms (mentioned above) that also reference the
> > > > > DataSketches
> > > > > > > > > library.
> > > > > > > > > > > Our submission to the Apache Software Foundation is a
> > > logical
> > > > > > > > > extension of
> > > > > > > > > > > our commitment to open source software.
> > > > > > > > > > >
> > > > > > > > > > > Key committers at Yahoo with strong open source
> > backgrounds
> > > > > > include
> > > > > > > > > Aaron
> > > > > > > > > > > Gresch, Alan Carroll, Alessandro Bellina, Anastasia
> > > > Braginsky,
> > > > > > > > Andrews
> > > > > > > > > > > Sahaya Albert, Arun S A G, Atul Mohan, Brad McMillen,
> > Bryan
> > > > > Call,
> > > > > > > > Daryn
> > > > > > > > > > > Sharp, Dav Glass, David Carlin, Derek Dagit, Eric Payne,
> > > > Eshcar
> > > > > > > > Hillel,
> > > > > > > > > > > Ethan Li, Fei Deng, Francis Christopher Liu, Francisco
> > > > > > > > Perez-Sorrosal,
> > > > > > > > > Gil
> > > > > > > > > > > > Yehuda, Govind Menon, Hang Yang, Jacob Estelle, Jai
> > Asher,
> > > > > James
> > > > > > > > > Penick,
> > > > > > > > > > > Jason Kenny, Jay Pipes, Jim Rollenhagen, Joe Francis, Jon
> > > > > Eagles,
> > > > > > > > > Kihwal
> > > > > > > > > > > Lee, Kishorkumar Patil, Koji Noguchi, Kuhu Shukla,
> > Michael
> > > > > > Trelinski,
> > > > > > > > > > > Mithun Radhakrishnan, Nathan Roberts, Ohad Shacham, Olga
> > L.
> > > > > > > > Natkovich,
> > > > > > > > > > > Parth Kamlesh Gandhi, Rajan Dhabalia, Rohini Palaniswamy,
> > > > Ruby
> > > > > > Loo,
> > > > > > > > > Ryan
> > > > > > > > > > > Bridges, Sanket Chintapalli, Satish Subhashrao Saley, Shu
> > > Kit
> > > > > > Chan,
> > > > > > > > Sri
> > > > > > > > > > > Harsha Mekala, Susan Hinrichs, Yonatan Gottesman, and
> > many
> > > > > more.
> > > > > > > > > > >
> > > > > > > > > > > > All of our core developers are committed to learning about the Apache
> > > > > > > > > > > > process and to giving back to the community.
> > > > > > > > > > >
> > > > > > > > > > > === Risk: Homogeneous Developers ===
> > > > > > > > > > >
> > > > > > > > > > > > The majority of committers in this proposal belong to Yahoo because
> > > > > > > > > > > > DataSketches emerged from an internal Yahoo project. This proposal also
> > > > > > > > > > > > includes developers and contributors from other companies who are actively
> > > > > > > > > > > > involved with other Apache projects, such as Druid. We expect our entry
> > > > > > > > > > > > into incubation will allow us to expand the number of individuals and
> > > > > > > > > > > > organizations participating in DataSketches development.
> > > > > > > > > > >
> > > > > > > > > > > === Risk: Reliance on Salaried Developers ===
> > > > > > > > > > >
> > > > > > > > > > > Because the DataSketches library originated within Yahoo,
> > > it
> > > > > has
> > > > > > been
> > > > > > > > > > > developed primarily by salaried Yahoo developers and we
> > > > expect
> > > > > > that
> > > > > > > > to
> > > > > > > > > > > continue to be the case near term. However, since we
> > placed
> > > > > this
> > > > > > > > > library
> > > > > > > > > > > into open-source we have had a number of significant
> > > > > > contributions
> > > > > > > > from
> > > > > > > > > > > engineers and scientists from outside of Yahoo. We expect
> > > our
> > > > > > > > reliance
> > > > > > > > > on
> > > > > > > > > > > Yahoo salaried developers will decrease over time.
> > > > Nonetheless,
> > > > > > Yahoo
> > > > > > > > > is
> > > > > > > > > > > committed to continue its strong support of this
> > important
> > > > > > project.
> > > > > > > > > > >
> > > > > > > > > > > > === Risk: Lack of Relationship to other Apache Products ===
> > > > > > > > > > >
> > > > > > > > > > > DataSketches already directly interoperates with or
> > > utilizes
> > > > > > several
> > > > > > > > > > > existing Apache projects.
> > > > > > > > > > >
> > > > > > > > > > > * Build
> > > > > > > > > > >    * Apache Maven
> > > > > > > > > > >
> > > > > > > > > > > * Integrations and adaptors for the following projects
> > > > > naturally
> > > > > > have
> > > > > > > > > them
> > > > > > > > > > > as dependencies
> > > > > > > > > > >    * Apache Hive
> > > > > > > > > > >    * Apache Pig
> > > > > > > > > > >    * Apache Druid
> > > > > > > > > > >    * Apache Spark
> > > > > > > > > > >
> > > > > > > > > > > * Additional dependencies for the above integrations and
> > > > > adaptors
> > > > > > > > > include
> > > > > > > > > > >    * Apache Hadoop
> > > > > > > > > > >    * Apache Commons (Math)
> > > > > > > > > > >
> > > > > > > > > > > There is no other Apache project that we are aware of
> > that
> > > > > > duplicates
> > > > > > > > > the
> > > > > > > > > > > functionality of the DataSketches library.
> > > > > > > > > > >
> > > > > > > > > > > > === Risk: An Excessive Fascination with the Apache Brand ===
> > > > > > > > > > >
> > > > > > > > > > > With this proposal we are not seeking attention or
> > > publicity.
> > > > > > Rather,
> > > > > > > > > we
> > > > > > > > > > > firmly believe in the DataSketches library and concept
> > and
> > > > the
> > > > > > > > ability
> > > > > > > > > to
> > > > > > > > > > > make the DataSketches library a powerful, yet
> > simple-to-use
> > > > > > toolkit
> > > > > > > > for
> > > > > > > > > > > data processing. While the DataSketches library has been
> > > open
> > > > > > source,
> > > > > > > > > we
> > > > > > > > > > > believe putting code on GitHub can only go so far. We see
> > > the
> > > > > > Apache
> > > > > > > > > > > community, processes, and mission as critical for
> > ensuring
> > > > the
> > > > > > > > > DataSketches
> > > > > > > > > > > library is truly community-driven, positively impactful,
> > > and
> > > > > > > > innovative
> > > > > > > > > > > open source software. While Yahoo has taken a number of
> > > steps
> > > > > to
> > > > > > > > > advance
> > > > > > > > > > > its various open source projects, we believe the
> > > DataSketches
> > > > > > library
> > > > > > > > > > > project is a great fit for the Apache Software Foundation
> > > due
> > > > > to
> > > > > > its
> > > > > > > > > focus
> > > > > > > > > > > on data processing and its relationships to existing ASF
> > > > > > projects.
> > > > > > > > > > >
> > > > > > > > > > > === Risk: Cryptography ===
> > > > > > > > > > >
> > > > > > > > > > > DataSketches does not contain any cryptographic code and
> > is
> > > > > not a
> > > > > > > > > > > cryptographic product.
> > > > > > > > > > >
> > > > > > > > > > > == Documentation ==
> > > > > > > > > > >
> > > > > > > > > > > The following documentation is relevant to this proposal.
> > > > > > Relevant
> > > > > > > > > portions
> > > > > > > > > > > of the documentation will be contributed to the Apache
> > > > > > DataSketches
> > > > > > > > > > > project.
> > > > > > > > > > >
> > > > > > > > > > > * DataSketches website: https://datasketches.github.io.
> > > > > > > > > > >
> > > > > > > > > > > * DataSketches website repository:
> > > > > > > > > > > https://github.com/DataSketches/DataSketches.github.io
> > > > > > > > > > >
> > > > > > > > > > > We will need an apache website for this documentation
> > > similar
> > > > > to
> > > > > > > > > > >
> > > > > > > > > > > * https://datasketches.apache.org
> > > > > > > > > > >
> > > > > > > > > > > == Initial Source ==
> > > > > > > > > > >
> > > > > > > > > > > The initial source for DataSketches which we will submit
> > to
> > > > the
> > > > > > > > Apache
> > > > > > > > > > > Foundation will include a number of repositories which
> > are
> > > > > > currently
> > > > > > > > > hosted
> > > > > > > > > > > under the GitHub.com/datasketches organization:
> > > > > > > > > > >
> > > > > > > > > > > All github.com/datasketches repositories including:
> > > > > > > > > > >
> > > > > > > > > > > * Java
> > > > > > > > > > >    * sketches-core: This repository has the core
> > sketching
> > > > > > classes,
> > > > > > > > > which
> > > > > > > > > > > are leveraged by some of the other repositories. This
> > > > > repository
> > > > > > has
> > > > > > > > no
> > > > > > > > > > > external dependencies outside of the DataSketches/memory
> > > > > > repository,
> > > > > > > > > Java
> > > > > > > > > > > and TestNG for unit tests. This code is versioned and the
> > > > > latest
> > > > > > > > > release
> > > > > > > > > > > can be obtained from Maven Central.
> > > > > > > > > > >    * memory: Low level, high-performance memory
> > > > data-structure
> > > > > > > > > management
> > > > > > > > > > > primarily for off-heap.
> > > > > > > > > > >    * sketches-android: This is a new repository dedicated
> > > to
> > > > > > sketches
> > > > > > > > > > > designed to be run in a mobile client, such as a cell
> > > phone.
> > > > It
> > > > > > is
> > > > > > > > > still in
> > > > > > > > > > > development and should be considered experimental.
> > > > > > > > > > >    * sketches-hive: This repository contains Hive UDFs
> > and
> > > > > UDAFs
> > > > > > for
> > > > > > > > > use
> > > > > > > > > > > within Hadoop grid environments. This code has
> > dependencies
> > > > on
> > > > > > > > > > > sketches-core as well as Hadoop and Hive. Users of this
> > > code
> > > > > are
> > > > > > > > > advised to
> > > > > > > > > > > use Maven to bring in all the required dependencies. This
> > > > code
> > > > > is
> > > > > > > > > versioned
> > > > > > > > > > > and the latest release can be obtained from Maven
> > Central.
> > > > > > > > > > >    * sketches-pig: This repository contains Pig User
> > > Defined
> > > > > > > > Functions
> > > > > > > > > > > (UDF) for use within Hadoop grid environments. This code
> > > has
> > > > > > > > > dependencies
> > > > > > > > > > > on sketches-core as well as Hadoop and Pig. Users of this
> > > > code
> > > > > > are
> > > > > > > > > advised
> > > > > > > > > > > to use Maven to bring in all the required dependencies.
> > > This
> > > > > > code is
> > > > > > > > > > > versioned and the latest release can be obtained from
> > Maven
> > > > > > Central.
> > > > > > > > > > >    * sketches-vector: This is a new repository dedicated
> > to
> > > > > > sketches
> > > > > > > > > for
> > > > > > > > > > > vector and matrix operations. It is still somewhat
> > > > > experimental.
> > > > > > > > > > >    * characterization: This relatively new repository is
> > > for
> > > > > code
> > > > > > > > that
> > > > > > > > > we
> > > > > > > > > > > use to characterize the accuracy and speed performance of
> > > the
> > > > > > > > sketches
> > > > > > > > > in
> > > > > > > > > > > the library and is constantly being updated. Examples of
> > > the
> > > > > job
> > > > > > > > > command
> > > > > > > > > > > files used for various tests can be found in the
> > > > > > src/main/resources
> > > > > > > > > > > directory. Some of these tests can run for hours
> > depending
> > > on
> > > > > their
> > > > > > > > > > > configuration.
> > > > > > > > > > >    * experimental: This repository is an experimental
> > > staging
> > > > > > area
> > > > > > > > for
> > > > > > > > > code
> > > > > > > > > > > that will eventually end up in another repository. This
> > > code
> > > > is
> > > > > > not
> > > > > > > > > > > versioned and not registered with Maven Central.
> > > > > > > > > > >    * sketches-misc: Demos and other code not related to
> > > > > > production
> > > > > > > > > > > deployment
> > > > > > > > > > >
> > > > > > > > > > > * C++ and Python
> > > > > > > > > > >    * sketches-core-cpp: This is the C++/Python companion
> > to
> > > > the
> > > > > > Java
> > > > > > > > > > > sketches-core. These implementations are binary
> > compatible
> > > > with
> > > > > > their
> > > > > > > > > > > counterparts in Java. In other words, a sketch created
> > and
> > > > > > stored in
> > > > > > > > > C++
> > > > > > > > > > > > can be opened and read in Java and vice versa. This site
> > > also
> > > > > > has our
> > > > > > > > > > > Python adaptors that basically wrap the C++
> > > implementations,
> > > > > > making
> > > > > > > > the
> > > > > > > > > > > high performance C++ implementations available from
> > Python.
> > > > > > > > > > >    * sketches-postgres: This site provides the
> > > > > postgres-specific
> > > > > > > > > adaptors
> > > > > > > > > > > that wrap the C++ implementations making them available
> > to
> > > > the
> > > > > > > > Postgres
> > > > > > > > > > > database users.
> > > > > > > > > > >    * characterization-cpp: This is the C++/Python
> > companion
> > > > to
> > > > > > the
> > > > > > > > Java
> > > > > > > > > > > characterization repository.
> > > > > > > > > > >    * experimental-cpp: This repository is an experimental
> > > > > staging
> > > > > > > > area
> > > > > > > > > for
> > > > > > > > > > > C++ code that will eventually end up in another
> > repository.
> > > > > > > > > > >
> > > > > > > > > > > * Command-Line Tools
> > > > > > > > > > >    * sketches-cmd
> > > > > > > > > > >    * homebrew-sketches
> > > > > > > > > > >    * homebrew-sketches-cmd
> > > > > > > > > > >
> > > > > > > > > > > These projects have always been Apache 2.0 licensed. We
> > > > intend
> > > > > to
> > > > > > > > > bundle
> > > > > > > > > > > all of these repositories since they are all
> > complementary
> > > > and
> > > > > > should
> > > > > > > > > be
> > > > > > > > > > > maintained in one project. Prior to our submission, we
> > will
> > > > > > combine
> > > > > > > > > all of
> > > > > > > > > > > these projects into a new git repository.
> > > > > > > > > > >
> > > > > > > > > > > == Source and Intellectual Property Submission Plan ==
> > > > > > > > > > >
> > > > > > > > > > > Contributors to the DataSketches project have also signed
> > > the
> > > > > > Yahoo
> > > > > > > > > > > Individual Contributor License Agreement (
> > > > > > > > > > https://yahoocla.herokuapp.com/)
> > > > > > > > > > > in order to contribute to the project.
> > > > > > > > > > >
> > > > > > > > > > > With respect to trademark rights, Yahoo does not hold a
> > > > > > trademark on
> > > > > > > > > the
> > > > > > > > > > > phrase “DataSketches.” Based on feedback and guidance we
> > > > > receive
> > > > > > > > > during the
> > > > > > > > > > > incubation process, we are open to renaming the project
> > if
> > > > > > necessary
> > > > > > > > > for
> > > > > > > > > > > trademark or other concerns, but we would prefer not to
> > > have
> > > > to
> > > > > > do
> > > > > > > > > that.
> > > > > > > > > > >
> > > > > > > > > > > == External Dependencies ==
> > > > > > > > > > >
> > > > > > > > > > > > All external dependencies are licensed under an Apache 2.0 or
> > > > > > > > > > > > Apache-compatible license. As we grow the DataSketches community we will
> > > > > > > > > > > > configure our build process to require and validate that all contributions
> > > > > > > > > > > > and dependencies are licensed under the Apache 2.0 license or an
> > > > > > > > > > > > Apache-compatible license.
> > > > > > > > > > >
> > > > > > > > > > > == Required Resources ==
> > > > > > > > > > >
> > > > > > > > > > > === Mailing Lists ===
> > > > > > > > > > >
> > > > > > > > > > > We currently use a mix of mailing lists. We will migrate
> > > our
> > > > > > existing
> > > > > > > > > > > mailing lists to the following:
> > > > > > > > > > >
> > > > > > > > > > > * dev@datasketches.incubator.apache.org
> > > > > > > > > > >
> > > > > > > > > > > * user@datasketches.incubator.apache.org
> > > > > > > > > > >
> > > > > > > > > > > * private@datasketches.incubator.apache.org
> > > > > > > > > > >
> > > > > > > > > > > * commits@datasketches.incubator.apache.org
> > > > > > > > > > >
> > > > > > > > > > > === Source Control ===
> > > > > > > > > > >
> > > > > > > > > > > The DataSketches team currently uses Git and would like
> > to
> > > > > > continue
> > > > > > > > to
> > > > > > > > > do
> > > > > > > > > > > so. We request a Git repository for DataSketches with
> > > > mirroring
> > > > > > to
> > > > > > > > > GitHub
> > > > > > > > > > > > enabled, similar to the following:
> > > > > > > > > > >
> > > > > > > > > > > * https://github.com/apache/incubator-datasketches.git
> > > > > > > > > > >
> > > > > > > > > > > === Issue Tracking ===
> > > > > > > > > > >
> > > > > > > > > > > > We request the creation of an Apache-hosted JIRA. The DataSketches project
> > > > > > > > > > > > is currently using the public GitHub issue tracker and the public Google
> > > > > > > > > > > > Groups forum/sketches-user for issue tracking and discussions. We will
> > > > > > > > > > > > migrate and combine the content of these two sources into the Apache JIRA.
> > > > > > > > > > >
> > > > > > > > > > > Proposed Jira ID: DATASKETCHES
> > > > > > > > > > >
> > > > > > > > > > > == Initial Committers ==
> > > > > > > > > > >
> > > > > > > > > > > > The following individuals have been extremely active in our community and
> > > > > > > > > > > > should have write (commit) permissions to the repository.
> > > > > > > > > > >
> > > > > > > > > > > > * Eshcar Hillel [eshcar at verizonmedia dot com]
> > > > > > > > > > > >
> > > > > > > > > > > > * Kevin Lang [langk at verizonmedia dot com]
> > > > > > > > > > > >
> > > > > > > > > > > > * Roman Leventov [roman.leventov at c.metamarkets dot com]
> > > > > > > > > > > >
> > > > > > > > > > > > * Edo Liberty [libertye at amazon dot com]
> > > > > > > > > > > >
> > > > > > > > > > > > * Jon Malkin [jmalkin at verizonmedia dot com]
> > > > > > > > > > > >
> > > > > > > > > > > > * Lee Rhodes [lrhodes at verizonmedia dot com] & [leerho at gmail dot com]
> > > > > > > > > > > >
> > > > > > > > > > > > * Alexander Saydakov [saydakov at verizonmedia dot com]
> > > > > > > > > > > >
> > > > > > > > > > > > * Justin Thaler [justin.thaler at georgetown dot edu]
> > > > > > > > > > >
> > > > > > > > > > > == Affiliations ==
> > > > > > > > > > >
> > > > > > > > > > > The initial committers are from four organizations:
> > Yahoo,
> > > > > > Amazon,
> > > > > > > > > > > Georgetown University, and Metamarkets/Snap.
> > > > > > > > > > >
> > > > > > > > > > > === Champion ===
> > > > > > > > > > > (Recommended to me: )
> > > > > > > > > > >
> > > > > > > > > > > > Liang Chen, Vice President of Apache CarbonData, [chenliang613 at apache dot org]
> > > > > > > > > > > > Jean-Baptiste Onofré, [jb at nanthrax dot net]
> > > > > > > > > > >
> > > > > > > > > > > === Nominated Mentors ===
> > > > > > > > > > > (Recommended to me: )
> > > > > > > > > > >
> > > > > > > > > > > > Liang Chen, Vice President of Apache CarbonData, [chenliang613 at apache dot org]
> > > > > > > > > > > Jean-Baptiste Onofré, jb at nanthrax dot net
> > > > > > > > > > > Gil Yehuda, gyehuda at verizonmedia dot com
> > > > > > > > > > >
> > > > > > > > > > > === Sponsoring Entity ===
> > > > > > > > > > >
> > > > > > > > > > > > * The Apache Incubator    **** This is our 1st choice ****
> > > > > > > > > > >
> > > > > > > > > > > * Apache Druid. The incubating Apache Druid project might
> > > > also
> > > > > > be a
> > > > > > > > > logical
> > > > > > > > > > > sponsor. However, DataSketches has applications in many
> > > areas
> > > > > of
> > > > > > > > > computing
> > > > > > > > > > > outside of Druid so our preference and recommendation is
> > > that
> > > > > > > > > DataSketches
> > > > > > > > > > > would ultimately be a top-level Apache project.
> > > > > > > > > > >
> > > > > > > > > > > ________________
> > > > > > > > > > > [1] In 2017 Verizon acquired Yahoo and merged it with
> > > > > previously
> > > > > > > > > acquired
> > > > > > > > > > > AOL. The merged entity was originally called Oath, Inc.,
> > > but
> > > > > has
> > > > > > > > > recently
> > > > > > > > > > > been renamed Verizon Media, Inc., a wholly-owned
> > subsidiary
> > > > of
> > > > > > > > Verizon,
> > > > > > > > > > > Inc.  Since Yahoo is the more recognized name, references
> > > in
> > > > > this
> > > > > > > > > document
> > > > > > > > > > > to Yahoo, are also a reference to Verizon Media, Inc.
> > > > > > > > > > >
> > > > > > > > > > > On Fri, Feb 22, 2019 at 9:35 PM Kenneth Knowles <
> > > > > kenn@apache.org
> > > > > > >
> > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > The subject line has me interested already. Follow
> > > examples
> > > > > > like
> > > > > > > > this
> > > > > > > > > > > > maybe?
> > > > > > > > > > > >
> > > > > > > > > > > > 1.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > https://lists.apache.org/thread.html/a5db74cc9e5ae89b3bfa5f4b07bfcc18dae84b7098232fb897cd47b7@%3Cgeneral.incubator.apache.org%3E
> > > > > > > > > > > > 2.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > https://lists.apache.org/thread.html/5a7f6a218b11a1cac61fbd53f4c995fd7716f8ad3751cf9f171ebd57@%3Cgeneral.incubator.apache.org%3E
> > > > > > > > > > > >
> > > > > > > > > > > > Kenn
> > > > > > > > > > > >
> > > > > > > > > > > > On Fri, Feb 22, 2019 at 8:05 PM leerho <
> > leerho@gmail.com
> > > >
> > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > I'll try again ... :)
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Fri, Feb 22, 2019 at 8:00 PM Ted Dunning <
> > > > > > > > ted.dunning@gmail.com
> > > > > > > > > >
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > >> It didn't make it again
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> On Fri, Feb 22, 2019, 8:35 PM leerho <
> > > leerho@gmail.com>
> > > > > > wrote:
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> > I'm not sure the attached document made it
> > through.
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > On Fri, Feb 22, 2019 at 7:28 PM leerho <
> > > > > leerho@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >>
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > >
> > > > > ---------------------------------------------------------------------
> > > > > > > > > > > > > To unsubscribe, e-mail:
> > > > > > general-unsubscribe@incubator.apache.org
> > > > > > > > > > > > > For additional commands, e-mail:
> > > > > > > > general-help@incubator.apache.org
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > > --
> > > > > > > From my cell phone.
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > --
> > From my cell phone.
> >
> 


Re: DataSketches Proposal - Google Docs Link

Posted by Kenneth Knowles <ke...@apache.org>.

It isn't too much work, so I've done it:
https://s.apache.org/datasketches-proposal-draft

Kenn

On Mon, Feb 25, 2019 at 9:31 PM leerho <le...@gmail.com> wrote:

> Yes, I thought of that.  But it’s not like I’m being overwhelmed with
> requests to comment ... so far it has been only 3 or 4, and the requested
> changes have been minor.  I’m assuming that if there are no more
> substantive changes after this week that the document would be moved to the
> wiki archive, where, I presume, changes could still be made.
>
> I want to do the right thing here, so if you feel that the document would
> get much better feedback on an unrestricted gDoc site, I will set it up.
>
>
>
> On Mon, Feb 25, 2019 at 8:32 PM Jim Apple <jb...@cloudera.com.invalid>
> wrote:
>
> > You could use a Google account that is not under Yahoo’s control, then
> let
> > anyone in the world add a comment, maybe.
> >
> > On Mon, Feb 25, 2019 at 3:26 PM leerho <le...@gmail.com> wrote:
> >
> > > Ken,
> > > Yahoo does not allow me to create a shared link outside our company,
> > except
> > > to individual email addresses.  So attempting to share it to the email
> > > general@incubator.apache.org may not work.  Nonetheless, several
> > > individuals were able to request access using their individual email
> > > accounts and I was able to add them.  I will try to add you using
> > > kenn@apache.org, but if that doesn't work, I may need a gmail or
> > > equivalent
> > > account for you.
> > >
> > > Lee.
> > >
> > >
> > > On Mon, Feb 25, 2019 at 2:59 PM Kenneth Knowles <ke...@apache.org>
> wrote:
> > >
> > > > I could not access that document. I suggest you need to turn on link
> > > > sharing.
> > > >
> > > > Kenn
> > > >
> > > > On Mon, Feb 25, 2019 at 12:00 PM leerho@gmail.com <le...@gmail.com>
> > > > wrote:
> > > >
> > > > > Try this link:
> > > > >
> > > >
> > >
> >
> https://docs.google.com/document/d/19JKevzFQNcaLA51LFLUlP1hzdFDW7oDJrJO8N6weDv8/edit?usp=sharing
> > > > >
> > > > >
> > > > > On 2019/02/25 05:55:50, leerho <le...@gmail.com> wrote:
> > > > > > Yes I will try that tomorrow.
> > > > > >
> > > > > > On Sun, Feb 24, 2019 at 7:34 PM Kenneth Knowles <kenn@apache.org
> >
> > > > wrote:
> > > > > >
> > > > > > > Can you share the Google doc with the proposal? Per Ted's
> advice,
> > > we
> > > > > can
> > > > > > > iterate quickly there and move it to the wiki when it becomes a
> > bit
> > > > > more
> > > > > > > stable.
> > > > > > >
> > > > > > > Kenn
> > > > > > >
> > > > > > > On Fri, Feb 22, 2019 at 10:21 PM leerho@gmail.com <
> > > leerho@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > > Thanks for the offer.  I am a neophyte at this process and
> > email
> > > > > app!   I
> > > > > > > > could use a lot of help getting this off the ground!  Also,
> I'm
> > > not
> > > > > sure
> > > > > > > > that Mr. Chen and Mr. Onofré have fully accepted taking this
> on
> > > :)
> > > > > > > >
> > > > > > > > Lee.
> > > > > > > >
> > > > > > > > On 2019/02/23 06:03:58, Kenneth Knowles <ke...@apache.org>
> > wrote:
> > > > > > > > > Nice.
> > > > > > > > >
> > > > > > > > > I would very much like to help mentor this project, though
> > you
> > > > > already
> > > > > > > > have
> > > > > > > > > a couple good ones.
> > > > > > > > >
> > > > > > > > > I concur with incubator as sponsoring entity.
> > > > > > > > >
> > > > > > > > > Kenn (VP Apache Beam)
> > > > > > > > >
> > > > > > > > > On Fri, Feb 22, 2019 at 9:45 PM leerho <le...@gmail.com>
> > > wrote:
> > > > > > > > >
> > > > > > > > > > I didn't realize that this mail list does not accept PDF
> > > files,
> > > > > > > > apparently
> > > > > > > > > > only text.  So let me try one more time ... :)  Please
> let
> > me
> > > > > know if
> > > > > > > > > > this works!
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > = Apache DataSketches Proposal[1] =
> > > > > > > > > >
> > > > > > > > > > == Abstract ==
> > > > > > > > > >
> > > > > > > > > > DataSketches.GitHub.io is an open source,
> high-performance
> > > > > library
> > > > > > > of
> > > > > > > > > > stochastic streaming algorithms commonly called
> "sketches"
> > in
> > > > the
> > > > > > > data
> > > > > > > > > > sciences. Sketches are small, stateful programs that
> > process
> > > > > massive
> > > > > > > > data
> > > > > > > > > > as a stream and can provide approximate answers, with
> > > > > mathematical
> > > > > > > > > > guarantees, to computationally difficult queries
> > > > > orders-of-magnitude
> > > > > > > > faster
> > > > > > > > > > than traditional, exact methods.
> > > > > > > > > >
> > > > > > > > > > This proposal is to move DataSketches to the Apache
> > Software
> > > > > > > > > > Foundation(ASF) transferring ownership of its copyright
> > > > > intellectual
> > > > > > > > > > property to the ASF.  Thereafter, DataSketches would be
> > > > > officially
> > > > > > > > known as
> > > > > > > > > > Apache DataSketches and its evolution and governance
> would
> > > come
> > > > > under
> > > > > > > > the
> > > > > > > > > > rules and guidance of the ASF.
> > > > > > > > > >
> > > > > > > > > > == Introduction ==
> > > > > > > > > >
> > > > > > > > > > The DataSketches library contains carefully crafted
> > > > > implementations
> > > > > > > of
> > > > > > > > > > sketch algorithms that meet rigorous standards of quality
> > and
> > > > > > > > performance
> > > > > > > > > > and provide capabilities required for large-scale
> > production
> > > > > systems
> > > > > > > > that
> > > > > > > > > > must process and analyze massive data. The DataSketches
> > core
> > > > > > > > repository is
> > > > > > > > > > written in Java with a parallel core repository written
> in
> > > C++
> > > > > that
> > > > > > > > > > includes Python wrappers. The DataSketches library also
> > > > includes
> > > > > > > > special
> > > > > > > > > > repositories for extending the core library for Apache
> Hive
> > > and
> > > > > > > Apache
> > > > > > > > Pig.
> > > > > > > > > > The sketches developed in the different languages share a
> > > > common
> > > > > > > binary
> > > > > > > > > > storage format so that sketches created and stored in
> Java,
> > > for
> > > > > > > > example,
> > > > > > > > > > > can be fully used in C++, and vice versa.  Because the
> > stored
> > > > > sketch
> > > > > > > > > > "images" are just a "blob" of bytes (similar to picture
> > > > images),
> > > > > they
> > > > > > > > can
> > > > > > > > > > be shared across many different systems, languages and
> > > > platforms.
> > > > > > > > > >
> > > > > > > > > > The DataSketches documentation website,
> > > > > > > https://datasketches.github.io
> > > > > > > > ,
> > > > > > > > > > includes general tutorials, a comprehensive research
> > section
> > > > with
> > > > > > > > > > references to relevant academic papers, extensive
> examples
> > > for
> > > > > using
> > > > > > > > the
> > > > > > > > > > core library directly as well as examples for accessing
> the
> > > > > library
> > > > > > > in
> > > > > > > > > > Hive, Pig, and Apache Spark.
> > > > > > > > > >
> > > > > > > > > > The DataSketches library also includes a characterization
> > > > > repository
> > > > > > > > for
> > > > > > > > > > long running test programs that are used for studying
> > > accuracy
> > > > > and
> > > > > > > > > > performance of these sketches over wide ranges of input
> > > > > variables.
> > > > > > > The
> > > > > > > > data
> > > > > > > > > > produced by these programs is used for generating the
> many
> > > > > > > performance
> > > > > > > > > > plots contained in the documentation website and for
> > academic
> > > > > > > > > > publications.
> > > > > > > > > >
> > > > > > > > > > The code repositories used for production are versioned
> and
> > > > > published
> > > > > > > > to
> > > > > > > > > > Maven Central on periodic intervals as the library
> evolves.
> > > > > > > > > >
> > > > > > > > > > The DataSketches library also includes several
> experimental
> > > > > > > > repositories
> > > > > > > > > > for use-cases outside the large-scale systems
> environments,
> > > > such
> > > > > as
> > > > > > > > > > sketches for mobile, IoT devices (Android), command-line
> > > access
> > > > > of
> > > > > > > the
> > > > > > > > > > sketch library, and an experimental repository for
> > > vector-based
> > > > > > > > sketches
> > > > > > > > > > that performs approximate Singular Value Decomposition
> > (SVD)
> > > > > analysis
> > > > > > > > that
> > > > > > > > > > could potentially be used in Machine Learning (ML)
> > > > applications.
> > > > > > > > > >
> > > > > > > > > > == Background ==
> > > > > > > > > >
> > > > > > > > > > > The DataSketches library was started in 2012 as an internal
> > > Yahoo
> > > > > > > project
> > > > > > > > to
> > > > > > > > > > dramatically reduce time and resources required for
> > distinct
> > > > > (unique)
> > > > > > > > > > counting.  An extensive search on the Internet at the
> time
> > > > > yielded a
> > > > > > > > number
> > > > > > > > > > of theoretical papers on stochastic streaming algorithms
> > with
> > > > > > > > pseudocode
> > > > > > > > > > examples, but we did not find any usable open-source code
> > of
> > > > the
> > > > > > > > quality we
> > > > > > > > > > felt we needed for our internal production systems.  So
> we
> > > > > started a
> > > > > > > > small
> > > > > > > > > > project (one person) to develop our own sketches working
> > > > directly
> > > > > > > from
> > > > > > > > > > published theoretical papers.
> > > > > > > > > >
> > > > > > > > > > The DataSketches library was designed from the start with
> > the
> > > > > > > > objective of
> > > > > > > > > > making these algorithms, usually only described in
> > > theoretical
> > > > > > > papers,
> > > > > > > > > > easily accessible to systems developers for use in our
> > > internal
> > > > > > > > production
> > > > > > > > > > systems. By necessity, the code had to be of the highest
> > > > quality
> > > > > and
> > > > > > > > > > thoroughly tested. The wide variety of our internal
> > > production
> > > > > > > systems
> > > > > > > > > > drove the requirement that the sketch implementations had
> > to
> > > > > have an
> > > > > > > > > > absolute minimum of external, run-time dependencies in
> > order
> > > to
> > > > > > > > simplify
> > > > > > > > > > integration and troubleshooting.
> > > > > > > > > >
> > > > > > > > > > Our internal experiments demonstrated dramatic positive
> > > impact
> > > > > on the
> > > > > > > > > > performance of our systems.  As a result, the
> DataSketches
> > > > > library
> > > > > > > > quickly
> > > > > > > > > > evolved to include different types of sketches for
> > different
> > > > > types of
> > > > > > > > > > > queries, such as frequent-items (a.k.a. heavy-hitters)
> > > > > algorithms,
> > > > > > > > > > quantile/histogram algorithms, and weighted and
> unweighted
> > > > > sampling
> > > > > > > > > > algorithms.
> > > > > > > > > >
> > > > > > > > > > We quickly discovered that developing these sketch
> > algorithms
> > > > to
> > > > > be
> > > > > > > > truly
> > > > > > > > > > robust in production environments is quite difficult and
> > > > requires
> > > > > > > deep
> > > > > > > > > > understanding of the underlying mathematics and
> statistics
> > as
> > > > > well as
> > > > > > > > > > extensive experience in developing high quality code for
> > 24/7
> > > > > > > > production
> > > > > > > > > > systems. This is a difficult combination of skills for
> any
> > > one
> > > > > > > > organization
> > > > > > > > > > to collect and maintain over time. It became clear that
> > this
> > > > > > > technology
> > > > > > > > > > needed a community larger than Yahoo to evolve.  In
> > November,
> > > > > 2015,
> > > > > > > > this
> > > > > > > > > > factor, along with Yahoo’s strong experience and support
> of
> > > > open
> > > > > > > > source,
> > > > > > > > > > led to the decision to open source this technology under
> an
> > > > > Apache
> > > > > > > 2.0
> > > > > > > > > > license on GitHub. Since that time our community has
> > expanded
> > > > > > > > considerably
> > > > > > > > > > > and the key contributors to this effort include leading
> > > > research
> > > > > > > > > > scientists from a number of universities as well as
> > > > > practitioners and
> > > > > > > > > > researchers from a number of major corporations. The core
> > of
> > > > this
> > > > > > > > group is
> > > > > > > > > > very active as we meet weekly to discuss research
> > directions
> > > > and
> > > > > > > > > > engineering priorities.
> > > > > > > > > >
> > > > > > > > > > It is important to note that our internal systems at
> Yahoo
> > > use
> > > > > the
> > > > > > > > current
> > > > > > > > > > public GitHub open source DataSketches library and not an
> > > > > internal
> > > > > > > > version
> > > > > > > > > > of the code.
> > > > > > > > > >
> > > > > > > > > > The close collaboration of scientific research and
> > > engineering
> > > > > > > > development
> > > > > > > > > > experience with actual massive-data processing systems
> has
> > > also
> > > > > > > > produced
> > > > > > > > > > new research publications in the field of stochastic
> > > streaming
> > > > > > > > algorithms,
> > > > > > > > > > for example:
> > > > > > > > > >
> > > > > > > > > > * Daniel Anderson, Pryce Bevan, Kevin J. Lang, Edo
> Liberty,
> > > Lee
> > > > > > > > Rhodes, and
> > > > > > > > > > Justin Thaler. A high-performance algorithm for
> identifying
> > > > > frequent
> > > > > > > > items
> > > > > > > > > > in data streams. In ACM IMC 2017.
> > > > > > > > > >
> > > > > > > > > > * Anirban Dasgupta, Kevin J. Lang, Lee Rhodes, and Justin
> > > > > Thaler. A
> > > > > > > > > > framework for estimating stream expression cardinalities.
> > In
> > > > > > > > EDBT/ICDT
> > > > > > > > > > > Proceedings ‘16, pages 6:1–6:17, 2016.
> > > > > > > > > >
> > > > > > > > > > * Mina Ghashami, Edo Liberty, Jeff M. Phillips. Efficient
> > > > > Frequent
> > > > > > > > > > Directions Algorithm for Sparse Matrices. In ACM SIGKDD
> > > > > Proceedings
> > > > > > > > ‘16,
> > > > > > > > > > pages 845-854, 2016.
> > > > > > > > > >
> > > > > > > > > > * Zohar S. Karnin, Kevin J. Lang, and Edo Liberty.
> Optimal
> > > > > quantile
> > > > > > > > > > approximation in streams. In IEEE FOCS Proceedings ‘16,
> > pages
> > > > > 71–78,
> > > > > > > > 2016.
> > > > > > > > > >
> > > > > > > > > > * Kevin J Lang. Back to the future: an even more nearly
> > > optimal
> > > > > > > > cardinality
> > > > > > > > > > estimation algorithm. arXiv preprint
> > > > > > > https://arxiv.org/abs/1708.06839,
> > > > > > > > > > 2017.
> > > > > > > > > >
> > > > > > > > > > * Edo Liberty. Simple and deterministic matrix sketching.
> > In
> > > > ACM
> > > > > KDD
> > > > > > > > > > Proceedings ‘13, pages 581– 588, 2013.
> > > > > > > > > >
> > > > > > > > > > * Edo Liberty, Michael Mitzenmacher, Justin Thaler, and
> > > > Jonathan
> > > > > > > > Ullman.
> > > > > > > > > > Space lower bounds for itemset frequency sketches. In ACM
> > > PODS
> > > > > > > > Proceedings
> > > > > > > > > > ‘16, pages 441–454, 2016.
> > > > > > > > > >
> > > > > > > > > > * Michael Mitzenmacher, Thomas Steinke, and Justin
> Thaler.
> > > > > > > Hierarchical
> > > > > > > > > > heavy hitters with the space saving algorithm. In SIAM
> > ALENEX
> > > > > > > > Proceedings
> > > > > > > > > > ‘12, pages 160–174, 2012.
> > > > > > > > > >
> > > > > > > > > > == The Rationale for Sketches ==
> > > > > > > > > >
> > > > > > > > > > In the analysis of big data there are often problem
> queries
> > > > that
> > > > > > > don’t
> > > > > > > > > > scale because they require huge compute resources and
> time
> > to
> > > > > > > generate
> > > > > > > > > > exact results. Examples include count distinct,
> quantiles,
> > > most
> > > > > > > > frequent
> > > > > > > > > > items, joins, matrix computations, and graph analysis.
> > > > > > > > > >
> > > > > > > > > > If we can loosen the requirement of “exact” results from
> > our
> > > > > queries
> > > > > > > > and be
> > > > > > > > > > satisfied with approximate results, within some well
> > > understood
> > > > > > > bounds
> > > > > > > > of
> > > > > > > > > > error, there is an entire branch of mathematics and data
> > > > science
> > > > > that
> > > > > > > > has
> > > > > > > > > > evolved around developing algorithms that can produce
> > > > approximate
> > > > > > > > results
> > > > > > > > > > with mathematically well-defined error properties.
> > > > > > > > > >
> > > > > > > > > > > Adding the requirements that these algorithms be small
> > > > > > > > > > > (compared to the size of the input data), sublinear (the
> > > > > > > > > > > size of the sketch must grow at a slower rate than the size
> > > > > > > > > > > of the input stream), streaming (they can only touch each
> > > > > > > > > > > data item once), and mergeable (suitable for distributed
> > > > > > > > > > > processing) defines a class of algorithms that can be
> > > > > > > > > > > described as small, stochastic, streaming, sublinear,
> > > > > > > > > > > mergeable algorithms, commonly called sketches (they also
> > > > > > > > > > > have other names, but we will use the term sketches from
> > > > > > > > > > > here on).
> > > > > > > > > >
> > > > > > > > > > To be truly streaming and be able to process data in a
> > single
> > > > > pass,
> > > > > > > > > > sketches must make absolute minimum assumptions about the
> > > input
> > > > > > > stream.
> > > > > > > > > > This is critically important, as there is no “second
> > chance”
> > > to
> > > > > > > > process the
> > > > > > > > > > data.
> > > > > > > > > >
> > > > > > > > > > For example, sketches should not make assumptions about
> the
> > > > > order of
> > > > > > > > stream
> > > > > > > > > > items, the stream length, the dynamic range of values, or
> > the
> > > > > > > > distribution
> > > > > > > > > > of item occurrence frequencies. Sketches should be
> tolerant
> > > of
> > > > > NaNs,
> > > > > > > > Nulls
> > > > > > > > > > and empty objects. About the only thing that the sketch
> > needs
> > > > to
> > > > > know
> > > > > > > > about
> > > > > > > > > > the stream is how to extract items from it and what type
> > the
> > > > > item is,
> > > > > > > > e.g.,
> > > > > > > > > > is it a numeric value or a string.
> > > > > > > > > >
> > > > > > > > > > As far as the sketch is concerned, the input stream is a
> > > > > sequence of
> > > > > > > > items
> > > > > > > > > > in some unknown random order with unknown random values.
> > > > > > > > > >
> > > > > > > > > > The sketch is essentially a complex state machine and
> > > combined
> > > > > with
> > > > > > > the
> > > > > > > > > > random input stream defines a stochastic process. We then
> > > apply
> > > > > > > > > > probabilistic methods to interpret the states of the
> > > stochastic
> > > > > > > > process in
> > > > > > > > > > order to extract useful information about the input
> stream
> > > > > itself.
> > > > > > > The
> > > > > > > > > > resulting information will be approximate, but we also
> use
> > > > > additional
> > > > > > > > > > probabilistic methods to extract an estimate of the
> likely
> > > > > > > probability
> > > > > > > > > > distribution of error.
> > > > > > > > > >
> > > > > > > > > > There is a significant scientific contribution here that
> is
> > > > > defining
> > > > > > > > the
> > > > > > > > > > state machine, understanding the resulting stochastic
> > > process,
> > > > > > > > developing
> > > > > > > > > > > the probabilistic methods, and proving mathematically
> that
> > > it
> > > > > all
> > > > > > > > works!
> > > > > > > > > > This is why the scientific contributors to this project
> > are a
> > > > > > > critical
> > > > > > > > and
> > > > > > > > > > strategic component to our success.  The development
> > > engineers
> > > > > > > > translate
> > > > > > > > > > the concepts of the proposed state machine and
> > probabilistic
> > > > > methods
> > > > > > > > into
> > > > > > > > > > production-quality code. Even more important, they work
> > > closely
> > > > > with
> > > > > > > > the
> > > > > > > > > > scientists, feeding back system and user requirements,
> > which
> > > > > leads
> > > > > > > not
> > > > > > > > only
> > > > > > > > > > to superior product design, but to new science as well.
> A
> > > > > number of
> > > > > > > > > > scientific papers our members have published (see above)
> > > are a
> > > > > direct
> > > > > > > > result
> > > > > > > > > > of this close collaboration.
> > > > > > > > > >
> > > > > > > > > > Because sketches are small they can be processed
> extremely
> > > > fast,
> > > > > > > often
> > > > > > > > many
> > > > > > > > > > orders-of-magnitude faster than traditional exact
> > > computations.
> > > > > For
> > > > > > > > > > interactive queries there may not be other viable
> > > alternatives,
> > > > > and
> > > > > > > in
> > > > > > > > the
> > > > > > > > > > case of real-time analysis, sketches are the only known
> > > > solution.
> > > > > > > > > >
> > > > > > > > > > For any system that needs to extract useful information
> > from
> > > > > massive
> > > > > > > > data
> > > > > > > > > > sketches are essential tools that should be tightly
> > > integrated
> > > > > into
> > > > > > > the
> > > > > > > > > > system’s analysis capabilities. This technology has
> helped
> > > > Yahoo
> > > > > > > > > > successfully reduce data processing times from days to
> > hours
> > > or
> > > > > > > > minutes on
> > > > > > > > > > a number of its internal platforms and has enabled
> > subsecond
> > > > > queries
> > > > > > > on
> > > > > > > > > > real-time platforms that would have been infeasible
> without
> > > > > sketches.
> > > > > > > > > > >
> > > > > > > > > > > == The Rationale for Apache DataSketches ==
> > > > > > > > > > >
> > > > > > > > > > Other open source implementations of sketch algorithms
> can
> > be
> > > > > found
> > > > > > > on
> > > > > > > > the
> > > > > > > > > > Internet. However, we have not yet found any open source
> > > > > > > > implementations
> > > > > > > > > > that are as comprehensive, engineered with the quality
> > > required
> > > > > for
> > > > > > > > > > production systems, and with usable and guaranteed error
> > > > > properties.
> > > > > > > > Large
> > > > > > > > > > Internet companies, such as Google and Facebook, have
> > > published
> > > > > > > papers
> > > > > > > > on
> > > > > > > > > > > sketching; however, their implementations of their
> > published
> > > > > > > > algorithms are
> > > > > > > > > > proprietary and not available as open source.
> > > > > > > > > >
> > > > > > > > > > The DataSketches library already provides integrations
> > with a
> > > > > number
> > > > > > > of
> > > > > > > > > > major Apache data processing platforms such as Apache
> Hive,
> > > > > Apache
> > > > > > > Pig,
> > > > > > > > > > Apache Spark and Apache Druid, and is also integrated
> with
> > a
> > > > > number
> > > > > > > of
> > > > > > > > > > other open source data processing platforms such as
> Splice
> > > > > Machine,
> > > > > > > > GCHQ
> > > > > > > > > > Gaffer and PostgreSQL.
> > > > > > > > > >
> > > > > > > > > > We believe that having DataSketches as an Apache project
> > will
> > > > > provide
> > > > > > > > an
> > > > > > > > > > immediate, worthwhile, and substantial contribution to
> the
> > > open
> > > > > > > source
> > > > > > > > > > community, will have a better opportunity to provide a
> > > > meaningful
> > > > > > > > > > contribution to both the science and engineering of
> > sketching
> > > > > > > > algorithms,
> > > > > > > > > > and integrate with other Apache projects.  In addition,
> > this
> > > > is a
> > > > > > > > > > significant opportunity for Apache to be the "go-to"
> > > > destination
> > > > > for
> > > > > > > > users
> > > > > > > > > > that want to leverage this exciting technology.
> > > > > > > > > >
> > > > > > > > > > == Initial Goals ==
> > > > > > > > > >
> > > > > > > > > > We are breaking our initial goals into short-term (2-6
> > > months)
> > > > > and
> > > > > > > > > > > intermediate to long-term (6 months to 2 years):
> > > > > > > > > >
> > > > > > > > > > Our short-term goals include:
> > > > > > > > > >
> > > > > > > > > > * Understanding and adapting to the Apache development
> > > process
> > > > > and
> > > > > > > > > > structures.
> > > > > > > > > >
> > > > > > > > > > * Start refactoring codebase and move various
> DataSketches
> > > > > > > repositories
> > > > > > > > > > code to Apache Git repository.
> > > > > > > > > >
> > > > > > > > > > * Continue development of new features, functions, and
> > fixes.
> > > > > > > > > >
> > > > > > > > > > * Specific sub-projects (e.g., C++ and Python) will
> > continue
> > > to
> > > > > be
> > > > > > > > > > developed and expanded.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > The intermediate to long term goals include:
> > > > > > > > > >
> > > > > > > > > > * Completing the design and implementation of the C++
> > > sketches
> > > > to
> > > > > > > > > > complement what is already available in Java, and the
> > Python
> > > > > wrappers
> > > > > > > > of
> > > > > > > > > > those C++ sketches.
> > > > > > > > > >
> > > > > > > > > > * Expanding the C++ build framework to include Windows
> and
> > > the
> > > > > > > popular
> > > > > > > > > > Linux variants.
> > > > > > > > > >
> > > > > > > > > > * Continued engagement with the scientific research
> > community
> > > > on
> > > > > the
> > > > > > > > > > development of new algorithms for computationally
> difficult
> > > > > problems
> > > > > > > > that
> > > > > > > > > > heretofore have not had a sketching solution.
> > > > > > > > > >
> > > > > > > > > > == Current Status ==
> > > > > > > > > >
> > > > > > > > > > The DataSketches GitHub project has been quite
> successful.
> > > As
> > > > of
> > > > > > > this
> > > > > > > > > > writing (Feb, 2019) the number of downloads measured by
> the
> > > > Nexus
> > > > > > > > > > Repository Manager at https://oss.sonatype.org has grown
> > by
> > > > > nearly a
> > > > > > > > > > factor
> > > > > > > > > > of 10 over the past year to about 55 thousand per month.
> > The
> > > > > > > > > > DataSketches/sketches-core repository has about 560 stars
> > and
> > > > 141
> > > > > > > > forks,
> > > > > > > > > > which is pretty good for a highly specialized library.
> > > > > > > > > >
> > > > > > > > > > === Development Practices ===
> > > > > > > > > >
> > > > > > > > > > ==== Source Control ====
> > > > > > > > > >
> > > > > > > > > > All of our developers have extensive experience with Git
> > > > version
> > > > > > > > control
> > > > > > > > > > and follow accepted practices for use of Pull Requests
> > (PRs),
> > > > > code
> > > > > > > > reviews
> > > > > > > > > > and commits to master, for example.
> > > > > > > > > >
> > > > > > > > > > ==== Testing ====
> > > > > > > > > >
> > > > > > > > > > > Sketches, by their nature, are probabilistic programs and
> > > don’t
> > > > > > > > necessarily
> > > > > > > > > > behave deterministically.  For some of the sketches we
> > > > > intentionally
> > > > > > > > insert
> > > > > > > > > > random noise into the code as this gives us the
> > mathematical
> > > > > > > properties
> > > > > > > > > > that we need to guarantee accuracy.  This can make the
> > > behavior
> > > > > of
> > > > > > > > these
> > > > > > > > > > algorithms quite unintuitive and provides significant
> > > > challenges
> > > > > to
> > > > > > > the
> > > > > > > > > > developer who wishes to test these algorithms for
> > > correctness.
> > > > > As a
> > > > > > > > result,
> > > > > > > > > > our testing strategy includes two major components: unit
> > > tests,
> > > > > and
> > > > > > > > > > characterization tests.
> > > > > > > > > >
> > > > > > > > > > ===== Unit Testing =====
> > > > > > > > > >
> > > > > > > > > > Our unit tests are primarily quick tests to make sure
> that
> > we
> > > > > > > exercise
> > > > > > > > all
> > > > > > > > > > critical paths in the code and that key branches are
> > executed
> > > > > > > > correctly. It
> > > > > > > > > > is important that they execute relatively fast as they
> are
> > > > > generally
> > > > > > > > run on
> > > > > > > > > > every code build. The sketches-core repository alone has
> > > about
> > > > 22
> > > > > > > > thousand
> > > > > > > > > > statements, over 1300 unit tests and code coverage of
> about
> > > > > 98.2% as
> > > > > > > > > > measured by Atlassian/Clover.  It is our goal for all of
> > our
> > > > code
> > > > > > > > > > repositories that are used in production that they have
> > code
> > > > > coverage
> > > > > > > > > > greater than 90%.
> > > > > > > > > >
> > > > > > > > > > ===== Characterization Testing =====
> > > > > > > > > >
> > > > > > > > > > In order to test the probabilistic methods that are used
> to
> > > > > interpret
> > > > > > > > the
> > > > > > > > > > stochastic behaviors of our sketches we have a separate
> > > > > > > > characterization
> > > > > > > > > > repository that is dedicated to this.  To measure
> accuracy,
> > > for
> > > > > > > > example,
> > > > > > > > > > requires running thousands of trials at each of many
> > > different
> > > > > points
> > > > > > > > along
> > > > > > > > > > the domain axis. Each trial compares its estimated
> results
> > > > > against a
> > > > > > > > known
> > > > > > > > > > exact result producing an error for that trial.  These
> > error
> > > > > > > > measurements
> > > > > > > > > > are then fed into our Quantiles sketch to capture the
> > actual
> > > > > > > > distribution
> > > > > > > > > > of error at that point along the axis. We then select
> > > quantile
> > > > > > > contours
> > > > > > > > > > across all the distributions at points along the axis.
> > These
> > > > > > > contours
> > > > > > > > can
> > > > > > > > > > then be plotted to reveal the shape of the actual error
> > > > > distribution.
> > > > > > > > These
> > > > > > > > > > distributions are not at all Gaussian, in fact they can
> be
> > > > quite
> > > > > > > > complex.
> > > > > > > > > > Nonetheless, these distributions are then checked against
> > our
> > > > > > > > statistical
> > > > > > > > > > guarantees inherent to the specific sketch algorithm and
> > its
> > > > > > > > parameters.
> > > > > > > > > > There are many examples of these characterization error
> > > > > distributions
> > > > > > > > on
> > > > > > > > > > our website. The runtimes of these tests can be very long
> > and
> > > > can
> > > > > > > range
> > > > > > > > > > from many minutes to hours, and some can run for days.
> > > > > Currently, we
> > > > > > > > have
> > > > > > > > > > separate characterization repositories for Java and C++ /
> > > > Python.
> > > > > > > > > >
> > > > > > > > > > It is our goal that we perform this characterization
> > analysis
> > > > > for all
> > > > > > > > of
> > > > > > > > > > our sketches.  By definition, the code that runs these
> > > > > > > characterization
> > > > > > > > > > tests is open-source so others can run these tests as
> well.
> > > We
> > > > > do
> > > > > > > not
> > > > > > > > have
> > > > > > > > > > formal releases of this code (because it is not
> production
> > > > code)
> > > > > and
> > > > > > > > it is
> > > > > > > > > > not published to Maven Central.
> > > > > > > > > >
> > > > > > > > > > === Meritocracy ===
> > > > > > > > > >
> > > > > > > > > > DataSketches was initially developed based on
> requirements
> > > > within
> > > > > > > > Yahoo. As
> > > > > > > > > > a project on GitHub, DataSketches has received
> > contributions
> > > > from
> > > > > > > > numerous
> > > > > > > > > > individual developers from around the world, dedicated
> > > research
> > > > > work
> > > > > > > > from
> > > > > > > > > > senior scientists at Amazon and Visa, and academic
> > > researchers
> > > > > from
> > > > > > > > > > Georgetown University, Princeton, and MIT.
> > > > > > > > > >
> > > > > > > > > > As a project under incubation, we are committed to
> > expanding
> > > > our
> > > > > > > > effort to
> > > > > > > > > > build an environment which supports a meritocracy. We are
> > > > > focused on
> > > > > > > > > > engaging the community and other related projects for
> > support
> > > > and
> > > > > > > > > > > contributions. Moreover, we are committed to ensuring that
> > > > contributors
> > > > > and
> > > > > > > > > > committers to DataSketches come from a broad mix of
> > > > organizations
> > > > > > > > through a
> > > > > > > > > > merit-based decision process during incubation. We
> believe
> > > > > strongly
> > > > > > > in
> > > > > > > > the
> > > > > > > > > > DataSketches premise that fulfills the concept of a well
> > > > > engineered
> > > > > > > and
> > > > > > > > > > scientifically rigorous library that implements these
> > > powerful
> > > > > > > > algorithms
> > > > > > > > > > and are committed to growing an inclusive community of
> > > > > DataSketches
> > > > > > > > > > contributors and users.
> > > > > > > > > >
> > > > > > > > > > === Community ===
> > > > > > > > > >
> > > > > > > > > > Yahoo has a long history and active engagement in the
> Open
> > > > Source
> > > > > > > > > > community. Major projects include: Vespa.ai, Bullet,
> > Moloch,
> > > > > > > Panoptes,
> > > > > > > > > > Screwdriver.cd, Athenz, HaloDB, Maha, Mendel,
> > > > TensorFlowOnSpark,
> > > > > > > > gifshot,
> > > > > > > > > > fluxible, as well as the creation, contribution and
> > > incubation
> > > > of
> > > > > > > many
> > > > > > > > > > Apache projects such as Apache Hadoop, Pig, Bookkeeper,
> > > Oozie,
> > > > > > > > Zookeeper,
> > > > > > > > > > Omid, Pulsar, Traffic Server, Storm, Druid, and many
> more.
> > > > > > > > > >
> > > > > > > > > > > Every day, DataSketches is actively used by
> organizations
> > > and
> > > > > > > > > > institutions around the world for batch and stream
> > processing
> > > > of
> > > > > > > data.
> > > > > > > > We
> > > > > > > > > > believe acceptance will allow us to consolidate existing
> > > > > > > > > > DataSketches-related work, grow the DataSketches
> community,
> > > and
> > > > > > > deepen
> > > > > > > > > > connections between DataSketches and other open source
> > > > projects.
> > > > > > > > > >
> > > > > > > > > > === Introduction to the Core Developers & Contributors
> ===
> > > > > > > > > >
> > > > > > > > > > The core developers and contributors for DataSketches are
> > > from
> > > > > > > diverse
> > > > > > > > > > backgrounds, but primarily are scientists that love
> > > engineering
> > > > > and
> > > > > > > > > > engineers that love science. A large part of the value we
> > > bring
> > > > > comes
> > > > > > > > from
> > > > > > > > > > this synthesis.  These individuals have already
> contributed
> > > > > > > > substantially
> > > > > > > > > > to the code, algorithms, and/or mathematical proofs that
> > form
> > > > the
> > > > > > > > basis of
> > > > > > > > > > the library.
> > > > > > > > > >
> > > > > > > > > > This core group also form the Initial Committers with
> write
> > > > > > > > permissions to
> > > > > > > > > > > the repository. Those marked with (*) meet weekly to plan
> > the
> > > > > > > research
> > > > > > > > and
> > > > > > > > > > engineering direction of the project.
> > > > > > > > > >
> > > > > > > > > > ==== Scientists That Love Engineering ====
> > > > > > > > > >
> > > > > > > > > > * Eshcar Hillel: Senior Research Scientist, Yahoo Labs,
> > > Israel.
> > > > > > > > Interests:
> > > > > > > > > > distributed systems, scalable systems and platforms for
> big
> > > > data
> > > > > > > > > > > processing, concurrent algorithms and data structures.
> > > > > > > > > >
> > > > > > > > > > * Kevin Lang: (*) Distinguished Research Scientist, Yahoo
> > > Labs,
> > > > > > > > Sunnyvale,
> > > > > > > > > > California. Interests: algorithms, theoretical and
> applied
> > > > > > > mathematics,
> > > > > > > > > > encoding and compression theory, theoretical and applied
> > > > > performance
> > > > > > > > > > optimization.
> > > > > > > > > >
> > > > > > > > > > * Edo Liberty: (*) Director of Research, Head of Amazon
> AI
> > > > Labs,
> > > > > Palo
> > > > > > > > Alto,
> > > > > > > > > > California. Manages the algorithms group at Amazon AI. We
> > > build
> > > > > > > > scalable
> > > > > > > > > > machine learning systems and algorithms which are used
> both
> > > > > > > internally
> > > > > > > > and
> > > > > > > > > > externally by customers of SageMaker, AWS's flagship
> > machine
> > > > > learning
> > > > > > > > > > platform.
> > > > > > > > > >
> > > > > > > > > > * Jon Malkin: (*) Senior Scientist, Yahoo Labs,
> Sunnyvale.
> > > > > Interests:
> > > > > > > > > > Computational advertising, machine learning, speech
> > > > recognition,
> > > > > > > > > > data-driven analysis, large scale experimentation, big
> > data,
> > > > > > > > stream/complex
> > > > > > > > > > event processing
> > > > > > > > > >
> > > > > > > > > > * Justin Thaler: (*) Assistant Professor, Department of
> > > > Computer
> > > > > > > > Science,
> > > > > > > > > > Georgetown University, Washington D.C. Interests:
> > algorithms
> > > > and
> > > > > > > > > > computational complexity, complexity theory, quantum
> > > > algorithms,
> > > > > > > > private
> > > > > > > > > > data analysis, and learning theory, developing efficient
> > > > > streaming
> > > > > > > and
> > > > > > > > > > sketching algorithms
> > > > > > > > > >
> > > > > > > > > > ==== Engineers That Love Science ====
> > > > > > > > > >
> > > > > > > > > > * Roman Leventov: Senior Software Engineer,  Metamarkets
> /
> > > > Snap.
> > > > > > > > Interests:
> > > > > > > > > > design and implementation of data storing and data
> > processing
> > > > > > > > (distributed)
> > > > > > > > > > systems, performance optimization, CPU performance,
> > > mechanical
> > > > > > > > sympathy,
> > > > > > > > > > JVM performance, API design, databases, (concurrent) data
> > > > > structures,
> > > > > > > > > > memory management, garbage collection algorithms,
> language
> > > > > design and
> > > > > > > > > > runtimes (their tradeoffs), distributed systems (cloud)
> > > > > efficiency,
> > > > > > > > Linux,
> > > > > > > > > > code quality, code transformation, pure functional
> > > programming
> > > > > > > models,
> > > > > > > > > > Haskell.
> > > > > > > > > >
> > > > > > > > > > * Lee Rhodes: (*) Distinguished Architect, lead developer
> > and
> > > > > founder
> > > > > > > > of
> > > > > > > > > > the DataSketches project, Yahoo, Sunnyvale, California.
> > > > > Interests:
> > > > > > > > > > streaming algorithms, mathematics, computer science, high
> > > > > quality and
> > > > > > > > high
> > > > > > > > > > performance code for the analysis of massive data,
> bridging
> > > the
> > > > > > > divide
> > > > > > > > > > between theory and practice.
> > > > > > > > > >
> > > > > > > > > > * Alexander Saydakov: (*) Senior Software Engineer,
> Yahoo,
> > > > > Sunnyvale,
> > > > > > > > > > California. Interests: applied mathematics, computer
> > science,
> > > > big
> > > > > > > data,
> > > > > > > > > > distributed systems.
> > > > > > > > > >
> > > > > > > > > > === Introduction to Additional Interested Contributors
> ===
> > > > > > > > > >
> > > > > > > > > > These folks have been intermittently involved and
> > > contributed,
> > > > > but
> > > > > > > are
> > > > > > > > > > strong supporters of this project.
> > > > > > > > > >
> > > > > > > > > > * Frank Grimes: GitHub ID: frankgrimes97
> > > > > > > > > >
> > > > > > > > > > * Mina Ghashami: [mina.ghashami at gmail dot com] Ph.D.
> > > > Computer
> > > > > > > > Science,
> > > > > > > > > > Univ of Utah. Interests: Machine Learning, Data Mining,
> > > matrix
> > > > > > > > > > approximation, streaming algorithms, randomized linear
> > > algebra.
> > > > > > > > > >
> > > > > > > > > > * Christopher Musco: [christopher.musco at gmail dot com]
> > > Ph.D.
> > > > > > > > Computer
> > > > > > > > > > Science, Research Instructor, Princeton University.
> > > Interests:
> > > > > > > > algorithmic
> > > > > > > > > > foundations of data science and machine learning,
> efficient
> > > > > methods
> > > > > > > for
> > > > > > > > > > processing and understanding large datasets, often
> working
> > at
> > > > the
> > > > > > > > > > intersection of theoretical computer science, numerical
> > > linear
> > > > > > > > algebra, and
> > > > > > > > > > optimization.
> > > > > > > > > >
> > > > > > > > > > * Graham Cormode: [g.cormode at warwick.ac dot uk] Ph.D.
> > > > > Computer
> > > > > > > > Science,
> > > > > > > > > > Professor, Warwick University, Warwick, England.
> Interests:
> > > all
> > > > > > > > aspects of
> > > > > > > > > > the "data lifecycle", from data collection and cleaning,
> > > > through
> > > > > > > > mining and
> > > > > > > > > > analytics. (Professor Cormode is one of the world’s
> leading
> > > > > > > scientists
> > > > > > > > in
> > > > > > > > > > sketching algorithms)
> > > > > > > > > >
> > > > > > > > > > === Alignment ===
> > > > > > > > > >
> > > > > > > > > > The DataSketches library already provides integrations
> and
> > > > > example
> > > > > > > > code for
> > > > > > > > > > > Apache Hive, Apache Pig, and Apache Spark, and is deeply
> > > integrated
> > > > > into
> > > > > > > > Apache
> > > > > > > > > > Druid.
> > > > > > > > > >
> > > > > > > > > > == Known Risks ==
> > > > > > > > > >
> > > > > > > > > > The following subsections are specific risks that have
> been
> > > > > > > identified
> > > > > > > > by
> > > > > > > > > > the ASF that need to be addressed.
> > > > > > > > > >
> > > > > > > > > > === Risk: Orphaned Products ===
> > > > > > > > > >
> > > > > > > > > > The DataSketches library is presently used by a number of
> > > > > > > > organizations,
> > > > > > > > > > from small startups to Fortune 100 companies, to
> construct
> > > > > production
> > > > > > > > > > pipelines that must process and analyze massive data.
> Yahoo
> > > > has a
> > > > > > > > long-term
> > > > > > > > > > commitment to continue to advance the DataSketches
> library;
> > > > > moreover,
> > > > > > > > > > DataSketches is seeing increasing interest, development,
> > and
> > > > > adoption
> > > > > > > > from
> > > > > > > > > > many diverse organizations from around the world. Due to
> > its
> > > > > growing
> > > > > > > > > > adoption, we feel it is quite unlikely that this project
> > > would
> > > > > become
> > > > > > > > > > orphaned.
> > > > > > > > > >
> > > > > > > > > > === Risk: Inexperience with Open Source ===
> > > > > > > > > >
> > > > > > > > > > Yahoo believes strongly in open source and the exchange
> of
> > > > > > > information
> > > > > > > > to
> > > > > > > > > > advance new ideas and work. Examples of this commitment
> are
> > > > > active
> > > > > > > open
> > > > > > > > > > source projects such as those mentioned above. With
> > > > > DataSketches, we
> > > > > > > > have
> > > > > > > > > > been increasingly open and forward-looking; we have
> > > published a
> > > > > > > number
> > > > > > > > of
> > > > > > > > > > papers about breakthrough developments in the science of
> > > > > streaming
> > > > > > > > > > algorithms (mentioned above) that also reference the
> > > > DataSketches
> > > > > > > > library.
> > > > > > > > > > Our submission to the Apache Software Foundation is a
> > logical
> > > > > > > > extension of
> > > > > > > > > > our commitment to open source software.
> > > > > > > > > >
> > > > > > > > > > Key committers at Yahoo with strong open source
> backgrounds
> > > > > include
> > > > > > > > Aaron
> > > > > > > > > > Gresch, Alan Carroll, Alessandro Bellina, Anastasia
> > > Braginsky,
> > > > > > > Andrews
> > > > > > > > > > Sahaya Albert, Arun S A G, Atul Mohan, Brad McMillen,
> Bryan
> > > > Call,
> > > > > > > Daryn
> > > > > > > > > > Sharp, Dav Glass, David Carlin, Derek Dagit, Eric Payne,
> > > Eshcar
> > > > > > > Hillel,
> > > > > > > > > > Ethan Li, Fei Deng, Francis Christopher Liu, Francisco
> > > > > > > Perez-Sorrosal,
> > > > > > > > Gil
> > > > > > > > > > > Yehuda, Govind Menon, Hang Yang, Jacob Estelle, Jai
> Asher,
> > > > James
> > > > > > > > Penick,
> > > > > > > > > > Jason Kenny, Jay Pipes, Jim Rollenhagen, Joe Francis, Jon
> > > > Eagles,
> > > > > > > > Kihwal
> > > > > > > > > > Lee, Kishorkumar Patil, Koji Noguchi, Kuhu Shukla,
> Michael
> > > > > Trelinski,
> > > > > > > > > > Mithun Radhakrishnan, Nathan Roberts, Ohad Shacham, Olga
> L.
> > > > > > > Natkovich,
> > > > > > > > > > Parth Kamlesh Gandhi, Rajan Dhabalia, Rohini Palaniswamy,
> > > Ruby
> > > > > Loo,
> > > > > > > > Ryan
> > > > > > > > > > Bridges, Sanket Chintapalli, Satish Subhashrao Saley, Shu
> > Kit
> > > > > Chan,
> > > > > > > Sri
> > > > > > > > > > Harsha Mekala, Susan Hinrichs, Yonatan Gottesman, and
> many
> > > > more.
> > > > > > > > > >
> > > > > > > > > > > All of our core developers are committed to learning about
> the
> > > > > Apache
> > > > > > > > process
> > > > > > > > > > and to give back to the community.
> > > > > > > > > >
> > > > > > > > > > === Risk: Homogeneous Developers ===
> > > > > > > > > >
> > > > > > > > > > The majority of committers in this proposal belong to
> Yahoo
> > > due
> > > > > to
> > > > > > > the
> > > > > > > > fact
> > > > > > > > > > that DataSketches has emerged from an internal Yahoo
> > project.
> > > > > This
> > > > > > > > proposal
> > > > > > > > > > also includes developers and contributors from other
> > > companies,
> > > > > and
> > > > > > > > who are
> > > > > > > > > > actively involved with other Apache projects, such as
> > Druid.
> > > > We
> > > > > > > > expect our
> > > > > > > > > > entry into incubation will allow us to expand the number
> of
> > > > > > > > individuals and
> > > > > > > > > > organizations participating in DataSketches development.
> > > > > > > > > >
> > > > > > > > > > === Risk: Reliance on Salaried Developers ===
> > > > > > > > > >
> > > > > > > > > > Because the DataSketches library originated within Yahoo,
> > it
> > > > has
> > > > > been
> > > > > > > > > > developed primarily by salaried Yahoo developers and we
> > > expect
> > > > > that
> > > > > > > to
> > > > > > > > > > continue to be the case near term. However, since we
> placed
> > > > this
> > > > > > > > library
> > > > > > > > > > into open-source we have had a number of significant
> > > > > contributions
> > > > > > > from
> > > > > > > > > > engineers and scientists from outside of Yahoo. We expect
> > our
> > > > > > > reliance
> > > > > > > > on
> > > > > > > > > > Yahoo salaried developers will decrease over time.
> > > Nonetheless,
> > > > > Yahoo
> > > > > > > > is
> > > > > > > > > > committed to continue its strong support of this
> important
> > > > > project.
> > > > > > > > > >
> > > > > > > > > > === Risk: Lack of Relationship to other Apache Products
> ===
> > > > > > > > > >
> > > > > > > > > > DataSketches already directly interoperates with or
> > utilizes
> > > > > several
> > > > > > > > > > existing Apache projects.
> > > > > > > > > >
> > > > > > > > > > * Build
> > > > > > > > > >    * Apache Maven
> > > > > > > > > >
> > > > > > > > > > * Integrations and adaptors for the following projects
> > > > naturally
> > > > > have
> > > > > > > > them
> > > > > > > > > > as dependencies
> > > > > > > > > >    * Apache Hive
> > > > > > > > > >    * Apache Pig
> > > > > > > > > >    * Apache Druid
> > > > > > > > > >    * Apache Spark
> > > > > > > > > >
> > > > > > > > > > * Additional dependencies for the above integrations and
> > > > adaptors
> > > > > > > > include
> > > > > > > > > >    * Apache Hadoop
> > > > > > > > > >    * Apache Commons (Math)
> > > > > > > > > >
> > > > > > > > > > There is no other Apache project that we are aware of
> that
> > > > > duplicates
> > > > > > > > the
> > > > > > > > > > functionality of the DataSketches library.
> > > > > > > > > >
> > > > > > > > > > === Risk: An Excessive Fascination with the Apache Brand
> > ===
> > > > > > > > > >
> > > > > > > > > > With this proposal we are not seeking attention or
> > publicity.
> > > > > Rather,
> > > > > > > > we
> > > > > > > > > > firmly believe in the DataSketches library and concept
> and
> > > the
> > > > > > > ability
> > > > > > > > to
> > > > > > > > > > make the DataSketches library a powerful, yet
> simple-to-use
> > > > > toolkit
> > > > > > > for
> > > > > > > > > > data processing. While the DataSketches library has been
> > open
> > > > > source,
> > > > > > > > we
> > > > > > > > > > believe putting code on GitHub can only go so far. We see
> > the
> > > > > Apache
> > > > > > > > > > community, processes, and mission as critical for
> ensuring
> > > the
> > > > > > > > DataSketches
> > > > > > > > > > library is truly community-driven, positively impactful,
> > and
> > > > > > > innovative
> > > > > > > > > > open source software. While Yahoo has taken a number of
> > steps
> > > > to
> > > > > > > > advance
> > > > > > > > > > its various open source projects, we believe the
> > DataSketches
> > > > > library
> > > > > > > > > > project is a great fit for the Apache Software Foundation
> > due
> > > > to
> > > > > its
> > > > > > > > focus
> > > > > > > > > > on data processing and its relationships to existing ASF
> > > > > projects.
> > > > > > > > > >
> > > > > > > > > > === Risk: Cryptography ===
> > > > > > > > > >
> > > > > > > > > > DataSketches does not contain any cryptographic code and
> is
> > > > not a
> > > > > > > > > > cryptographic product.
> > > > > > > > > >
> > > > > > > > > > == Documentation ==
> > > > > > > > > >
> > > > > > > > > > The following documentation is relevant to this proposal.
> > > > > Relevant
> > > > > > > > portions
> > > > > > > > > > of the documentation will be contributed to the Apache
> > > > > DataSketches
> > > > > > > > > > project.
> > > > > > > > > >
> > > > > > > > > > * DataSketches website: https://datasketches.github.io.
> > > > > > > > > >
> > > > > > > > > > * DataSketches website repository:
> > > > > > > > > > https://github.com/DataSketches/DataSketches.github.io
> > > > > > > > > >
> > > > > > > > > > We will need an apache website for this documentation
> > similar
> > > > to
> > > > > > > > > >
> > > > > > > > > > * https://datasketches.apache.org
> > > > > > > > > >
> > > > > > > > > > == Initial Source ==
> > > > > > > > > >
> > > > > > > > > > The initial source for DataSketches which we will submit
> to
> > > the
> > > > > > > Apache
> > > > > > > > > > Foundation will include a number of repositories which
> are
> > > > > currently
> > > > > > > > hosted
> > > > > > > > > > under the GitHub.com/datasketches organization:
> > > > > > > > > >
> > > > > > > > > > All github.com/datasketches repositories including:
> > > > > > > > > >
> > > > > > > > > > * Java
> > > > > > > > > >    * sketches-core: This repository has the core
> sketching
> > > > > classes,
> > > > > > > > which
> > > > > > > > > > are leveraged by some of the other repositories. This
> > > > repository
> > > > > has
> > > > > > > no
> > > > > > > > > > external dependencies outside of the DataSketches/memory
> > > > > repository,
> > > > > > > > Java
> > > > > > > > > > and TestNG for unit tests. This code is versioned and the
> > > > latest
> > > > > > > > release
> > > > > > > > > > can be obtained from Maven Central.
> > > > > > > > > >    * memory: Low level, high-performance memory
> > > data-structure
> > > > > > > > management
> > > > > > > > > > primarily for off-heap.
> > > > > > > > > >    * sketches-android: This is a new repository dedicated
> > to
> > > > > sketches
> > > > > > > > > > designed to be run in a mobile client, such as a cell
> > phone.
> > > It
> > > > > is
> > > > > > > > still in
> > > > > > > > > > development and should be considered experimental.
> > > > > > > > > >    * sketches-hive: This repository contains Hive UDFs
> and
> > > > UDAFs
> > > > > for
> > > > > > > > use
> > > > > > > > > > within Hadoop grid environments. This code has
> dependencies
> > > on
> > > > > > > > > > sketches-core as well as Hadoop and Hive. Users of this
> > code
> > > > are
> > > > > > > > advised to
> > > > > > > > > > use Maven to bring in all the required dependencies. This
> > > code
> > > > is
> > > > > > > > versioned
> > > > > > > > > > and the latest release can be obtained from Maven
> Central.
> > > > > > > > > >    * sketches-pig: This repository contains Pig User
> > Defined
> > > > > > > Functions
> > > > > > > > > > (UDF) for use within Hadoop grid environments. This code
> > has
> > > > > > > > dependencies
> > > > > > > > > > on sketches-core as well as Hadoop and Pig. Users of this
> > > code
> > > > > are
> > > > > > > > advised
> > > > > > > > > > to use Maven to bring in all the required dependencies.
> > This
> > > > > code is
> > > > > > > > > > versioned and the latest release can be obtained from
> Maven
> > > > > Central.
> > > > > > > > > >    * sketches-vector: This is a new repository dedicated
> to
> > > > > sketches
> > > > > > > > for
> > > > > > > > > > vector and matrix operations. It is still somewhat
> > > > experimental.
> > > > > > > > > >    * characterization: This relatively new repository is
> > for
> > > > code
> > > > > > > that
> > > > > > > > we
> > > > > > > > > > use to characterize the accuracy and speed performance of
> > the
> > > > > > > sketches
> > > > > > > > in
> > > > > > > > > > the library and is constantly being updated. Examples of
> > the
> > > > job
> > > > > > > > command
> > > > > > > > > > files used for various tests can be found in the
> > > > > src/main/resources
> > > > > > > > > > > directory. Some of these tests can run for hours depending on their
> > > > > > > > > > > configuration.
> > > > > > > > > >    * experimental: This repository is an experimental
> > staging
> > > > > area
> > > > > > > for
> > > > > > > > code
> > > > > > > > > > that will eventually end up in another repository. This
> > code
> > > is
> > > > > not
> > > > > > > > > > versioned and not registered with Maven Central.
> > > > > > > > > >    * sketches-misc: Demos and other code not related to
> > > > > production
> > > > > > > > > > deployment
> > > > > > > > > >
> > > > > > > > > > * C++ and Python
> > > > > > > > > >    * sketches-core-cpp: This is the C++/Python companion
> to
> > > the
> > > > > Java
> > > > > > > > > > sketches-core. These implementations are binary
> compatible
> > > with
> > > > > their
> > > > > > > > > > counterparts in Java. In other words, a sketch created
> and
> > > > > stored in
> > > > > > > > C++
> > > > > > > > > > > can be opened and read in Java and vice versa. This site
> > also
> > > > > has our
> > > > > > > > > > Python adaptors that basically wrap the C++
> > implementations,
> > > > > making
> > > > > > > the
> > > > > > > > > > high performance C++ implementations available from
> Python.
> > > > > > > > > >    * sketches-postgres: This site provides the
> > > > postgres-specific
> > > > > > > > adaptors
> > > > > > > > > > that wrap the C++ implementations making them available
> to
> > > the
> > > > > > > Postgres
> > > > > > > > > > database users.
> > > > > > > > > >    * characterization-cpp: This is the C++/Python
> companion
> > > to
> > > > > the
> > > > > > > Java
> > > > > > > > > > characterization repository.
> > > > > > > > > >    * experimental-cpp: This repository is an experimental
> > > > staging
> > > > > > > area
> > > > > > > > for
> > > > > > > > > > C++ code that will eventually end up in another
> repository.
> > > > > > > > > >
> > > > > > > > > > * Command-Line Tools
> > > > > > > > > >    * sketches-cmd
> > > > > > > > > >    * homebrew-sketches
> > > > > > > > > >    * homebrew-sketches-cmd
> > > > > > > > > >
> > > > > > > > > > These projects have always been Apache 2.0 licensed. We
> > > intend
> > > > to
> > > > > > > > bundle
> > > > > > > > > > all of these repositories since they are all
> complementary
> > > and
> > > > > should
> > > > > > > > be
> > > > > > > > > > maintained in one project. Prior to our submission, we
> will
> > > > > combine
> > > > > > > > all of
> > > > > > > > > > these projects into a new git repository.
> > > > > > > > > >
> > > > > > > > > > == Source and Intellectual Property Submission Plan ==
> > > > > > > > > >
> > > > > > > > > > Contributors to the DataSketches project have also signed
> > the
> > > > > Yahoo
> > > > > > > > > > > Individual Contributor License Agreement (https://yahoocla.herokuapp.com/)
> > > > > > > > > > > in order to contribute to the project.
> > > > > > > > > >
> > > > > > > > > > With respect to trademark rights, Yahoo does not hold a
> > > > > trademark on
> > > > > > > > the
> > > > > > > > > > phrase “DataSketches.” Based on feedback and guidance we
> > > > receive
> > > > > > > > during the
> > > > > > > > > > incubation process, we are open to renaming the project
> if
> > > > > necessary
> > > > > > > > for
> > > > > > > > > > trademark or other concerns, but we would prefer not to
> > have
> > > to
> > > > > do
> > > > > > > > that.
> > > > > > > > > >
> > > > > > > > > > == External Dependencies ==
> > > > > > > > > >
> > > > > > > > > > All external dependencies are licensed under an Apache
> 2.0
> > or
> > > > > > > > > > Apache-compatible license. As we grow the DataSketches
> > > > community
> > > > > we
> > > > > > > > will
> > > > > > > > > > > configure our build process to require and validate that all
> > > > > contributions
> > > > > > > > and
> > > > > > > > > > dependencies are licensed under the Apache 2.0 license or
> > are
> > > > > under
> > > > > > > an
> > > > > > > > > > Apache-compatible license.
> > > > > > > > > >
> > > > > > > > > > == Required Resources ==
> > > > > > > > > >
> > > > > > > > > > === Mailing Lists ===
> > > > > > > > > >
> > > > > > > > > > We currently use a mix of mailing lists. We will migrate
> > our
> > > > > existing
> > > > > > > > > > mailing lists to the following:
> > > > > > > > > >
> > > > > > > > > > * dev@datasketches.incubator.apache.org
> > > > > > > > > >
> > > > > > > > > > * user@datasketches.incubator.apache.org
> > > > > > > > > >
> > > > > > > > > > * private@datasketches.incubator.apache.org
> > > > > > > > > >
> > > > > > > > > > * commits@datasketches.incubator.apache.org
> > > > > > > > > >
> > > > > > > > > > === Source Control ===
> > > > > > > > > >
> > > > > > > > > > The DataSketches team currently uses Git and would like
> to
> > > > > continue
> > > > > > > to
> > > > > > > > do
> > > > > > > > > > so. We request a Git repository for DataSketches with
> > > mirroring
> > > > > to
> > > > > > > > GitHub
> > > > > > > > > > > enabled, similar to the following:
> > > > > > > > > >
> > > > > > > > > > * https://github.com/apache/incubator-datasketches.git
> > > > > > > > > >
> > > > > > > > > > === Issue Tracking ===
> > > > > > > > > >
> > > > > > > > > > We request the creation of an Apache-hosted JIRA. The
> > > > > DataSketches
> > > > > > > > project
> > > > > > > > > > is currently using the public GitHub issue tracker and
> the
> > > > public
> > > > > > > > Google
> > > > > > > > > > Groups forum/sketches-user for issue tracking and
> > > discussions.
> > > > We
> > > > > > > will
> > > > > > > > > > migrate and combine from these two sources to the Apache
> > > JIRA.
> > > > > > > > > >
> > > > > > > > > > Proposed Jira ID: DATASKETCHES
> > > > > > > > > >
> > > > > > > > > > == Initial Committers ==
> > > > > > > > > >
> > > > > > > > > > The following list of individuals have been extremely
> > active
> > > in
> > > > > our
> > > > > > > > > > community and should have write (commit) permissions to
> the
> > > > > > > repository.
> > > > > > > > > >
> > > > > > > > > > * Eshcar Hillel                      [eshcar at
> > verizonmedia
> > > > dot
> > > > > com]
> > > > > > > > > >
> > > > > > > > > > * Kevin Lang                    [langk at verizonmedia
> dot
> > > com]
> > > > > > > > > >
> > > > > > > > > > * Roman Leventov              [roman.leventov at
> > > c.metamarkets
> > > > > dot
> > > > > > > com]
> > > > > > > > > >
> > > > > > > > > > * Edo Liberty                   [libertye at amazon dot
> > com]
> > > > > > > > > >
> > > > > > > > > > * Jon Malkin                    [jmalkin at verizonmedia
> > dot
> > > > com]
> > > > > > > > > >
> > > > > > > > > > * Lee Rhodes                  [lrhodes at verizonmedia
> dot
> > > > com] &
> > > > > > > > [leerho
> > > > > > > > > > at gmail dot com]
> > > > > > > > > >
> > > > > > > > > > * Alexander Saydakov         [saydakov at verizonmedia
> dot
> > > com]
> > > > > > > > > >
> > > > > > > > > > * Justin Thaler                 [justin.thaler at
> > georgetown
> > > > dot
> > > > > edu]
> > > > > > > > > >
> > > > > > > > > > == Affiliations ==
> > > > > > > > > >
> > > > > > > > > > The initial committers are from four organizations:
> Yahoo,
> > > > > Amazon,
> > > > > > > > > > Georgetown University, and Metamarkets/Snap.
> > > > > > > > > >
> > > > > > > > > > === Champion ===
> > > > > > > > > > (Recommended to me: )
> > > > > > > > > >
> > > > > > > > > > Liang Chen, Vice President of Apache CarbonData,
> > > [chenliang613
> > > > at
> > > > > > > > apache
> > > > > > > > > > dot org]
> > > > > > > > > > > Jean-Baptiste Onofré, [jb at nanthrax dot net]
> > > > > > > > > >
> > > > > > > > > > === Nominated Mentors ===
> > > > > > > > > > (Recommended to me: )
> > > > > > > > > >
> > > > > > > > > > Liang Chen, Vice President of Apache CarbonData,
> > > [chenliang613
> > > > at
> > > > > > > > apache
> > > > > > > > > > dot org]
> > > > > > > > > > Jean-Baptiste Onofré, jb at nanthrax dot net
> > > > > > > > > > Gil Yehuda, gyehuda at verizonmedia dot com
> > > > > > > > > >
> > > > > > > > > > === Sponsoring Entity ===
> > > > > > > > > >
> > > > > > > > > > * The Apache Incubator    **** This is our 1st choice
> ****
> > > > > > > > > >
> > > > > > > > > > * Apache Druid. The incubating Apache Druid project might
> > > also
> > > > > be a
> > > > > > > > logical
> > > > > > > > > > sponsor. However, DataSketches has applications in many
> > areas
> > > > of
> > > > > > > > computing
> > > > > > > > > > outside of Druid so our preference and recommendation is
> > that
> > > > > > > > DataSketches
> > > > > > > > > > would ultimately be a top-level Apache project.
> > > > > > > > > >
> > > > > > > > > > ________________
> > > > > > > > > > [1] In 2017 Verizon acquired Yahoo and merged it with
> > > > previously
> > > > > > > > acquired
> > > > > > > > > > AOL. The merged entity was originally called Oath, Inc.,
> > but
> > > > has
> > > > > > > > recently
> > > > > > > > > > been renamed Verizon Media, Inc., a wholly-owned
> subsidiary
> > > of
> > > > > > > Verizon,
> > > > > > > > > > Inc.  Since Yahoo is the more recognized name, references
> > in
> > > > this
> > > > > > > > document
> > > > > > > > > > > to Yahoo are also references to Verizon Media, Inc.
> > > > > > > > > >
> > > > > > > > > > On Fri, Feb 22, 2019 at 9:35 PM Kenneth Knowles <
> > > > kenn@apache.org
> > > > > >
> > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > > The subject line has me interested already. Follow examples like this
> > > > > > > > > > > > maybe?
> > > > > > > > > > >
> > > > > > > > > > > 1.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > >
> > > >
> > >
> >
> https://lists.apache.org/thread.html/a5db74cc9e5ae89b3bfa5f4b07bfcc18dae84b7098232fb897cd47b7@%3Cgeneral.incubator.apache.org%3E
> > > > > > > > > > > 2.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > >
> > > >
> > >
> >
> https://lists.apache.org/thread.html/5a7f6a218b11a1cac61fbd53f4c995fd7716f8ad3751cf9f171ebd57@%3Cgeneral.incubator.apache.org%3E
> > > > > > > > > > >
> > > > > > > > > > > Kenn
> > > > > > > > > > >
> > > > > > > > > > > On Fri, Feb 22, 2019 at 8:05 PM leerho <
> leerho@gmail.com
> > >
> > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > I'll try again ... :)
> > > > > > > > > > > >
> > > > > > > > > > > > On Fri, Feb 22, 2019 at 8:00 PM Ted Dunning <
> > > > > > > ted.dunning@gmail.com
> > > > > > > > >
> > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > >> It didn't make it again
> > > > > > > > > > > >>
> > > > > > > > > > > >> On Fri, Feb 22, 2019, 8:35 PM leerho <
> > leerho@gmail.com>
> > > > > wrote:
> > > > > > > > > > > >>
> > > > > > > > > > > >> > I'm not sure the attached document made it
> through.
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > On Fri, Feb 22, 2019 at 7:28 PM leerho <
> > > > leerho@gmail.com>
> > > > > > > > wrote:
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> >
> > > > > > > > > > > >>
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > >
> > > > ---------------------------------------------------------------------
> > > > > > > > > > > > To unsubscribe, e-mail:
> > > > > general-unsubscribe@incubator.apache.org
> > > > > > > > > > > > For additional commands, e-mail:
> > > > > > > general-help@incubator.apache.org
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > ---------------------------------------------------------------------
> > > > > > > > To unsubscribe, e-mail:
> > general-unsubscribe@incubator.apache.org
> > > > > > > > For additional commands, e-mail:
> > > general-help@incubator.apache.org
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > --
> > > > > > From my cell phone.
> > > > > >
> > > > >
> > > > >
> ---------------------------------------------------------------------
> > > > > To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> > > > > For additional commands, e-mail: general-help@incubator.apache.org
> > > > >
> > > > >
> > > >
> > >
> >
> --
> From my cell phone.
>

Re: DataSketches Proposal - Google Docs Link

Posted by leerho <le...@gmail.com>.

Yes, I thought of that.  But it’s not like I’m being overwhelmed with
requests to comment ... so far there have been only 3 or 4, and the requested
changes have been minor.  I’m assuming that, if there are no more
substantive changes after this week, the document would be moved to the
wiki archive, where, I presume, changes could still be made.

I want to do the right thing here, so if you feel that the document would
get much better feedback on an unrestricted gDoc site, I will set it up.



On Mon, Feb 25, 2019 at 8:32 PM Jim Apple <jb...@cloudera.com.invalid>
wrote:

> You could use a Google account that is not under Yahoo’s control, then let
> anyone in the world add a comment, maybe.
>
> On Mon, Feb 25, 2019 at 3:26 PM leerho <le...@gmail.com> wrote:
>
> > Ken,
> > Yahoo does not allow me to create a shared link outside our company,
> except
> > to individual email addresses.  So attempting to share it to the email
> > general@incubator.apache.org may not work.  Nonetheless, several
> > individuals were able to request access using their individual email
> > accounts and I was able to add them.  I will try to add you using
> > kenn@apache.org, but if that doesn't work, I may need a gmail or
> > equivalent
> > account for you.
> >
> > Lee.
> >
> >
> > On Mon, Feb 25, 2019 at 2:59 PM Kenneth Knowles <ke...@apache.org> wrote:
> >
> > > I could not access that document. I suggest you need to turn on link
> > > sharing.
> > >
> > > Kenn
> > >
> > > On Mon, Feb 25, 2019 at 12:00 PM leerho@gmail.com <le...@gmail.com>
> > > wrote:
> > >
> > > > Try this link:
> > > >
> > >
> >
> https://docs.google.com/document/d/19JKevzFQNcaLA51LFLUlP1hzdFDW7oDJrJO8N6weDv8/edit?usp=sharing
> > > >
> > > >
> > > > On 2019/02/25 05:55:50, leerho <le...@gmail.com> wrote:
> > > > > Yes I will try that tomorrow.
> > > > >
> > > > > On Sun, Feb 24, 2019 at 7:34 PM Kenneth Knowles <ke...@apache.org>
> > > wrote:
> > > > >
> > > > > > Can you share the Google doc with the proposal? Per Ted's advice,
> > we
> > > > can
> > > > > > iterate quickly there and move it to the wiki when it becomes a
> bit
> > > > more
> > > > > > stable.
> > > > > >
> > > > > > Kenn
> > > > > >
> > > > > > On Fri, Feb 22, 2019 at 10:21 PM leerho@gmail.com <
> > leerho@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > > Thanks for the offer.  I am a neophyte at this process and
> email
> > > > app!   I
> > > > > > > could use a lot of help getting this off the ground!  Also, I'm
> > not
> > > > sure
> > > > > > > that Mr. Chen and Mr. Onofré have fully accepted taking this on
> > :)
> > > > > > >
> > > > > > > Lee.
> > > > > > >
> > > > > > > On 2019/02/23 06:03:58, Kenneth Knowles <ke...@apache.org>
> wrote:
> > > > > > > > Nice.
> > > > > > > >
> > > > > > > > I would very much like to help mentor this project, though
> you
> > > > already
> > > > > > > have
> > > > > > > > a couple good ones.
> > > > > > > >
> > > > > > > > I concur with incubator as sponsoring entity.
> > > > > > > >
> > > > > > > > Kenn (VP Apache Beam)
> > > > > > > >
> > > > > > > > On Fri, Feb 22, 2019 at 9:45 PM leerho <le...@gmail.com>
> > wrote:
> > > > > > > >
> > > > > > > > > I didn't realize that this mail list does not accept PDF
> > files,
> > > > > > > apparently
> > > > > > > > > only text.  So let me try one more time ... :)  Please let
> me
> > > > know if
> > > > > > > > > this works!
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > = Apache DataSketches Proposal[1] =
> > > > > > > > >
> > > > > > > > > == Abstract ==
> > > > > > > > >
> > > > > > > > > DataSketches.GitHub.io is an open source, high-performance
> > > > library
> > > > > > of
> > > > > > > > > stochastic streaming algorithms commonly called "sketches"
> in
> > > the
> > > > > > data
> > > > > > > > > sciences. Sketches are small, stateful programs that
> process
> > > > massive
> > > > > > > data
> > > > > > > > > as a stream and can provide approximate answers, with
> > > > mathematical
> > > > > > > > > guarantees, to computationally difficult queries
> > > > orders-of-magnitude
> > > > > > > faster
> > > > > > > > > than traditional, exact methods.
> > > > > > > > >
> > > > > > > > > This proposal is to move DataSketches to the Apache
> Software
> > > > > > > > > Foundation(ASF) transferring ownership of its copyright
> > > > intellectual
> > > > > > > > > property to the ASF.  Thereafter, DataSketches would be
> > > > officially
> > > > > > > known as
> > > > > > > > > Apache DataSketches and its evolution and governance would
> > come
> > > > under
> > > > > > > the
> > > > > > > > > rules and guidance of the ASF.
> > > > > > > > >
> > > > > > > > > == Introduction ==
> > > > > > > > >
> > > > > > > > > The DataSketches library contains carefully crafted
> > > > implementations
> > > > > > of
> > > > > > > > > sketch algorithms that meet rigorous standards of quality
> and
> > > > > > > performance
> > > > > > > > > and provide capabilities required for large-scale
> production
> > > > systems
> > > > > > > that
> > > > > > > > > must process and analyze massive data. The DataSketches
> core
> > > > > > > repository is
> > > > > > > > > written in Java with a parallel core repository written in
> > C++
> > > > that
> > > > > > > > > includes Python wrappers. The DataSketches library also
> > > includes
> > > > > > > special
> > > > > > > > > repositories for extending the core library for Apache Hive
> > and
> > > > > > Apache
> > > > > > > Pig.
> > > > > > > > > The sketches developed in the different languages share a
> > > common
> > > > > > binary
> > > > > > > > > storage format so that sketches created and stored in Java,
> > for
> > > > > > > example,
> > > > > > > > > > can be fully used in C++, and vice versa.  Because the
> stored
> > > > sketch
> > > > > > > > > "images" are just a "blob" of bytes (similar to picture
> > > images),
> > > > they
> > > > > > > can
> > > > > > > > > be shared across many different systems, languages and
> > > platforms.
> > > > > > > > >
> > > > > > > > > The DataSketches documentation website,
> > > > > > https://datasketches.github.io
> > > > > > > ,
> > > > > > > > > includes general tutorials, a comprehensive research
> section
> > > with
> > > > > > > > > references to relevant academic papers, extensive examples
> > for
> > > > using
> > > > > > > the
> > > > > > > > > core library directly as well as examples for accessing the
> > > > library
> > > > > > in
> > > > > > > > > Hive, Pig, and Apache Spark.
> > > > > > > > >
> > > > > > > > > The DataSketches library also includes a characterization
> > > > repository
> > > > > > > for
> > > > > > > > > long running test programs that are used for studying
> > accuracy
> > > > and
> > > > > > > > > performance of these sketches over wide ranges of input
> > > > variables.
> > > > > > The
> > > > > > > data
> > > > > > > > > produced by these programs is used for generating the many
> > > > > > performance
> > > > > > > > > plots contained in the documentation website and for
> academic
> > > > > > > > > publications.
> > > > > > > > >
> > > > > > > > > The code repositories used for production are versioned and
> > > > published
> > > > > > > to
> > > > > > > > > Maven Central on periodic intervals as the library evolves.
> > > > > > > > >
> > > > > > > > > The DataSketches library also includes several experimental
> > > > > > > repositories
> > > > > > > > > for use-cases outside the large-scale systems environments,
> > > such
> > > > as
> > > > > > > > > sketches for mobile, IoT devices (Android), command-line
> > access
> > > > of
> > > > > > the
> > > > > > > > > sketch library, and an experimental repository for
> > vector-based
> > > > > > > sketches
> > > > > > > > > that performs approximate Singular Value Decomposition
> (SVD)
> > > > analysis
> > > > > > > that
> > > > > > > > > could potentially be used in Machine Learning (ML)
> > > applications.
> > > > > > > > >
> > > > > > > > > == Background ==
> > > > > > > > >
> > > > > > > > > > The DataSketches library was started in 2012 as an internal
> > Yahoo
> > > > > > project
> > > > > > > to
> > > > > > > > > dramatically reduce time and resources required for
> distinct
> > > > (unique)
> > > > > > > > > counting.  An extensive search on the Internet at the time
> > > > yielded a
> > > > > > > number
> > > > > > > > > of theoretical papers on stochastic streaming algorithms
> with
> > > > > > > pseudocode
> > > > > > > > > examples, but we did not find any usable open-source code
> of
> > > the
> > > > > > > quality we
> > > > > > > > > felt we needed for our internal production systems.  So we
> > > > started a
> > > > > > > small
> > > > > > > > > project (one person) to develop our own sketches working
> > > directly
> > > > > > from
> > > > > > > > > published theoretical papers.
> > > > > > > > >
> > > > > > > > > The DataSketches library was designed from the start with
> the
> > > > > > > objective of
> > > > > > > > > making these algorithms, usually only described in
> > theoretical
> > > > > > papers,
> > > > > > > > > easily accessible to systems developers for use in our
> > internal
> > > > > > > production
> > > > > > > > > systems. By necessity, the code had to be of the highest
> > > quality
> > > > and
> > > > > > > > > thoroughly tested. The wide variety of our internal
> > production
> > > > > > systems
> > > > > > > > > drove the requirement that the sketch implementations had
> to
> > > > have an
> > > > > > > > > absolute minimum of external, run-time dependencies in
> order
> > to
> > > > > > > simplify
> > > > > > > > > integration and troubleshooting.
> > > > > > > > >
> > > > > > > > > Our internal experiments demonstrated dramatic positive
> > impact
> > > > on the
> > > > > > > > > performance of our systems.  As a result, the DataSketches
> > > > library
> > > > > > > quickly
> > > > > > > > > evolved to include different types of sketches for
> different
> > > > types of
> > > > > > > > > > queries, such as frequent-items (a.k.a. heavy-hitters)
> > > > algorithms,
> > > > > > > > > quantile/histogram algorithms, and weighted and unweighted
> > > > sampling
> > > > > > > > > algorithms.
> > > > > > > > >
> > > > > > > > > We quickly discovered that developing these sketch
> algorithms
> > > to
> > > > be
> > > > > > > truly
> > > > > > > > > robust in production environments is quite difficult and
> > > requires
> > > > > > deep
> > > > > > > > > understanding of the underlying mathematics and statistics
> as
> > > > well as
> > > > > > > > > extensive experience in developing high quality code for
> 24/7
> > > > > > > production
> > > > > > > > > systems. This is a difficult combination of skills for any
> > one
> > > > > > > organization
> > > > > > > > > to collect and maintain over time. It became clear that
> this
> > > > > > technology
> > > > > > > > > needed a community larger than Yahoo to evolve.  In
> November,
> > > > 2015,
> > > > > > > this
> > > > > > > > > factor, along with Yahoo’s strong experience and support of
> > > open
> > > > > > > source,
> > > > > > > > > led to the decision to open source this technology under an
> > > > Apache
> > > > > > 2.0
> > > > > > > > > license on GitHub. Since that time our community has
> expanded
> > > > > > > considerably
> > > > > > > > > > and the key contributors to this effort include leading
> > > research
> > > > > > > > > scientists from a number of universities as well as
> > > > practitioners and
> > > > > > > > > researchers from a number of major corporations. The core
> of
> > > this
> > > > > > > group is
> > > > > > > > > very active as we meet weekly to discuss research
> directions
> > > and
> > > > > > > > > engineering priorities.
> > > > > > > > >
> > > > > > > > > It is important to note that our internal systems at Yahoo
> > use
> > > > the
> > > > > > > current
> > > > > > > > > public GitHub open source DataSketches library and not an
> > > > internal
> > > > > > > version
> > > > > > > > > of the code.
> > > > > > > > >
> > > > > > > > > The close collaboration of scientific research and
> > engineering
> > > > > > > development
> > > > > > > > > experience with actual massive-data processing systems has
> > also
> > > > > > > produced
> > > > > > > > > new research publications in the field of stochastic
> > streaming
> > > > > > > algorithms,
> > > > > > > > > for example:
> > > > > > > > >
> > > > > > > > > * Daniel Anderson, Pryce Bevan, Kevin J. Lang, Edo Liberty,
> > Lee
> > > > > > > Rhodes, and
> > > > > > > > > Justin Thaler. A high-performance algorithm for identifying
> > > > frequent
> > > > > > > items
> > > > > > > > > in data streams. In ACM IMC 2017.
> > > > > > > > >
> > > > > > > > > * Anirban Dasgupta, Kevin J. Lang, Lee Rhodes, and Justin
> > > > Thaler. A
> > > > > > > > > framework for estimating stream expression cardinalities.
> In
> > > > > > EDBT/ICDT
> > > > > > > > > > Proceedings ‘16, pages 6:1–6:17, 2016.
> > > > > > > > >
> > > > > > > > > * Mina Ghashami, Edo Liberty, Jeff M. Phillips. Efficient
> > > > Frequent
> > > > > > > > > Directions Algorithm for Sparse Matrices. In ACM SIGKDD
> > > > Proceedings
> > > > > > > ‘16,
> > > > > > > > > pages 845-854, 2016.
> > > > > > > > >
> > > > > > > > > * Zohar S. Karnin, Kevin J. Lang, and Edo Liberty. Optimal
> > > > quantile
> > > > > > > > > approximation in streams. In IEEE FOCS Proceedings ‘16,
> pages
> > > > 71–78,
> > > > > > > 2016.
> > > > > > > > >
> > > > > > > > > * Kevin J Lang. Back to the future: an even more nearly
> > optimal
> > > > > > > cardinality
> > > > > > > > > estimation algorithm. arXiv preprint
> > > > > > https://arxiv.org/abs/1708.06839,
> > > > > > > > > 2017.
> > > > > > > > >
> > > > > > > > > * Edo Liberty. Simple and deterministic matrix sketching.
> In
> > > ACM
> > > > KDD
> > > > > > > > > > Proceedings ‘13, pages 581–588, 2013.
> > > > > > > > >
> > > > > > > > > * Edo Liberty, Michael Mitzenmacher, Justin Thaler, and
> > > Jonathan
> > > > > > > Ullman.
> > > > > > > > > Space lower bounds for itemset frequency sketches. In ACM
> > PODS
> > > > > > > Proceedings
> > > > > > > > > ‘16, pages 441–454, 2016.
> > > > > > > > >
> > > > > > > > > * Michael Mitzenmacher, Thomas Steinke, and Justin Thaler.
> > > > > > Hierarchical
> > > > > > > > > heavy hitters with the space saving algorithm. In SIAM
> ALENEX
> > > > > > > Proceedings
> > > > > > > > > ‘12, pages 160–174, 2012.
> > > > > > > > >
> > > > > > > > > == The Rationale for Sketches ==
> > > > > > > > >
> > > > > > > > > In the analysis of big data there are often problem queries
> > > that
> > > > > > don’t
> > > > > > > > > scale because they require huge compute resources and time
> to
> > > > > > generate
> > > > > > > > > exact results. Examples include count distinct, quantiles,
> > most
> > > > > > > frequent
> > > > > > > > > items, joins, matrix computations, and graph analysis.
> > > > > > > > >
> > > > > > > > > If we can loosen the requirement of “exact” results from
> our
> > > > queries
> > > > > > > and be
> > > > > > > > > satisfied with approximate results, within some well
> > understood
> > > > > > bounds
> > > > > > > of
> > > > > > > > > error, there is an entire branch of mathematics and data
> > > science
> > > > that
> > > > > > > has
> > > > > > > > > evolved around developing algorithms that can produce
> > > approximate
> > > > > > > results
> > > > > > > > > with mathematically well-defined error properties.
> > > > > > > > >
> > > > > > > > > > Adding the requirements that these algorithms must
> > be
> > > > small
> > > > > > > > > (compared to the size of the input data), sublinear (the
> size
> > > of
> > > > the
> > > > > > > sketch
> > > > > > > > > must grow at a slower rate than the size of the input
> > stream),
> > > > > > > streaming
> > > > > > > > > (they can only touch each data item once), and mergeable
> > > > (suitable
> > > > > > for
> > > > > > > > > distributed processing), defines a class of algorithms that
> > can
> > > > be
> > > > > > > > > described as small, stochastic, streaming, sublinear
> > mergeable
> > > > > > > algorithms,
> > > > > > > > > commonly called sketches (they also have other names, but
> we
> > > > will use
> > > > > > > the
> > > > > > > > > term sketches from here on).
> > > > > > > > >
> > > > > > > > > To be truly streaming and be able to process data in a
> single
> > > > pass,
> > > > > > > > > sketches must make absolute minimum assumptions about the
> > input
> > > > > > stream.
> > > > > > > > > This is critically important, as there is no “second
> chance”
> > to
> > > > > > > process the
> > > > > > > > > data.
> > > > > > > > >
> > > > > > > > > For example, sketches should not make assumptions about the
> > > > order of
> > > > > > > stream
> > > > > > > > > items, the stream length, the dynamic range of values, or
> the
> > > > > > > distribution
> > > > > > > > > of item occurrence frequencies. Sketches should be tolerant
> > of
> > > > NaNs,
> > > > > > > Nulls
> > > > > > > > > and empty objects. About the only thing that the sketch
> needs
> > > to
> > > > know
> > > > > > > about
> > > > > > > > > the stream is how to extract items from it and what type
> the
> > > > item is,
> > > > > > > e.g.,
> > > > > > > > > is it a numeric value or a string.
> > > > > > > > >
> > > > > > > > > As far as the sketch is concerned, the input stream is a
> > > > sequence of
> > > > > > > items
> > > > > > > > > in some unknown random order with unknown random values.
> > > > > > > > >
> > > > > > > > > The sketch is essentially a complex state machine and
> > combined
> > > > with
> > > > > > the
> > > > > > > > > random input stream defines a stochastic process. We then
> > apply
> > > > > > > > > probabilistic methods to interpret the states of the
> > stochastic
> > > > > > > process in
> > > > > > > > > order to extract useful information about the input stream
> > > > itself.
> > > > > > The
> > > > > > > > > resulting information will be approximate, but we also use
> > > > additional
> > > > > > > > > probabilistic methods to extract an estimate of the likely
> > > > > > probability
> > > > > > > > > distribution of error.
> > > > > > > > >
> > > > > > > > > There is a significant scientific contribution here that is
> > > > defining
> > > > > > > the
> > > > > > > > > state machine, understanding the resulting stochastic
> > process,
> > > > > > > developing
> > > > > > > > > the probabilistic methods, and proving mathematically, that
> > it
> > > > all
> > > > > > > works!
> > > > > > > > > This is why the scientific contributors to this project
> are a
> > > > > > critical
> > > > > > > and
> > > > > > > > > strategic component to our success.  The development
> > engineers
> > > > > > > translate
> > > > > > > > > the concepts of the proposed state machine and
> probabilistic
> > > > methods
> > > > > > > into
> > > > > > > > > production-quality code. Even more important, they work
> > closely
> > > > with
> > > > > > > the
> > > > > > > > > scientists, feeding back system and user requirements,
> which
> > > > leads
> > > > > > not
> > > > > > > only
> > > > > > > > > to superior product design, but to new science as well.  A
> > > > number of
> > > > > > > > > scientific papers our members have published (see above)
> are a
> > > > direct
> > > > > > > result
> > > > > > > > > of this close collaboration.
> > > > > > > > >
> > > > > > > > > Because sketches are small they can be processed extremely
> > > fast,
> > > > > > often
> > > > > > > many
> > > > > > > > > orders-of-magnitude faster than traditional exact
> > computations.
> > > > For
> > > > > > > > > interactive queries there may not be other viable
> > alternatives,
> > > > and
> > > > > > in
> > > > > > > the
> > > > > > > > > case of real-time analysis, sketches are the only known
> > > solution.
> > > > > > > > >
> > > > > > > > > For any system that needs to extract useful information
> from
> > > > massive
> > > > > > > data,
> > > > > > > > > sketches are essential tools that should be tightly
> > integrated
> > > > into
> > > > > > the
> > > > > > > > > system’s analysis capabilities. This technology has helped
> > > Yahoo
> > > > > > > > > successfully reduce data processing times from days to
> hours
> > or
> > > > > > > minutes on
> > > > > > > > > a number of its internal platforms and has enabled
> subsecond
> > > > queries
> > > > > > on
> > > > > > > > > real-time platforms that would have been infeasible without
> > > > sketches.
> > > > > > > > > >
> > > > > > > > > > == The Rationale for Apache DataSketches ==
> > > > > > > > > >
> > > > > > > > > Other open source implementations of sketch algorithms can
> be
> > > > found
> > > > > > on
> > > > > > > the
> > > > > > > > > Internet. However, we have not yet found any open source
> > > > > > > implementations
> > > > > > > > > that are as comprehensive, engineered with the quality
> > required
> > > > for
> > > > > > > > > production systems, and with usable and guaranteed error
> > > > properties.
> > > > > > > Large
> > > > > > > > > Internet companies, such as Google and Facebook, have
> > published
> > > > > > papers
> > > > > > > on
> > > > > > > > > > sketching; however, their implementations of their
> published
> > > > > > > algorithms are
> > > > > > > > > proprietary and not available as open source.
> > > > > > > > >
> > > > > > > > > The DataSketches library already provides integrations
> with a
> > > > number
> > > > > > of
> > > > > > > > > major Apache data processing platforms such as Apache Hive,
> > > > Apache
> > > > > > Pig,
> > > > > > > > > Apache Spark and Apache Druid, and is also integrated with
> a
> > > > number
> > > > > > of
> > > > > > > > > other open source data processing platforms such as Splice
> > > > Machine,
> > > > > > > GCHQ
> > > > > > > > > Gaffer and PostgreSQL.
> > > > > > > > >
> > > > > > > > > We believe that having DataSketches as an Apache project
> will
> > > > provide
> > > > > > > an
> > > > > > > > > immediate, worthwhile, and substantial contribution to the
> > open
> > > > > > source
> > > > > > > > > community, will have a better opportunity to provide a
> > > meaningful
> > > > > > > > > contribution to both the science and engineering of
> sketching
> > > > > > > algorithms,
> > > > > > > > > and integrate with other Apache projects.  In addition,
> this
> > > is a
> > > > > > > > > significant opportunity for Apache to be the "go-to"
> > > destination
> > > > for
> > > > > > > users
> > > > > > > > > that want to leverage this exciting technology.
> > > > > > > > >
> > > > > > > > > == Initial Goals ==
> > > > > > > > >
> > > > > > > > > We are breaking our initial goals into short-term (2-6
> > months)
> > > > and
> > > > > > > > > > intermediate to long-term (6 months to 2 years):
> > > > > > > > >
> > > > > > > > > Our short-term goals include:
> > > > > > > > >
> > > > > > > > > * Understanding and adapting to the Apache development
> > process
> > > > and
> > > > > > > > > structures.
> > > > > > > > >
> > > > > > > > > > * Start refactoring the codebase and move the code of the
> > > > > > > > > > various DataSketches repositories to an Apache Git repository.
> > > > > > > > >
> > > > > > > > > * Continue development of new features, functions, and
> fixes.
> > > > > > > > >
> > > > > > > > > * Specific sub-projects (e.g., C++ and Python) will
> continue
> > to
> > > > be
> > > > > > > > > developed and expanded.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > The intermediate to long-term goals include:
> > > > > > > > >
> > > > > > > > > * Completing the design and implementation of the C++
> > sketches
> > > to
> > > > > > > > > complement what is already available in Java, and the
> Python
> > > > wrappers
> > > > > > > of
> > > > > > > > > those C++ sketches.
> > > > > > > > >
> > > > > > > > > * Expanding the C++ build framework to include Windows and
> > the
> > > > > > popular
> > > > > > > > > Linux variants.
> > > > > > > > >
> > > > > > > > > * Continued engagement with the scientific research
> community
> > > on
> > > > the
> > > > > > > > > development of new algorithms for computationally difficult
> > > > problems
> > > > > > > that
> > > > > > > > > heretofore have not had a sketching solution.
> > > > > > > > >
> > > > > > > > > == Current Status ==
> > > > > > > > >
> > > > > > > > > The DataSketches GitHub project has been quite successful.
> > As
> > > of
> > > > > > this
> > > > > > > > > writing (Feb, 2019) the number of downloads measured by the
> > > Nexus
> > > > > > > > > Repository Manager at https://oss.sonatype.org has grown
> by
> > > > nearly a
> > > > > > > > > factor
> > > > > > > > > of 10 over the past year to about 55 thousand per month.
> The
> > > > > > > > > DataSketches/sketches-core repository has about 560 stars
> and
> > > 141
> > > > > > > forks,
> > > > > > > > > which is pretty good for a highly specialized library.
> > > > > > > > >
> > > > > > > > > === Development Practices ===
> > > > > > > > >
> > > > > > > > > ==== Source Control ====
> > > > > > > > >
> > > > > > > > > All of our developers have extensive experience with Git
> > > version
> > > > > > > control
> > > > > > > > > and follow accepted practices for use of Pull Requests
> (PRs),
> > > > code
> > > > > > > reviews
> > > > > > > > > and commits to master, for example.
> > > > > > > > >
> > > > > > > > > ==== Testing ====
> > > > > > > > >
> > > > > > > > > > Sketches, by their nature, are probabilistic programs and
> > don’t
> > > > > > > necessarily
> > > > > > > > > behave deterministically.  For some of the sketches we
> > > > intentionally
> > > > > > > insert
> > > > > > > > > random noise into the code as this gives us the
> mathematical
> > > > > > properties
> > > > > > > > > that we need to guarantee accuracy.  This can make the
> > behavior
> > > > of
> > > > > > > these
> > > > > > > > > algorithms quite unintuitive and provides significant
> > > challenges
> > > > to
> > > > > > the
> > > > > > > > > developer who wishes to test these algorithms for
> > correctness.
> > > > As a
> > > > > > > result,
> > > > > > > > > our testing strategy includes two major components: unit
> > tests,
> > > > and
> > > > > > > > > characterization tests.
> > > > > > > > >
> > > > > > > > > ===== Unit Testing =====
> > > > > > > > >
> > > > > > > > > Our unit tests are primarily quick tests to make sure that
> we
> > > > > > exercise
> > > > > > > all
> > > > > > > > > critical paths in the code and that key branches are
> executed
> > > > > > > correctly. It
> > > > > > > > > is important that they execute relatively fast as they are
> > > > generally
> > > > > > > run on
> > > > > > > > > every code build. The sketches-core repository alone has
> > about
> > > 22
> > > > > > > thousand
> > > > > > > > > statements, over 1300 unit tests and code coverage of about
> > > > 98.2% as
> > > > > > > > > measured by Atlassian/Clover.  It is our goal for all of
> our
> > > code
> > > > > > > > > repositories that are used in production that they have
> code
> > > > coverage
> > > > > > > > > greater than 90%.
> > > > > > > > >
> > > > > > > > > ===== Characterization Testing =====
> > > > > > > > >
> > > > > > > > > In order to test the probabilistic methods that are used to
> > > > interpret
> > > > > > > the
> > > > > > > > > stochastic behaviors of our sketches we have a separate
> > > > > > > characterization
> > > > > > > > > repository that is dedicated to this.  To measure accuracy,
> > for
> > > > > > > example,
> > > > > > > > > requires running thousands of trials at each of many
> > different
> > > > points
> > > > > > > along
> > > > > > > > > the domain axis. Each trial compares its estimated results
> > > > against a
> > > > > > > known
> > > > > > > > > exact result producing an error for that trial.  These
> error
> > > > > > > measurements
> > > > > > > > > are then fed into our Quantiles sketch to capture the
> actual
> > > > > > > distribution
> > > > > > > > > of error at that point along the axis. We then select
> > quantile
> > > > > > contours
> > > > > > > > > across all the distributions at points along the axis.
> These
> > > > > > contours
> > > > > > > can
> > > > > > > > > then be plotted to reveal the shape of the actual error
> > > > distribution.
> > > > > > > These
> > > > > > > > > > distributions are not at all Gaussian; in fact, they can be
> > > quite
> > > > > > > complex.
> > > > > > > > > Nonetheless, these distributions are then checked against
> our
> > > > > > > statistical
> > > > > > > > > guarantees inherent to the specific sketch algorithm and
> its
> > > > > > > parameters.
> > > > > > > > > There are many examples of these characterization error
> > > > distributions
> > > > > > > on
> > > > > > > > > our website. The runtimes of these tests can be very long
> and
> > > can
> > > > > > range
> > > > > > > > > from many minutes to hours, and some can run for days.
> > > > Currently, we
> > > > > > > have
> > > > > > > > > separate characterization repositories for Java and C++ /
> > > Python.
> > > > > > > > >
> > > > > > > > > It is our goal that we perform this characterization
> analysis
> > > > for all
> > > > > > > of
> > > > > > > > > our sketches.  By definition, the code that runs these
> > > > > > characterization
> > > > > > > > > tests is open-source so others can run these tests as well.
> > We
> > > > do
> > > > > > not
> > > > > > > have
> > > > > > > > > formal releases of this code (because it is not production
> > > code)
> > > > and
> > > > > > > it is
> > > > > > > > > not published to Maven Central.
> > > > > > > > >
> > > > > > > > > === Meritocracy ===
> > > > > > > > >
> > > > > > > > > DataSketches was initially developed based on requirements
> > > within
> > > > > > > Yahoo. As
> > > > > > > > > a project on GitHub, DataSketches has received
> contributions
> > > from
> > > > > > > numerous
> > > > > > > > > individual developers from around the world, dedicated
> > research
> > > > work
> > > > > > > from
> > > > > > > > > senior scientists at Amazon and Visa, and academic
> > researchers
> > > > from
> > > > > > > > > Georgetown University, Princeton, and MIT.
> > > > > > > > >
> > > > > > > > > As a project under incubation, we are committed to
> expanding
> > > our
> > > > > > > effort to
> > > > > > > > > build an environment which supports a meritocracy. We are
> > > > focused on
> > > > > > > > > engaging the community and other related projects for
> support
> > > and
> > > > > > > > > > contributions. Moreover, we are committed to ensuring that
> > > contributors
> > > > and
> > > > > > > > > committers to DataSketches come from a broad mix of
> > > organizations
> > > > > > > through a
> > > > > > > > > merit-based decision process during incubation. We believe
> > > > strongly
> > > > > > in
> > > > > > > the
> > > > > > > > > DataSketches premise that fulfills the concept of a well
> > > > engineered
> > > > > > and
> > > > > > > > > scientifically rigorous library that implements these
> > powerful
> > > > > > > algorithms
> > > > > > > > > and are committed to growing an inclusive community of
> > > > DataSketches
> > > > > > > > > contributors and users.
> > > > > > > > >
> > > > > > > > > === Community ===
> > > > > > > > >
> > > > > > > > > > Yahoo has a long history of active engagement in the Open
> > > Source
> > > > > > > > > community. Major projects include: Vespa.ai, Bullet,
> Moloch,
> > > > > > Panoptes,
> > > > > > > > > Screwdriver.cd, Athenz, HaloDB, Maha, Mendel,
> > > TensorFlowOnSpark,
> > > > > > > gifshot,
> > > > > > > > > fluxible, as well as the creation, contribution and
> > incubation
> > > of
> > > > > > many
> > > > > > > > > Apache projects such as Apache Hadoop, Pig, Bookkeeper,
> > Oozie,
> > > > > > > Zookeeper,
> > > > > > > > > Omid, Pulsar, Traffic Server, Storm, Druid, and many more.
> > > > > > > > >
> > > > > > > > > > Every day, DataSketches is actively used by organizations
> > and
> > > > > > > > > institutions around the world for batch and stream
> processing
> > > of
> > > > > > data.
> > > > > > > We
> > > > > > > > > believe acceptance will allow us to consolidate existing
> > > > > > > > > DataSketches-related work, grow the DataSketches community,
> > and
> > > > > > deepen
> > > > > > > > > connections between DataSketches and other open source
> > > projects.
> > > > > > > > >
> > > > > > > > > === Introduction to the Core Developers & Contributors ===
> > > > > > > > >
> > > > > > > > > The core developers and contributors for DataSketches are
> > from
> > > > > > diverse
> > > > > > > > > backgrounds, but primarily are scientists that love
> > engineering
> > > > and
> > > > > > > > > engineers that love science. A large part of the value we
> > bring
> > > > comes
> > > > > > > from
> > > > > > > > > this synthesis.  These individuals have already contributed
> > > > > > > substantially
> > > > > > > > > to the code, algorithms, and/or mathematical proofs that
> form
> > > the
> > > > > > > basis of
> > > > > > > > > the library.
> > > > > > > > >
> > > > > > > > > > This core group also forms the Initial Committers with write
> > > > > > > permissions to
> > > > > > > > > > the repository. Those marked with (*) meet weekly to plan
> the
> > > > > > research
> > > > > > > and
> > > > > > > > > engineering direction of the project.
> > > > > > > > >
> > > > > > > > > ==== Scientists That Love Engineering ====
> > > > > > > > >
> > > > > > > > > * Eshcar Hillel: Senior Research Scientist, Yahoo Labs,
> > Israel.
> > > > > > > Interests:
> > > > > > > > > distributed systems, scalable systems and platforms for big
> > > data
> > > > > > > > > > processing, concurrent algorithms and data structures.
> > > > > > > > >
> > > > > > > > > * Kevin Lang: (*) Distinguished Research Scientist, Yahoo
> > Labs,
> > > > > > > Sunnyvale,
> > > > > > > > > California. Interests: algorithms, theoretical and applied
> > > > > > mathematics,
> > > > > > > > > encoding and compression theory, theoretical and applied
> > > > performance
> > > > > > > > > optimization.
> > > > > > > > >
> > > > > > > > > * Edo Liberty: (*) Director of Research, Head of Amazon AI
> > > Labs,
> > > > Palo
> > > > > > > Alto,
> > > > > > > > > California. Manages the algorithms group at Amazon AI. We
> > build
> > > > > > > scalable
> > > > > > > > > machine learning systems and algorithms which are used both
> > > > > > internally
> > > > > > > and
> > > > > > > > > externally by customers of SageMaker, AWS's flagship
> machine
> > > > learning
> > > > > > > > > platform.
> > > > > > > > >
> > > > > > > > > * Jon Malkin: (*) Senior Scientist, Yahoo Labs, Sunnyvale.
> > > > Interests:
> > > > > > > > > Computational advertising, machine learning, speech
> > > recognition,
> > > > > > > > > data-driven analysis, large scale experimentation, big
> data,
> > > > > > > stream/complex
> > > > > > > > > event processing
> > > > > > > > >
> > > > > > > > > * Justin Thaler: (*) Assistant Professor, Department of
> > > Computer
> > > > > > > Science,
> > > > > > > > > Georgetown University, Washington D.C. Interests:
> algorithms
> > > and
> > > > > > > > > computational complexity, complexity theory, quantum
> > > algorithms,
> > > > > > > private
> > > > > > > > > data analysis, and learning theory, developing efficient
> > > > streaming
> > > > > > and
> > > > > > > > > sketching algorithms
> > > > > > > > >
> > > > > > > > > ==== Engineers That Love Science ====
> > > > > > > > >
> > > > > > > > > * Roman Leventov: Senior Software Engineer,  Metamarkets /
> > > Snap.
> > > > > > > Interests:
> > > > > > > > > design and implementation of data storing and data
> processing
> > > > > > > (distributed)
> > > > > > > > > systems, performance optimization, CPU performance,
> > mechanical
> > > > > > > sympathy,
> > > > > > > > > JVM performance, API design, databases, (concurrent) data
> > > > structures,
> > > > > > > > > memory management, garbage collection algorithms, language
> > > > design and
> > > > > > > > > runtimes (their tradeoffs), distributed systems (cloud)
> > > > efficiency,
> > > > > > > Linux,
> > > > > > > > > code quality, code transformation, pure functional
> > programming
> > > > > > models,
> > > > > > > > > Haskell.
> > > > > > > > >
> > > > > > > > > * Lee Rhodes: (*) Distinguished Architect, lead developer
> and
> > > > founder
> > > > > > > of
> > > > > > > > > the DataSketches project, Yahoo, Sunnyvale, California.
> > > > Interests:
> > > > > > > > > streaming algorithms, mathematics, computer science, high
> > > > quality and
> > > > > > > high
> > > > > > > > > performance code for the analysis of massive data, bridging
> > the
> > > > > > divide
> > > > > > > > > between theory and practice.
> > > > > > > > >
> > > > > > > > > * Alexander Saydakov: (*) Senior Software Engineer, Yahoo,
> > > > Sunnyvale,
> > > > > > > > > California. Interests: applied mathematics, computer
> science,
> > > big
> > > > > > data,
> > > > > > > > > distributed systems.
> > > > > > > > >
> > > > > > > > > === Introduction to Additional Interested Contributors ===
> > > > > > > > >
> > > > > > > > > These folks have been intermittently involved and
> > contributed,
> > > > but
> > > > > > are
> > > > > > > > > strong supporters of this project.
> > > > > > > > >
> > > > > > > > > * Frank Grimes: GitHub ID: frankgrimes97
> > > > > > > > >
> > > > > > > > > * Mina Ghashami: [mina.ghashami at gmail dot com] Ph.D.
> > > Computer
> > > > > > > Science,
> > > > > > > > > Univ of Utah. Interests: Machine Learning, Data Mining,
> > matrix
> > > > > > > > > approximation, streaming algorithms, randomized linear
> > algebra.
> > > > > > > > >
> > > > > > > > > * Christopher Musco: [christopher.musco at gmail dot com]
> > Ph.D.
> > > > > > > Computer
> > > > > > > > > Science, Research Instructor, Princeton University.
> > Interests:
> > > > > > > algorithmic
> > > > > > > > > foundations of data science and machine learning, efficient
> > > > methods
> > > > > > for
> > > > > > > > > processing and understanding large datasets, often working
> at
> > > the
> > > > > > > > > intersection of theoretical computer science, numerical
> > linear
> > > > > > > algebra, and
> > > > > > > > > optimization.
> > > > > > > > >
> > > > > > > > > * Graham Cormode: [g.cormode at warwick.ac dot uk] Ph.D.
> > > > Computer
> > > > > > > Science,
> > > > > > > > > Professor, Warwick University, Warwick, England. Interests:
> > all
> > > > > > > aspects of
> > > > > > > > > the "data lifecycle", from data collection and cleaning,
> > > through
> > > > > > > mining and
> > > > > > > > > analytics. (Professor Cormode is one of the world’s leading
> > > > > > scientists
> > > > > > > in
> > > > > > > > > sketching algorithms)
> > > > > > > > >
> > > > > > > > > === Alignment ===
> > > > > > > > >
> > > > > > > > > The DataSketches library already provides integrations and
> > > > example
> > > > > > > code for
> > > > > > > > > Apache Hive, Apache Pig, Apache Spark and is deeply
> > integrated
> > > > into
> > > > > > > Apache
> > > > > > > > > Druid.
> > > > > > > > >
> > > > > > > > > == Known Risks ==
> > > > > > > > >
> > > > > > > > > The following subsections are specific risks that have been
> > > > > > identified
> > > > > > > by
> > > > > > > > > the ASF that need to be addressed.
> > > > > > > > >
> > > > > > > > > === Risk: Orphaned Products ===
> > > > > > > > >
> > > > > > > > > The DataSketches library is presently used by a number of
> > > > > > > organizations,
> > > > > > > > > from small startups to Fortune 100 companies, to construct
> > > > production
> > > > > > > > > pipelines that must process and analyze massive data. Yahoo
> > > has a
> > > > > > > long-term
> > > > > > > > > commitment to continue to advance the DataSketches library;
> > > > moreover,
> > > > > > > > > DataSketches is seeing increasing interest, development,
> and
> > > > adoption
> > > > > > > from
> > > > > > > > > many diverse organizations from around the world. Due to
> its
> > > > growing
> > > > > > > > > adoption, we feel it is quite unlikely that this project
> > would
> > > > become
> > > > > > > > > orphaned.
> > > > > > > > >
> > > > > > > > > === Risk: Inexperience with Open Source ===
> > > > > > > > >
> > > > > > > > > Yahoo believes strongly in open source and the exchange of
> > > > > > information
> > > > > > > to
> > > > > > > > > advance new ideas and work. Examples of this commitment are
> > > > active
> > > > > > open
> > > > > > > > > source projects such as those mentioned above. With
> > > > DataSketches, we
> > > > > > > have
> > > > > > > > > been increasingly open and forward-looking; we have
> > published a
> > > > > > number
> > > > > > > of
> > > > > > > > > papers about breakthrough developments in the science of
> > > > streaming
> > > > > > > > > algorithms (mentioned above) that also reference the
> > > DataSketches
> > > > > > > library.
> > > > > > > > > Our submission to the Apache Software Foundation is a
> logical
> > > > > > > extension of
> > > > > > > > > our commitment to open source software.
> > > > > > > > >
> > > > > > > > > Key committers at Yahoo with strong open source backgrounds
> > > > include
> > > > > > > Aaron
> > > > > > > > > Gresch, Alan Carroll, Alessandro Bellina, Anastasia
> > Braginsky,
> > > > > > Andrews
> > > > > > > > > Sahaya Albert, Arun S A G, Atul Mohan, Brad McMillen, Bryan
> > > Call,
> > > > > > Daryn
> > > > > > > > > Sharp, Dav Glass, David Carlin, Derek Dagit, Eric Payne,
> > Eshcar
> > > > > > Hillel,
> > > > > > > > > Ethan Li, Fei Deng, Francis Christopher Liu, Francisco
> > > > > > Perez-Sorrosal,
> > > > > > > Gil
> > > > > > > > > > Yehuda, Govind Menon, Hang Yang, Jacob Estelle, Jai Asher,
> > > James
> > > > > > > Penick,
> > > > > > > > > Jason Kenny, Jay Pipes, Jim Rollenhagen, Joe Francis, Jon
> > > Eagles,
> > > > > > > Kihwal
> > > > > > > > > Lee, Kishorkumar Patil, Koji Noguchi, Kuhu Shukla, Michael
> > > > Trelinski,
> > > > > > > > > Mithun Radhakrishnan, Nathan Roberts, Ohad Shacham, Olga L.
> > > > > > Natkovich,
> > > > > > > > > Parth Kamlesh Gandhi, Rajan Dhabalia, Rohini Palaniswamy,
> > Ruby
> > > > Loo,
> > > > > > > Ryan
> > > > > > > > > Bridges, Sanket Chintapalli, Satish Subhashrao Saley, Shu
> Kit
> > > > Chan,
> > > > > > Sri
> > > > > > > > > Harsha Mekala, Susan Hinrichs, Yonatan Gottesman, and many
> > > more.
> > > > > > > > >
> > > > > > > > > > All of our core developers are committed to learning about the
> > > > Apache
> > > > > > > process
> > > > > > > > > and to give back to the community.
> > > > > > > > >
> > > > > > > > > === Risk: Homogeneous Developers ===
> > > > > > > > >
> > > > > > > > > The majority of committers in this proposal belong to Yahoo
> > due
> > > > to
> > > > > > the
> > > > > > > fact
> > > > > > > > > that DataSketches has emerged from an internal Yahoo
> project.
> > > > This
> > > > > > > proposal
> > > > > > > > > also includes developers and contributors from other
> > companies,
> > > > and
> > > > > > > who are
> > > > > > > > > actively involved with other Apache projects, such as
> Druid.
> > > We
> > > > > > > expect our
> > > > > > > > > entry into incubation will allow us to expand the number of
> > > > > > > individuals and
> > > > > > > > > organizations participating in DataSketches development.
> > > > > > > > >
> > > > > > > > > === Risk: Reliance on Salaried Developers ===
> > > > > > > > >
> > > > > > > > > Because the DataSketches library originated within Yahoo,
> it
> > > has
> > > > been
> > > > > > > > > developed primarily by salaried Yahoo developers and we
> > expect
> > > > that
> > > > > > to
> > > > > > > > > continue to be the case near term. However, since we placed
> > > this
> > > > > > > library
> > > > > > > > > into open-source we have had a number of significant
> > > > contributions
> > > > > > from
> > > > > > > > > engineers and scientists from outside of Yahoo. We expect
> our
> > > > > > reliance
> > > > > > > on
> > > > > > > > > Yahoo salaried developers will decrease over time.
> > Nonetheless,
> > > > Yahoo
> > > > > > > is
> > > > > > > > > committed to continue its strong support of this important
> > > > project.
> > > > > > > > >
> > > > > > > > > === Risk: Lack of Relationship to other Apache Products ===
> > > > > > > > >
> > > > > > > > > DataSketches already directly interoperates with or
> utilizes
> > > > several
> > > > > > > > > existing Apache projects.
> > > > > > > > >
> > > > > > > > > * Build
> > > > > > > > >    * Apache Maven
> > > > > > > > >
> > > > > > > > > * Integrations and adaptors for the following projects
> > > naturally
> > > > have
> > > > > > > them
> > > > > > > > > as dependencies
> > > > > > > > >    * Apache Hive
> > > > > > > > >    * Apache Pig
> > > > > > > > >    * Apache Druid
> > > > > > > > >    * Apache Spark
> > > > > > > > >
> > > > > > > > > * Additional dependencies for the above integrations and
> > > adaptors
> > > > > > > include
> > > > > > > > >    * Apache Hadoop
> > > > > > > > >    * Apache Commons (Math)
> > > > > > > > >
> > > > > > > > > There is no other Apache project that we are aware of that
> > > > duplicates
> > > > > > > the
> > > > > > > > > functionality of the DataSketches library.
> > > > > > > > >
> > > > > > > > > === Risk: An Excessive Fascination with the Apache Brand
> ===
> > > > > > > > >
> > > > > > > > > With this proposal we are not seeking attention or
> publicity.
> > > > Rather,
> > > > > > > we
> > > > > > > > > firmly believe in the DataSketches library and concept and
> > the
> > > > > > ability
> > > > > > > to
> > > > > > > > > make the DataSketches library a powerful, yet simple-to-use
> > > > toolkit
> > > > > > for
> > > > > > > > > data processing. While the DataSketches library has been
> open
> > > > source,
> > > > > > > we
> > > > > > > > > believe putting code on GitHub can only go so far. We see
> the
> > > > Apache
> > > > > > > > > community, processes, and mission as critical for ensuring
> > the
> > > > > > > DataSketches
> > > > > > > > > library is truly community-driven, positively impactful,
> and
> > > > > > innovative
> > > > > > > > > open source software. While Yahoo has taken a number of
> steps
> > > to
> > > > > > > advance
> > > > > > > > > its various open source projects, we believe the
> DataSketches
> > > > library
> > > > > > > > > project is a great fit for the Apache Software Foundation
> due
> > > to
> > > > its
> > > > > > > focus
> > > > > > > > > on data processing and its relationships to existing ASF
> > > > projects.
> > > > > > > > >
> > > > > > > > > === Risk: Cryptography ===
> > > > > > > > >
> > > > > > > > > DataSketches does not contain any cryptographic code and is
> > > not a
> > > > > > > > > cryptographic product.
> > > > > > > > >
> > > > > > > > > == Documentation ==
> > > > > > > > >
> > > > > > > > > The following documentation is relevant to this proposal.
> > > > Relevant
> > > > > > > portions
> > > > > > > > > of the documentation will be contributed to the Apache
> > > > DataSketches
> > > > > > > > > project.
> > > > > > > > >
> > > > > > > > > * DataSketches website: https://datasketches.github.io.
> > > > > > > > >
> > > > > > > > > * DataSketches website repository:
> > > > > > > > > https://github.com/DataSketches/DataSketches.github.io
> > > > > > > > >
> > > > > > > > > > We will need an Apache website for this documentation
> similar
> > > to
> > > > > > > > >
> > > > > > > > > * https://datasketches.apache.org
> > > > > > > > >
> > > > > > > > > == Initial Source ==
> > > > > > > > >
> > > > > > > > > The initial source for DataSketches which we will submit to
> > the
> > > > > > Apache
> > > > > > > > > Foundation will include a number of repositories which are
> > > > currently
> > > > > > > hosted
> > > > > > > > > under the GitHub.com/datasketches organization:
> > > > > > > > >
> > > > > > > > > All github.com/datasketches repositories including:
> > > > > > > > >
> > > > > > > > > * Java
> > > > > > > > >    * sketches-core: This repository has the core sketching
> > > > classes,
> > > > > > > which
> > > > > > > > > are leveraged by some of the other repositories. This
> > > repository
> > > > has
> > > > > > no
> > > > > > > > > external dependencies outside of the DataSketches/memory
> > > > repository,
> > > > > > > Java
> > > > > > > > > and TestNG for unit tests. This code is versioned and the
> > > latest
> > > > > > > release
> > > > > > > > > can be obtained from Maven Central.
> > > > > > > > >    * memory: Low level, high-performance memory
> > data-structure
> > > > > > > management
> > > > > > > > > primarily for off-heap.
> > > > > > > > >    * sketches-android: This is a new repository dedicated
> to
> > > > sketches
> > > > > > > > > designed to be run in a mobile client, such as a cell
> phone.
> > It
> > > > is
> > > > > > > still in
> > > > > > > > > development and should be considered experimental.
> > > > > > > > >    * sketches-hive: This repository contains Hive UDFs and
> > > UDAFs
> > > > for
> > > > > > > use
> > > > > > > > > within Hadoop grid environments. This code has dependencies
> > on
> > > > > > > > > sketches-core as well as Hadoop and Hive. Users of this
> code
> > > are
> > > > > > > advised to
> > > > > > > > > use Maven to bring in all the required dependencies. This
> > code
> > > is
> > > > > > > versioned
> > > > > > > > > and the latest release can be obtained from Maven Central.
> > > > > > > > >    * sketches-pig: This repository contains Pig User
> Defined
> > > > > > Functions
> > > > > > > > > (UDF) for use within Hadoop grid environments. This code
> has
> > > > > > > dependencies
> > > > > > > > > on sketches-core as well as Hadoop and Pig. Users of this
> > code
> > > > are
> > > > > > > advised
> > > > > > > > > to use Maven to bring in all the required dependencies.
> This
> > > > code is
> > > > > > > > > versioned and the latest release can be obtained from Maven
> > > > Central.
> > > > > > > > >    * sketches-vector: This is a new repository dedicated to
> > > > sketches
> > > > > > > for
> > > > > > > > > vector and matrix operations. It is still somewhat
> > > experimental.
> > > > > > > > >    * characterization: This relatively new repository is
> for
> > > code
> > > > > > that
> > > > > > > we
> > > > > > > > > use to characterize the accuracy and speed performance of
> the
> > > > > > sketches
> > > > > > > in
> > > > > > > > > the library and is constantly being updated. Examples of
> the
> > > job
> > > > > > > command
> > > > > > > > > files used for various tests can be found in the
> > > > src/main/resources
> > > > > > > > > directory. Some of these tests can run for hours depending
> on
> > > their
> > > > > > > > > configuration.
> > > > > > > > >    * experimental: This repository is an experimental
> staging
> > > > area
> > > > > > for
> > > > > > > code
> > > > > > > > > that will eventually end up in another repository. This
> code
> > is
> > > > not
> > > > > > > > > versioned and not registered with Maven Central.
> > > > > > > > >    * sketches-misc: Demos and other code not related to
> > > > production
> > > > > > > > > deployment
> > > > > > > > >
> > > > > > > > > * C++ and Python
> > > > > > > > >    * sketches-core-cpp: This is the C++/Python companion to
> > the
> > > > Java
> > > > > > > > > sketches-core. These implementations are binary compatible
> > with
> > > > their
> > > > > > > > > counterparts in Java. In other words, a sketch created and
> > > > stored in
> > > > > > > C++
> > > > > > > > > > can be opened and read in Java and vice versa. This site
> also
> > > > has our
> > > > > > > > > Python adaptors that basically wrap the C++
> implementations,
> > > > making
> > > > > > the
> > > > > > > > > high performance C++ implementations available from Python.
> > > > > > > > >    * sketches-postgres: This site provides the
> > > postgres-specific
> > > > > > > adaptors
> > > > > > > > > that wrap the C++ implementations making them available to
> > the
> > > > > > Postgres
> > > > > > > > > database users.
> > > > > > > > >    * characterization-cpp: This is the C++/Python companion
> > to
> > > > the
> > > > > > Java
> > > > > > > > > characterization repository.
> > > > > > > > >    * experimental-cpp: This repository is an experimental
> > > staging
> > > > > > area
> > > > > > > for
> > > > > > > > > C++ code that will eventually end up in another repository.
> > > > > > > > >
> > > > > > > > > * Command-Line Tools
> > > > > > > > >    * sketches-cmd
> > > > > > > > >    * homebrew-sketches
> > > > > > > > >    * homebrew-sketches-cmd
> > > > > > > > >
> > > > > > > > > These projects have always been Apache 2.0 licensed. We
> > intend
> > > to
> > > > > > > bundle
> > > > > > > > > all of these repositories since they are all complementary
> > and
> > > > should
> > > > > > > be
> > > > > > > > > maintained in one project. Prior to our submission, we will
> > > > combine
> > > > > > > all of
> > > > > > > > > these projects into a new git repository.
> > > > > > > > >
> > > > > > > > > == Source and Intellectual Property Submission Plan ==
> > > > > > > > >
> > > > > > > > > Contributors to the DataSketches project have also signed
> the
> > > > Yahoo
> > > > > > > > > Individual Contributor License Agreement (
> > > > > > > https://yahoocla.herokuapp.com/)
> > > > > > > > > in order to contribute to the project.
> > > > > > > > >
> > > > > > > > > With respect to trademark rights, Yahoo does not hold a
> > > > trademark on
> > > > > > > the
> > > > > > > > > phrase “DataSketches.” Based on feedback and guidance we
> > > receive
> > > > > > > during the
> > > > > > > > > incubation process, we are open to renaming the project if
> > > > necessary
> > > > > > > for
> > > > > > > > > trademark or other concerns, but we would prefer not to
> have
> > to
> > > > do
> > > > > > > that.
> > > > > > > > >
> > > > > > > > > == External Dependencies ==
> > > > > > > > >
> > > > > > > > > All external dependencies are licensed under an Apache 2.0
> or
> > > > > > > > > Apache-compatible license. As we grow the DataSketches
> > > community
> > > > we
> > > > > > > will
> > > > > > > > > > configure our build process to require and validate that all
> > > > contributions
> > > > > > > and
> > > > > > > > > dependencies are licensed under the Apache 2.0 license or
> are
> > > > under
> > > > > > an
> > > > > > > > > Apache-compatible license.
> > > > > > > > >
> > > > > > > > > == Required Resources ==
> > > > > > > > >
> > > > > > > > > === Mailing Lists ===
> > > > > > > > >
> > > > > > > > > We currently use a mix of mailing lists. We will migrate
> our
> > > > existing
> > > > > > > > > mailing lists to the following:
> > > > > > > > >
> > > > > > > > > * dev@datasketches.incubator.apache.org
> > > > > > > > >
> > > > > > > > > * user@datasketches.incubator.apache.org
> > > > > > > > >
> > > > > > > > > * private@datasketches.incubator.apache.org
> > > > > > > > >
> > > > > > > > > * commits@datasketches.incubator.apache.org
> > > > > > > > >
> > > > > > > > > === Source Control ===
> > > > > > > > >
> > > > > > > > > The DataSketches team currently uses Git and would like to
> > > > continue
> > > > > > to
> > > > > > > do
> > > > > > > > > so. We request a Git repository for DataSketches with
> > mirroring
> > > > to
> > > > > > > GitHub
> > > > > > > > > > enabled similar to the following:
> > > > > > > > >
> > > > > > > > > * https://github.com/apache/incubator-datasketches.git
> > > > > > > > >
> > > > > > > > > === Issue Tracking ===
> > > > > > > > >
> > > > > > > > > We request the creation of an Apache-hosted JIRA. The
> > > > DataSketches
> > > > > > > project
> > > > > > > > > is currently using the public GitHub issue tracker and the
> > > public
> > > > > > > Google
> > > > > > > > > Groups forum/sketches-user for issue tracking and
> > discussions.
> > > We
> > > > > > will
> > > > > > > > > migrate and combine from these two sources to the Apache
> > JIRA.
> > > > > > > > >
> > > > > > > > > Proposed Jira ID: DATASKETCHES
> > > > > > > > >
> > > > > > > > > == Initial Committers ==
> > > > > > > > >
> > > > > > > > > The following list of individuals have been extremely
> active
> > in
> > > > our
> > > > > > > > > community and should have write (commit) permissions to the
> > > > > > repository.
> > > > > > > > >
> > > > > > > > > * Eshcar Hillel                      [eshcar at
> verizonmedia
> > > dot
> > > > com]
> > > > > > > > >
> > > > > > > > > * Kevin Lang                    [langk at verizonmedia dot
> > com]
> > > > > > > > >
> > > > > > > > > * Roman Leventov              [roman.leventov at
> > c.metamarkets
> > > > dot
> > > > > > com]
> > > > > > > > >
> > > > > > > > > * Edo Liberty                   [libertye at amazon dot
> com]
> > > > > > > > >
> > > > > > > > > * Jon Malkin                    [jmalkin at verizonmedia
> dot
> > > com]
> > > > > > > > >
> > > > > > > > > * Lee Rhodes                  [lrhodes at verizonmedia dot
> > > com] &
> > > > > > > [leerho
> > > > > > > > > at gmail dot com]
> > > > > > > > >
> > > > > > > > > * Alexander Saydakov         [saydakov at verizonmedia dot
> > com]
> > > > > > > > >
> > > > > > > > > * Justin Thaler                 [justin.thaler at
> georgetown
> > > dot
> > > > edu]
> > > > > > > > >
> > > > > > > > > == Affiliations ==
> > > > > > > > >
> > > > > > > > > The initial committers are from four organizations: Yahoo,
> > > > Amazon,
> > > > > > > > > Georgetown University, and Metamarkets/Snap.
> > > > > > > > >
> > > > > > > > > === Champion ===
> > > > > > > > > (Recommended to me: )
> > > > > > > > >
> > > > > > > > > Liang Chen, Vice President of Apache CarbonData,
> > [chenliang613
> > > at
> > > > > > > apache
> > > > > > > > > dot org]
> > > > > > > > > > Jean-Baptiste Onofré, [jb at nanthrax dot net]
> > > > > > > > >
> > > > > > > > > === Nominated Mentors ===
> > > > > > > > > (Recommended to me: )
> > > > > > > > >
> > > > > > > > > Liang Chen, Vice President of Apache CarbonData,
> > [chenliang613
> > > at
> > > > > > > apache
> > > > > > > > > dot org]
> > > > > > > > > Jean-Baptiste Onofré, jb at nanthrax dot net
> > > > > > > > > Gil Yehuda, gyehuda at verizonmedia dot com
> > > > > > > > >
> > > > > > > > > === Sponsoring Entity ===
> > > > > > > > >
> > > > > > > > > * The Apache Incubator    **** This is our 1st choice ****
> > > > > > > > >
> > > > > > > > > * Apache Druid. The incubating Apache Druid project might
> > also
> > > > be a
> > > > > > > logical
> > > > > > > > > sponsor. However, DataSketches has applications in many
> areas
> > > of
> > > > > > > computing
> > > > > > > > > outside of Druid so our preference and recommendation is
> that
> > > > > > > DataSketches
> > > > > > > > > would ultimately be a top-level Apache project.
> > > > > > > > >
> > > > > > > > > ________________
> > > > > > > > > [1] In 2017 Verizon acquired Yahoo and merged it with
> > > previously
> > > > > > > acquired
> > > > > > > > > AOL. The merged entity was originally called Oath, Inc.,
> but
> > > has
> > > > > > > recently
> > > > > > > > > been renamed Verizon Media, Inc., a wholly-owned subsidiary
> > of
> > > > > > Verizon,
> > > > > > > > > Inc.  Since Yahoo is the more recognized name, references
> in
> > > this
> > > > > > > document
> > > > > > > > > > to Yahoo are also references to Verizon Media, Inc.
> > > > > > > > >
> > > > > > > > > On Fri, Feb 22, 2019 at 9:35 PM Kenneth Knowles <
> > > kenn@apache.org
> > > > >
> > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > The subject line has me interested already. Follow
> examples
> > > > like
> > > > > > this
> > > > > > > > > > maybe?
> > > > > > > > > >
> > > > > > > > > > 1.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > > >
> > > >
> > >
> >
> https://lists.apache.org/thread.html/a5db74cc9e5ae89b3bfa5f4b07bfcc18dae84b7098232fb897cd47b7@%3Cgeneral.incubator.apache.org%3E
> > > > > > > > > > 2.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > > >
> > > >
> > >
> >
> https://lists.apache.org/thread.html/5a7f6a218b11a1cac61fbd53f4c995fd7716f8ad3751cf9f171ebd57@%3Cgeneral.incubator.apache.org%3E
> > > > > > > > > >
> > > > > > > > > > Kenn
> > > > > > > > > >
> > > > > > > > > > On Fri, Feb 22, 2019 at 8:05 PM leerho <leerho@gmail.com
> >
> > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > I'll try again ... :)
> > > > > > > > > > >
> > > > > > > > > > > On Fri, Feb 22, 2019 at 8:00 PM Ted Dunning <
> > > > > > ted.dunning@gmail.com
> > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > >> It didn't make it again
> > > > > > > > > > >>
> > > > > > > > > > >> On Fri, Feb 22, 2019, 8:35 PM leerho <
> leerho@gmail.com>
> > > > wrote:
> > > > > > > > > > >>
> > > > > > > > > > >> > I'm not sure the attached document made it through.
> > > > > > > > > > >> >
> > > > > > > > > > >> > On Fri, Feb 22, 2019 at 7:28 PM leerho <
> > > leerho@gmail.com>
> > > > > > > wrote:
> > > > > > > > > > >> >
> > > > > > > > > > >> > >
> > > > > > > > > > >> > >
> > > > > > > > > > >> >
> > > > > > > > > > >>
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > >
> > > ---------------------------------------------------------------------
> > > > > > > > > > > To unsubscribe, e-mail:
> > > > general-unsubscribe@incubator.apache.org
> > > > > > > > > > > For additional commands, e-mail:
> > > > > > general-help@incubator.apache.org
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > >
> > > > > --
> > > > > From my cell phone.
> > > > >
> > > >
> > > >
> > > >
> > >
> >
>
-- 
From my cell phone.

Re: DataSketches Proposal - Google Docs Link

Posted by Jim Apple <jb...@cloudera.com.INVALID>.

You could use a Google account that is not under Yahoo’s control, then let
anyone in the world add a comment, maybe.

On Mon, Feb 25, 2019 at 3:26 PM leerho <le...@gmail.com> wrote:

> Ken,
> Yahoo does not allow me to create a shared link outside our company, except
> to individual email addresses.  So attempting to share it to the email
> general@incubator.apache.org may not work.  Nonetheless, several
> individuals were able to request access using their individual email
> accounts and I was able to add them.  I will try to add you using
> kenn@apache.org, but if that doesn't work, I may need a gmail or
> equivalent
> account for you.
>
> Lee.
>
>
> On Mon, Feb 25, 2019 at 2:59 PM Kenneth Knowles <ke...@apache.org> wrote:
>
> > I could not access that document. I suggest you need to turn on link
> > sharing.
> >
> > Kenn
> >
> > On Mon, Feb 25, 2019 at 12:00 PM leerho@gmail.com <le...@gmail.com>
> > wrote:
> >
> > > Try this link:
> > >
> >
> https://docs.google.com/document/d/19JKevzFQNcaLA51LFLUlP1hzdFDW7oDJrJO8N6weDv8/edit?usp=sharing
> > >
> > >
> > > On 2019/02/25 05:55:50, leerho <le...@gmail.com> wrote:
> > > > Yes I will try that tomorrow.
> > > >
> > > > On Sun, Feb 24, 2019 at 7:34 PM Kenneth Knowles <ke...@apache.org>
> > wrote:
> > > >
> > > > > Can you share the Google doc with the proposal? Per Ted's advice,
> we
> > > can
> > > > > iterate quickly there and move it to the wiki when it becomes a bit
> > > more
> > > > > stable.
> > > > >
> > > > > Kenn
> > > > >
> > > > > On Fri, Feb 22, 2019 at 10:21 PM leerho@gmail.com <
> leerho@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > > Thanks for the offer.  I am a neophyte at this process and email
> > > app!   I
> > > > > > could use a lot of help getting this off the ground!  Also, I'm
> not
> > > sure
> > > > > > that Mr. Chen and Mr. Onofré have fully accepted taking this on
> :)
> > > > > >
> > > > > > Lee.
> > > > > >
> > > > > > On 2019/02/23 06:03:58, Kenneth Knowles <ke...@apache.org> wrote:
> > > > > > > Nice.
> > > > > > >
> > > > > > > I would very much like to help mentor this project, though you
> > > already
> > > > > > have
> > > > > > > a couple good ones.
> > > > > > >
> > > > > > > I concur with incubator as sponsoring entity.
> > > > > > >
> > > > > > > Kenn (VP Apache Beam)
> > > > > > >
> > > > > > > On Fri, Feb 22, 2019 at 9:45 PM leerho <le...@gmail.com>
> wrote:
> > > > > > >
> > > > > > > > I didn't realize that this mail list does not accept PDF
> files,
> > > > > > apparently
> > > > > > > > only text.  So let me try one more time ... :)  Please let me
> > > know if
> > > > > > > > this works!
> > > > > > > >
> > > > > > > >
> > > > > > > > = Apache DataSketches Proposal[1] =
> > > > > > > >
> > > > > > > > == Abstract ==
> > > > > > > >
> > > > > > > > DataSketches.GitHub.io is an open source, high-performance
> > > library
> > > > > of
> > > > > > > > stochastic streaming algorithms commonly called "sketches" in
> > the
> > > > > data
> > > > > > > > sciences. Sketches are small, stateful programs that process
> > > massive
> > > > > > data
> > > > > > > > as a stream and can provide approximate answers, with
> > > mathematical
> > > > > > > > guarantees, to computationally difficult queries
> > > orders-of-magnitude
> > > > > > faster
> > > > > > > > than traditional, exact methods.
> > > > > > > >
> > > > > > > > This proposal is to move DataSketches to the Apache Software
> > > > > > > > Foundation(ASF) transferring ownership of its copyright
> > > intellectual
> > > > > > > > property to the ASF.  Thereafter, DataSketches would be
> > > officially
> > > > > > known as
> > > > > > > > Apache DataSketches and its evolution and governance would
> come
> > > under
> > > > > > the
> > > > > > > > rules and guidance of the ASF.
> > > > > > > >
> > > > > > > > == Introduction ==
> > > > > > > >
> > > > > > > > The DataSketches library contains carefully crafted
> > > implementations
> > > > > of
> > > > > > > > sketch algorithms that meet rigorous standards of quality and
> > > > > > performance
> > > > > > > > and provide capabilities required for large-scale production
> > > systems
> > > > > > that
> > > > > > > > must process and analyze massive data. The DataSketches core
> > > > > > repository is
> > > > > > > > written in Java with a parallel core repository written in
> C++
> > > that
> > > > > > > > includes Python wrappers. The DataSketches library also
> > includes
> > > > > > special
> > > > > > > > repositories for extending the core library for Apache Hive
> and
> > > > > Apache
> > > > > > Pig.
> > > > > > > > The sketches developed in the different languages share a
> > common
> > > > > binary
> > > > > > > > storage format so that sketches created and stored in Java,
> for
> > > > > > example,
> > > > > > > > > can be fully used in C++, and vice versa.  Because the stored
> > > sketch
> > > > > > > > "images" are just a "blob" of bytes (similar to picture
> > images),
> > > they
> > > > > > can
> > > > > > > > be shared across many different systems, languages and
> > platforms.
> > > > > > > >
> > > > > > > > The DataSketches documentation website,
> > > > > https://datasketches.github.io
> > > > > > ,
> > > > > > > > includes general tutorials, a comprehensive research section
> > with
> > > > > > > > references to relevant academic papers, extensive examples
> for
> > > using
> > > > > > the
> > > > > > > > core library directly as well as examples for accessing the
> > > library
> > > > > in
> > > > > > > > Hive, Pig, and Apache Spark.
> > > > > > > >
> > > > > > > > The DataSketches library also includes a characterization
> > > repository
> > > > > > for
> > > > > > > > long running test programs that are used for studying
> accuracy
> > > and
> > > > > > > > performance of these sketches over wide ranges of input
> > > variables.
> > > > > The
> > > > > > data
> > > > > > > > produced by these programs is used for generating the many
> > > > > performance
> > > > > > > > plots contained in the documentation website and for academic
> > > > > > > > publications.
> > > > > > > >
> > > > > > > > The code repositories used for production are versioned and
> > > published
> > > > > > to
> > > > > > > > Maven Central on periodic intervals as the library evolves.
> > > > > > > >
> > > > > > > > The DataSketches library also includes several experimental
> > > > > > repositories
> > > > > > > > for use-cases outside the large-scale systems environments,
> > such
> > > as
> > > > > > > > sketches for mobile, IoT devices (Android), command-line
> access
> > > of
> > > > > the
> > > > > > > > sketch library, and an experimental repository for
> vector-based
> > > > > > sketches
> > > > > > > > that performs approximate Singular Value Decomposition (SVD)
> > > analysis
> > > > > > that
> > > > > > > > could potentially be used in Machine Learning (ML)
> > applications.
> > > > > > > >
> > > > > > > > == Background ==
> > > > > > > >
> > > > > > > > > The DataSketches library was started in 2012 as an internal
> Yahoo
> > > > > project
> > > > > > to
> > > > > > > > dramatically reduce time and resources required for distinct
> > > (unique)
> > > > > > > > counting.  An extensive search on the Internet at the time
> > > yielded a
> > > > > > number
> > > > > > > > of theoretical papers on stochastic streaming algorithms with
> > > > > > pseudocode
> > > > > > > > examples, but we did not find any usable open-source code of
> > the
> > > > > > quality we
> > > > > > > > felt we needed for our internal production systems.  So we
> > > started a
> > > > > > small
> > > > > > > > project (one person) to develop our own sketches working
> > directly
> > > > > from
> > > > > > > > published theoretical papers.
> > > > > > > >
> > > > > > > > The DataSketches library was designed from the start with the
> > > > > > objective of
> > > > > > > > making these algorithms, usually only described in
> theoretical
> > > > > papers,
> > > > > > > > easily accessible to systems developers for use in our
> internal
> > > > > > production
> > > > > > > > systems. By necessity, the code had to be of the highest
> > quality
> > > and
> > > > > > > > thoroughly tested. The wide variety of our internal
> production
> > > > > systems
> > > > > > > > drove the requirement that the sketch implementations had to
> > > have an
> > > > > > > > absolute minimum of external, run-time dependencies in order
> to
> > > > > > simplify
> > > > > > > > integration and troubleshooting.
> > > > > > > >
> > > > > > > > Our internal experiments demonstrated dramatic positive
> impact
> > > on the
> > > > > > > > performance of our systems.  As a result, the DataSketches
> > > library
> > > > > > quickly
> > > > > > > > evolved to include different types of sketches for different
> > > types of
> > > > > > > > > queries, such as frequent-items (a.k.a. heavy-hitters)
> > > algorithms,
> > > > > > > > quantile/histogram algorithms, and weighted and unweighted
> > > sampling
> > > > > > > > algorithms.
> > > > > > > >
> > > > > > > > We quickly discovered that developing these sketch algorithms
> > to
> > > be
> > > > > > truly
> > > > > > > > robust in production environments is quite difficult and
> > requires
> > > > > deep
> > > > > > > > understanding of the underlying mathematics and statistics as
> > > well as
> > > > > > > > extensive experience in developing high quality code for 24/7
> > > > > > production
> > > > > > > > systems. This is a difficult combination of skills for any
> one
> > > > > > organization
> > > > > > > > to collect and maintain over time. It became clear that this
> > > > > technology
> > > > > > > > needed a community larger than Yahoo to evolve.  In November,
> > > 2015,
> > > > > > this
> > > > > > > > factor, along with Yahoo’s strong experience and support of
> > open
> > > > > > source,
> > > > > > > > led to the decision to open source this technology under an
> > > Apache
> > > > > 2.0
> > > > > > > > license on GitHub. Since that time our community has expanded
> > > > > > considerably
> > > > > > > > > and the key contributors to this effort include leading
> > research
> > > > > > > > scientists from a number of universities as well as
> > > practitioners and
> > > > > > > > researchers from a number of major corporations. The core of
> > this
> > > > > > group is
> > > > > > > > very active as we meet weekly to discuss research directions
> > and
> > > > > > > > engineering priorities.
> > > > > > > >
> > > > > > > > It is important to note that our internal systems at Yahoo
> use
> > > the
> > > > > > current
> > > > > > > > public GitHub open source DataSketches library and not an
> > > internal
> > > > > > version
> > > > > > > > of the code.
> > > > > > > >
> > > > > > > > The close collaboration of scientific research and
> engineering
> > > > > > development
> > > > > > > > experience with actual massive-data processing systems has
> also
> > > > > > produced
> > > > > > > > new research publications in the field of stochastic
> streaming
> > > > > > algorithms,
> > > > > > > > for example:
> > > > > > > >
> > > > > > > > * Daniel Anderson, Pryce Bevan, Kevin J. Lang, Edo Liberty,
> Lee
> > > > > > Rhodes, and
> > > > > > > > Justin Thaler. A high-performance algorithm for identifying
> > > frequent
> > > > > > items
> > > > > > > > in data streams. In ACM IMC 2017.
> > > > > > > >
> > > > > > > > * Anirban Dasgupta, Kevin J. Lang, Lee Rhodes, and Justin
> > > Thaler. A
> > > > > > > > framework for estimating stream expression cardinalities. In
> > > > > > EDBT/ICDT
> > > > > > > > > Proceedings ‘16, pages 6:1–6:17, 2016.
> > > > > > > >
> > > > > > > > * Mina Ghashami, Edo Liberty, Jeff M. Phillips. Efficient
> > > Frequent
> > > > > > > > Directions Algorithm for Sparse Matrices. In ACM SIGKDD
> > > Proceedings
> > > > > > ‘16,
> > > > > > > > pages 845-854, 2016.
> > > > > > > >
> > > > > > > > * Zohar S. Karnin, Kevin J. Lang, and Edo Liberty. Optimal
> > > quantile
> > > > > > > > approximation in streams. In IEEE FOCS Proceedings ‘16, pages
> > > 71–78,
> > > > > > 2016.
> > > > > > > >
> > > > > > > > * Kevin J Lang. Back to the future: an even more nearly
> optimal
> > > > > > cardinality
> > > > > > > > estimation algorithm. arXiv preprint
> > > > > https://arxiv.org/abs/1708.06839,
> > > > > > > > 2017.
> > > > > > > >
> > > > > > > > * Edo Liberty. Simple and deterministic matrix sketching. In
> > ACM
> > > KDD
> > > > > > > > > Proceedings ‘13, pages 581–588, 2013.
> > > > > > > >
> > > > > > > > * Edo Liberty, Michael Mitzenmacher, Justin Thaler, and
> > Jonathan
> > > > > > Ullman.
> > > > > > > > Space lower bounds for itemset frequency sketches. In ACM
> PODS
> > > > > > Proceedings
> > > > > > > > ‘16, pages 441–454, 2016.
> > > > > > > >
> > > > > > > > * Michael Mitzenmacher, Thomas Steinke, and Justin Thaler.
> > > > > Hierarchical
> > > > > > > > heavy hitters with the space saving algorithm. In SIAM ALENEX
> > > > > > Proceedings
> > > > > > > > ‘12, pages 160–174, 2012.
> > > > > > > >
> > > > > > > > == The Rationale for Sketches ==
> > > > > > > >
> > > > > > > > In the analysis of big data there are often problem queries
> > that
> > > > > don’t
> > > > > > > > scale because they require huge compute resources and time to
> > > > > generate
> > > > > > > > exact results. Examples include count distinct, quantiles,
> most
> > > > > > frequent
> > > > > > > > items, joins, matrix computations, and graph analysis.
> > > > > > > >
> > > > > > > > If we can loosen the requirement of “exact” results from our
> > > queries
> > > > > > and be
> > > > > > > > satisfied with approximate results, within some well
> understood
> > > > > bounds
> > > > > > of
> > > > > > > > error, there is an entire branch of mathematics and data
> > science
> > > that
> > > > > > has
> > > > > > > > evolved around developing algorithms that can produce
> > approximate
> > > > > > results
> > > > > > > > with mathematically well-defined error properties.
> > > > > > > >
> > > > > > > > With the additional requirements that these algorithms must
> be
> > > small
> > > > > > > > (compared to the size of the input data), sublinear (the size
> > of
> > > the
> > > > > > sketch
> > > > > > > > must grow at a slower rate than the size of the input
> stream),
> > > > > > streaming
> > > > > > > > (they can only touch each data item once), and mergeable
> > > (suitable
> > > > > for
> > > > > > > > > distributed processing), we define a class of algorithms that
> can
> > > be
> > > > > > > > described as small, stochastic, streaming, sublinear
> mergeable
> > > > > > algorithms,
> > > > > > > > commonly called sketches (they also have other names, but we
> > > will use
> > > > > > the
> > > > > > > > term sketches from here on).
> > > > > > > >
> > > > > > > > To be truly streaming and be able to process data in a single
> > > pass,
> > > > > > > > sketches must make absolute minimum assumptions about the
> input
> > > > > stream.
> > > > > > > > This is critically important, as there is no “second chance”
> to
> > > > > > process the
> > > > > > > > data.
> > > > > > > >
> > > > > > > > For example, sketches should not make assumptions about the
> > > order of
> > > > > > stream
> > > > > > > > items, the stream length, the dynamic range of values, or the
> > > > > > distribution
> > > > > > > > of item occurrence frequencies. Sketches should be tolerant
> of
> > > NaNs,
> > > > > > Nulls
> > > > > > > > and empty objects. About the only thing that the sketch needs
> > to
> > > know
> > > > > > about
> > > > > > > > the stream is how to extract items from it and what type the
> > > item is,
> > > > > > e.g.,
> > > > > > > > is it a numeric value or a string.
> > > > > > > >
> > > > > > > > As far as the sketch is concerned, the input stream is a
> > > sequence of
> > > > > > items
> > > > > > > > in some unknown random order with unknown random values.
> > > > > > > >
> > > > > > > > The sketch is essentially a complex state machine and
> combined
> > > with
> > > > > the
> > > > > > > > random input stream defines a stochastic process. We then
> apply
> > > > > > > > probabilistic methods to interpret the states of the
> stochastic
> > > > > > process in
> > > > > > > > order to extract useful information about the input stream
> > > itself.
> > > > > The
> > > > > > > > resulting information will be approximate, but we also use
> > > additional
> > > > > > > > probabilistic methods to extract an estimate of the likely
> > > > > probability
> > > > > > > > distribution of error.
> > > > > > > >
> > > > > > > > There is a significant scientific contribution here that is
> > > defining
> > > > > > the
> > > > > > > > state machine, understanding the resulting stochastic
> process,
> > > > > > developing
> > > > > > > > > the probabilistic methods, and proving mathematically that
> it
> > > all
> > > > > > works!
> > > > > > > > This is why the scientific contributors to this project are a
> > > > > critical
> > > > > > and
> > > > > > > > strategic component to our success.  The development
> engineers
> > > > > > translate
> > > > > > > > the concepts of the proposed state machine and probabilistic
> > > methods
> > > > > > into
> > > > > > > > production-quality code. Even more important, they work
> closely
> > > with
> > > > > > the
> > > > > > > > scientists, feeding back system and user requirements, which
> > > leads
> > > > > not
> > > > > > only
> > > > > > > > to superior product design, but to new science as well.  A
> > > number of
> > > > > > > > > scientific papers our members have published (see above) are a
> > > direct
> > > > > > result
> > > > > > > > of this close collaboration.
> > > > > > > >
> > > > > > > > Because sketches are small they can be processed extremely
> > fast,
> > > > > often
> > > > > > many
> > > > > > > > orders-of-magnitude faster than traditional exact
> computations.
> > > For
> > > > > > > > interactive queries there may not be other viable
> alternatives,
> > > and
> > > > > in
> > > > > > the
> > > > > > > > case of real-time analysis, sketches are the only known
> > solution.
> > > > > > > >
> > > > > > > > For any system that needs to extract useful information from
> > > massive
> > > > > > > data,
> > > > > > > > sketches are essential tools that should be tightly
> integrated
> > > into
> > > > > the
> > > > > > > > system’s analysis capabilities. This technology has helped
> > Yahoo
> > > > > > > > successfully reduce data processing times from days to hours
> or
> > > > > > minutes on
> > > > > > > > a number of its internal platforms and has enabled subsecond
> > > queries
> > > > > on
> > > > > > > > real-time platforms that would have been infeasible without
> > > sketches.
> > > > > > > > >
> > > > > > > > > == The Rationale for Apache DataSketches ==
> > > > > > > > >
> > > > > > > > Other open source implementations of sketch algorithms can be
> > > found
> > > > > on
> > > > > > the
> > > > > > > > Internet. However, we have not yet found any open source
> > > > > > implementations
> > > > > > > > that are as comprehensive, engineered with the quality
> required
> > > for
> > > > > > > > production systems, and with usable and guaranteed error
> > > properties.
> > > > > > Large
> > > > > > > > Internet companies, such as Google and Facebook, have
> published
> > > > > papers
> > > > > > on
> > > > > > > > > sketching; however, their implementations of their published
> > > > > > algorithms are
> > > > > > > > proprietary and not available as open source.
> > > > > > > >
> > > > > > > > The DataSketches library already provides integrations with a
> > > number
> > > > > of
> > > > > > > > major Apache data processing platforms such as Apache Hive,
> > > Apache
> > > > > Pig,
> > > > > > > > Apache Spark and Apache Druid, and is also integrated with a
> > > number
> > > > > of
> > > > > > > > other open source data processing platforms such as Splice
> > > Machine,
> > > > > > GCHQ
> > > > > > > > Gaffer and PostgreSQL.
> > > > > > > >
> > > > > > > > We believe that having DataSketches as an Apache project will
> > > provide
> > > > > > an
> > > > > > > > immediate, worthwhile, and substantial contribution to the
> open
> > > > > source
> > > > > > > > community, will have a better opportunity to provide a
> > meaningful
> > > > > > > > contribution to both the science and engineering of sketching
> > > > > > algorithms,
> > > > > > > > and integrate with other Apache projects.  In addition, this
> > is a
> > > > > > > > significant opportunity for Apache to be the "go-to"
> > destination
> > > for
> > > > > > users
> > > > > > > > that want to leverage this exciting technology.
> > > > > > > >
> > > > > > > > == Initial Goals ==
> > > > > > > >
> > > > > > > > We are breaking our initial goals into short-term (2-6
> months)
> > > and
> > > > > > > > > intermediate to long-term (6 months to 2 years):
> > > > > > > >
> > > > > > > > Our short-term goals include:
> > > > > > > >
> > > > > > > > * Understanding and adapting to the Apache development
> process
> > > and
> > > > > > > > structures.
> > > > > > > >
> > > > > > > > * Start refactoring codebase and move various DataSketches
> > > > > repositories
> > > > > > > > code to Apache Git repository.
> > > > > > > >
> > > > > > > > * Continue development of new features, functions, and fixes.
> > > > > > > >
> > > > > > > > * Specific sub-projects (e.g., C++ and Python) will continue
> to
> > > be
> > > > > > > > developed and expanded.
> > > > > > > >
> > > > > > > >
> > > > > > > > The intermediate to long term goals include:
> > > > > > > >
> > > > > > > > * Completing the design and implementation of the C++
> sketches
> > to
> > > > > > > > complement what is already available in Java, and the Python
> > > wrappers
> > > > > > of
> > > > > > > > those C++ sketches.
> > > > > > > >
> > > > > > > > * Expanding the C++ build framework to include Windows and
> the
> > > > > popular
> > > > > > > > Linux variants.
> > > > > > > >
> > > > > > > > * Continued engagement with the scientific research community
> > on
> > > the
> > > > > > > > development of new algorithms for computationally difficult
> > > problems
> > > > > > that
> > > > > > > > heretofore have not had a sketching solution.
> > > > > > > >
> > > > > > > > == Current Status ==
> > > > > > > >
> > > > > > > > The DataSketches GitHub project has been quite successful.
> As
> > of
> > > > > this
> > > > > > > > writing (Feb, 2019) the number of downloads measured by the
> > Nexus
> > > > > > > > Repository Manager at https://oss.sonatype.org has grown by
> > > nearly a
> > > > > > > > factor
> > > > > > > > of 10 over the past year to about 55 thousand per month. The
> > > > > > > > DataSketches/sketches-core repository has about 560 stars and
> > 141
> > > > > > forks,
> > > > > > > > which is pretty good for a highly specialized library.
> > > > > > > >
> > > > > > > > === Development Practices ===
> > > > > > > >
> > > > > > > > ==== Source Control ====
> > > > > > > >
> > > > > > > > All of our developers have extensive experience with Git
> > version
> > > > > > control
> > > > > > > > and follow accepted practices for use of Pull Requests (PRs),
> > > code
> > > > > > reviews
> > > > > > > > and commits to master, for example.
> > > > > > > >
> > > > > > > > ==== Testing ====
> > > > > > > >
> > > > > > > > > Sketches, by their nature, are probabilistic programs and
> don’t
> > > > > > necessarily
> > > > > > > > behave deterministically.  For some of the sketches we
> > > intentionally
> > > > > > insert
> > > > > > > > random noise into the code as this gives us the mathematical
> > > > > properties
> > > > > > > > that we need to guarantee accuracy.  This can make the
> behavior
> > > of
> > > > > > these
> > > > > > > > algorithms quite unintuitive and provides significant
> > challenges
> > > to
> > > > > the
> > > > > > > > developer who wishes to test these algorithms for
> correctness.
> > > As a
> > > > > > result,
> > > > > > > > our testing strategy includes two major components: unit
> tests,
> > > and
> > > > > > > > characterization tests.
> > > > > > > >
> > > > > > > > ===== Unit Testing =====
> > > > > > > >
> > > > > > > > Our unit tests are primarily quick tests to make sure that we
> > > > > exercise
> > > > > > all
> > > > > > > > critical paths in the code and that key branches are executed
> > > > > > correctly. It
> > > > > > > > is important that they execute relatively fast as they are
> > > generally
> > > > > > run on
> > > > > > > > every code build. The sketches-core repository alone has
> about
> > 22
> > > > > > thousand
> > > > > > > > statements, over 1300 unit tests and code coverage of about
> > > 98.2% as
> > > > > > > > measured by Atlassian/Clover.  It is our goal for all of our
> > code
> > > > > > > > repositories that are used in production that they have code
> > > coverage
> > > > > > > > greater than 90%.
> > > > > > > >
> > > > > > > > ===== Characterization Testing =====
> > > > > > > >
> > > > > > > > In order to test the probabilistic methods that are used to
> > > interpret
> > > > > > the
> > > > > > > > stochastic behaviors of our sketches we have a separate
> > > > > > characterization
> > > > > > > > repository that is dedicated to this.  To measure accuracy,
> for
> > > > > > example,
> > > > > > > > requires running thousands of trials at each of many
> different
> > > points
> > > > > > along
> > > > > > > > the domain axis. Each trial compares its estimated results
> > > against a
> > > > > > known
> > > > > > > > exact result producing an error for that trial.  These error
> > > > > > measurements
> > > > > > > > are then fed into our Quantiles sketch to capture the actual
> > > > > > distribution
> > > > > > > > of error at that point along the axis. We then select
> quantile
> > > > > contours
> > > > > > > > across all the distributions at points along the axis.  These
> > > > > contours
> > > > > > can
> > > > > > > > then be plotted to reveal the shape of the actual error
> > > distribution.
> > > > > > These
> > > > > > > > distributions are not at all Gaussian, in fact they can be
> > quite
> > > > > > complex.
> > > > > > > > Nonetheless, these distributions are then checked against our
> > > > > > statistical
> > > > > > > > guarantees inherent to the specific sketch algorithm and its
> > > > > > parameters.
> > > > > > > > There are many examples of these characterization error
> > > distributions
> > > > > > on
> > > > > > > > our website. The runtimes of these tests can be very long and
> > can
> > > > > range
> > > > > > > > from many minutes to hours, and some can run for days.
> > > Currently, we
> > > > > > have
> > > > > > > > separate characterization repositories for Java and C++ /
> > Python.
> > > > > > > >
> > > > > > > > It is our goal that we perform this characterization analysis
> > > for all
> > > > > > of
> > > > > > > > our sketches.  By definition, the code that runs these
> > > > > characterization
> > > > > > > > tests is open-source so others can run these tests as well.
> We
> > > do
> > > > > not
> > > > > > have
> > > > > > > > formal releases of this code (because it is not production
> > code)
> > > and
> > > > > > it is
> > > > > > > > not published to Maven Central.
> > > > > > > >
> > > > > > > > === Meritocracy ===
> > > > > > > >
> > > > > > > > DataSketches was initially developed based on requirements
> > within
> > > > > > Yahoo. As
> > > > > > > > a project on GitHub, DataSketches has received contributions
> > from
> > > > > > numerous
> > > > > > > > individual developers from around the world, dedicated
> research
> > > work
> > > > > > from
> > > > > > > > senior scientists at Amazon and Visa, and academic
> researchers
> > > from
> > > > > > > > Georgetown University, Princeton, and MIT.
> > > > > > > >
> > > > > > > > As a project under incubation, we are committed to expanding
> > our
> > > > > > effort to
> > > > > > > > build an environment which supports a meritocracy. We are
> > > focused on
> > > > > > > > engaging the community and other related projects for support
> > and
> > > > > > > > > contributions. Moreover, we are committed to ensuring
> > contributors
> > > and
> > > > > > > > committers to DataSketches come from a broad mix of
> > organizations
> > > > > > through a
> > > > > > > > merit-based decision process during incubation. We believe
> > > strongly
> > > > > in
> > > > > > the
> > > > > > > > DataSketches premise that fulfills the concept of a well
> > > engineered
> > > > > and
> > > > > > > > scientifically rigorous library that implements these
> powerful
> > > > > > algorithms
> > > > > > > > and are committed to growing an inclusive community of
> > > DataSketches
> > > > > > > > contributors and users.
> > > > > > > >
> > > > > > > > === Community ===
> > > > > > > >
> > > > > > > > Yahoo has a long history and active engagement in the Open
> > Source
> > > > > > > > community. Major projects include: Vespa.ai, Bullet, Moloch,
> > > > > Panoptes,
> > > > > > > > Screwdriver.cd, Athenz, HaloDB, Maha, Mendel,
> > TensorFlowOnSpark,
> > > > > > gifshot,
> > > > > > > > fluxible, as well as the creation, contribution and
> incubation
> > of
> > > > > many
> > > > > > > > Apache projects such as Apache Hadoop, Pig, Bookkeeper,
> Oozie,
> > > > > > Zookeeper,
> > > > > > > > Omid, Pulsar, Traffic Server, Storm, Druid, and many more.
> > > > > > > >
> > > > > > > > > Every day, DataSketches is actively used by organizations
> and
> > > > > > > > institutions around the world for batch and stream processing
> > of
> > > > > data.
> > > > > > We
> > > > > > > > believe acceptance will allow us to consolidate existing
> > > > > > > > DataSketches-related work, grow the DataSketches community,
> and
> > > > > deepen
> > > > > > > > connections between DataSketches and other open source
> > projects.
> > > > > > > >
> > > > > > > > === Introduction to the Core Developers & Contributors ===
> > > > > > > >
> > > > > > > > The core developers and contributors for DataSketches are
> from
> > > > > diverse
> > > > > > > > backgrounds, but primarily are scientists that love
> engineering
> > > and
> > > > > > > > engineers that love science. A large part of the value we
> bring
> > > comes
> > > > > > from
> > > > > > > > this synthesis.  These individuals have already contributed
> > > > > > substantially
> > > > > > > > to the code, algorithms, and/or mathematical proofs that form
> > the
> > > > > > basis of
> > > > > > > > the library.
> > > > > > > >
> > > > > > > > > This core group also forms the Initial Committers with write
> > > > > > permissions to
> > > > > > > > > the repository. Those marked with (*) meet weekly to plan the
> > > > > research
> > > > > > and
> > > > > > > > engineering direction of the project.
> > > > > > > >
> > > > > > > > ==== Scientists That Love Engineering ====
> > > > > > > >
> > > > > > > > * Eshcar Hillel: Senior Research Scientist, Yahoo Labs,
> Israel.
> > > > > > Interests:
> > > > > > > > distributed systems, scalable systems and platforms for big
> > data
> > > > > > > > > processing, concurrent algorithms and data structures.
> > > > > > > >
> > > > > > > > * Kevin Lang: (*) Distinguished Research Scientist, Yahoo
> Labs,
> > > > > > Sunnyvale,
> > > > > > > > California. Interests: algorithms, theoretical and applied
> > > > > mathematics,
> > > > > > > > encoding and compression theory, theoretical and applied
> > > performance
> > > > > > > > optimization.
> > > > > > > >
> > > > > > > > * Edo Liberty: (*) Director of Research, Head of Amazon AI
> > Labs,
> > > Palo
> > > > > > Alto,
> > > > > > > > California. Manages the algorithms group at Amazon AI. We
> build
> > > > > > scalable
> > > > > > > > machine learning systems and algorithms which are used both
> > > > > internally
> > > > > > and
> > > > > > > > externally by customers of SageMaker, AWS's flagship machine
> > > learning
> > > > > > > > platform.
> > > > > > > >
> > > > > > > > * Jon Malkin: (*) Senior Scientist, Yahoo Labs, Sunnyvale.
> > > Interests:
> > > > > > > > Computational advertising, machine learning, speech
> > recognition,
> > > > > > > > data-driven analysis, large scale experimentation, big data,
> > > > > > stream/complex
> > > > > > > > > event processing.
> > > > > > > >
> > > > > > > > * Justin Thaler: (*) Assistant Professor, Department of
> > Computer
> > > > > > Science,
> > > > > > > > Georgetown University, Washington D.C. Interests: algorithms
> > and
> > > > > > > > computational complexity, complexity theory, quantum
> > algorithms,
> > > > > > private
> > > > > > > > data analysis, and learning theory, developing efficient
> > > streaming
> > > > > and
> > > > > > > > > sketching algorithms.
> > > > > > > >
> > > > > > > > ==== Engineers That Love Science ====
> > > > > > > >
> > > > > > > > * Roman Leventov: Senior Software Engineer,  Metamarkets /
> > Snap.
> > > > > > Interests:
> > > > > > > > design and implementation of data storing and data processing
> > > > > > (distributed)
> > > > > > > > systems, performance optimization, CPU performance,
> mechanical
> > > > > > sympathy,
> > > > > > > > JVM performance, API design, databases, (concurrent) data
> > > structures,
> > > > > > > > memory management, garbage collection algorithms, language
> > > design and
> > > > > > > > runtimes (their tradeoffs), distributed systems (cloud)
> > > efficiency,
> > > > > > Linux,
> > > > > > > > code quality, code transformation, pure functional
> programming
> > > > > models,
> > > > > > > > Haskell.
> > > > > > > >
> > > > > > > > * Lee Rhodes: (*) Distinguished Architect, lead developer and
> > > founder
> > > > > > of
> > > > > > > > the DataSketches project, Yahoo, Sunnyvale, California.
> > > Interests:
> > > > > > > > streaming algorithms, mathematics, computer science, high
> > > quality and
> > > > > > high
> > > > > > > > performance code for the analysis of massive data, bridging
> the
> > > > > divide
> > > > > > > > between theory and practice.
> > > > > > > >
> > > > > > > > * Alexander Saydakov: (*) Senior Software Engineer, Yahoo,
> > > Sunnyvale,
> > > > > > > > California. Interests: applied mathematics, computer science,
> > big
> > > > > data,
> > > > > > > > distributed systems.
> > > > > > > >
> > > > > > > > === Introduction to Additional Interested Contributors ===
> > > > > > > >
> > > > > > > > These folks have been intermittently involved and
> contributed,
> > > but
> > > > > are
> > > > > > > > strong supporters of this project.
> > > > > > > >
> > > > > > > > * Frank Grimes: GitHub ID: frankgrimes97
> > > > > > > >
> > > > > > > > * Mina Ghashami: [mina.ghashami at gmail dot com] Ph.D.
> > Computer
> > > > > > Science,
> > > > > > > > Univ of Utah. Interests: Machine Learning, Data Mining,
> matrix
> > > > > > > > approximation, streaming algorithms, randomized linear
> algebra.
> > > > > > > >
> > > > > > > > * Christopher Musco: [christopher.musco at gmail dot com]
> Ph.D.
> > > > > > Computer
> > > > > > > > Science, Research Instructor, Princeton University.
> Interests:
> > > > > > algorithmic
> > > > > > > > foundations of data science and machine learning, efficient
> > > methods
> > > > > for
> > > > > > > > processing and understanding large datasets, often working at
> > the
> > > > > > > > intersection of theoretical computer science, numerical
> linear
> > > > > > algebra, and
> > > > > > > > optimization.
> > > > > > > >
> > > > > > > > * Graham Cormode: [g.cormode at warwick.ac dot uk] Ph.D.
> > > Computer
> > > > > > Science,
> > > > > > > > Professor, Warwick University, Warwick, England. Interests:
> all
> > > > > > aspects of
> > > > > > > > the "data lifecycle", from data collection and cleaning,
> > through
> > > > > > mining and
> > > > > > > > analytics. (Professor Cormode is one of the world’s leading
> > > > > scientists
> > > > > > in
> > > > > > > > sketching algorithms)
> > > > > > > >
> > > > > > > > === Alignment ===
> > > > > > > >
> > > > > > > > The DataSketches library already provides integrations and
> > > example
> > > > > > code for
> > > > > > > > Apache Hive, Apache Pig, Apache Spark and is deeply
> integrated
> > > into
> > > > > > Apache
> > > > > > > > Druid.
> > > > > > > >
> > > > > > > > == Known Risks ==
> > > > > > > >
> > > > > > > > The following subsections are specific risks that have been
> > > > > identified
> > > > > > by
> > > > > > > > the ASF that need to be addressed.
> > > > > > > >
> > > > > > > > === Risk: Orphaned Products ===
> > > > > > > >
> > > > > > > > The DataSketches library is presently used by a number of
> > > > > > organizations,
> > > > > > > > from small startups to Fortune 100 companies, to construct
> > > production
> > > > > > > > pipelines that must process and analyze massive data. Yahoo
> > has a
> > > > > > long-term
> > > > > > > > commitment to continue to advance the DataSketches library;
> > > moreover,
> > > > > > > > DataSketches is seeing increasing interest, development, and
> > > adoption
> > > > > > from
> > > > > > > > many diverse organizations from around the world. Due to its
> > > growing
> > > > > > > > adoption, we feel it is quite unlikely that this project
> would
> > > become
> > > > > > > > orphaned.
> > > > > > > >
> > > > > > > > === Risk: Inexperience with Open Source ===
> > > > > > > >
> > > > > > > > Yahoo believes strongly in open source and the exchange of
> > > > > information
> > > > > > to
> > > > > > > > advance new ideas and work. Examples of this commitment are
> > > active
> > > > > open
> > > > > > > > source projects such as those mentioned above. With
> > > DataSketches, we
> > > > > > have
> > > > > > > > been increasingly open and forward-looking; we have
> published a
> > > > > number
> > > > > > of
> > > > > > > > papers about breakthrough developments in the science of
> > > streaming
> > > > > > > > algorithms (mentioned above) that also reference the
> > DataSketches
> > > > > > library.
> > > > > > > > Our submission to the Apache Software Foundation is a logical
> > > > > > extension of
> > > > > > > > our commitment to open source software.
> > > > > > > >
> > > > > > > > Key committers at Yahoo with strong open source backgrounds
> > > include
> > > > > > Aaron
> > > > > > > > Gresch, Alan Carroll, Alessandro Bellina, Anastasia
> Braginsky,
> > > > > Andrews
> > > > > > > > Sahaya Albert, Arun S A G, Atul Mohan, Brad McMillen, Bryan
> > Call,
> > > > > Daryn
> > > > > > > > Sharp, Dav Glass, David Carlin, Derek Dagit, Eric Payne,
> Eshcar
> > > > > Hillel,
> > > > > > > > Ethan Li, Fei Deng, Francis Christopher Liu, Francisco
> > > > > Perez-Sorrosal,
> > > > > > Gil
> > > > > > > > > Yehuda, Govind Menon, Hang Yang, Jacob Estelle, Jai Asher,
> > James
> > > > > > Penick,
> > > > > > > > Jason Kenny, Jay Pipes, Jim Rollenhagen, Joe Francis, Jon
> > Eagles,
> > > > > > Kihwal
> > > > > > > > Lee, Kishorkumar Patil, Koji Noguchi, Kuhu Shukla, Michael
> > > Trelinski,
> > > > > > > > Mithun Radhakrishnan, Nathan Roberts, Ohad Shacham, Olga L.
> > > > > Natkovich,
> > > > > > > > Parth Kamlesh Gandhi, Rajan Dhabalia, Rohini Palaniswamy,
> Ruby
> > > Loo,
> > > > > > Ryan
> > > > > > > > Bridges, Sanket Chintapalli, Satish Subhashrao Saley, Shu Kit
> > > Chan,
> > > > > Sri
> > > > > > > > Harsha Mekala, Susan Hinrichs, Yonatan Gottesman, and many
> > more.
> > > > > > > >
> > > > > > > > > All of our core developers are committed to learning about the Apache
> > > > > > > > > process and to giving back to the community.
> > > > > > > >
> > > > > > > > === Risk: Homogeneous Developers ===
> > > > > > > >
> > > > > > > > The majority of committers in this proposal belong to Yahoo
> due
> > > to
> > > > > the
> > > > > > fact
> > > > > > > > that DataSketches has emerged from an internal Yahoo project.
> > > This
> > > > > > proposal
> > > > > > > > also includes developers and contributors from other
> companies,
> > > and
> > > > > > who are
> > > > > > > > actively involved with other Apache projects, such as Druid.
> > We
> > > > > > expect our
> > > > > > > > entry into incubation will allow us to expand the number of
> > > > > > individuals and
> > > > > > > > organizations participating in DataSketches development.
> > > > > > > >
> > > > > > > > === Risk: Reliance on Salaried Developers ===
> > > > > > > >
> > > > > > > > Because the DataSketches library originated within Yahoo, it
> > has
> > > been
> > > > > > > > developed primarily by salaried Yahoo developers and we
> expect
> > > that
> > > > > to
> > > > > > > > continue to be the case near term. However, since we placed
> > this
> > > > > > library
> > > > > > > > into open-source we have had a number of significant
> > > contributions
> > > > > from
> > > > > > > > engineers and scientists from outside of Yahoo. We expect our
> > > > > reliance
> > > > > > on
> > > > > > > > Yahoo salaried developers will decrease over time.
> Nonetheless,
> > > Yahoo
> > > > > > is
> > > > > > > > committed to continue its strong support of this important
> > > project.
> > > > > > > >
> > > > > > > > === Risk: Lack of Relationship to other Apache Products ===
> > > > > > > >
> > > > > > > > DataSketches already directly interoperates with or utilizes
> > > several
> > > > > > > > existing Apache projects.
> > > > > > > >
> > > > > > > > * Build
> > > > > > > >    * Apache Maven
> > > > > > > >
> > > > > > > > * Integrations and adaptors for the following projects
> > naturally
> > > have
> > > > > > them
> > > > > > > > as dependencies
> > > > > > > >    * Apache Hive
> > > > > > > >    * Apache Pig
> > > > > > > >    * Apache Druid
> > > > > > > >    * Apache Spark
> > > > > > > >
> > > > > > > > * Additional dependencies for the above integrations and
> > adaptors
> > > > > > include
> > > > > > > >    * Apache Hadoop
> > > > > > > >    * Apache Commons (Math)
> > > > > > > >
> > > > > > > > There is no other Apache project that we are aware of that
> > > duplicates
> > > > > > the
> > > > > > > > functionality of the DataSketches library.
> > > > > > > >
> > > > > > > > === Risk: An Excessive Fascination with the Apache Brand ===
> > > > > > > >
> > > > > > > > With this proposal we are not seeking attention or publicity.
> > > Rather,
> > > > > > we
> > > > > > > > firmly believe in the DataSketches library and concept and
> the
> > > > > ability
> > > > > > to
> > > > > > > > make the DataSketches library a powerful, yet simple-to-use
> > > toolkit
> > > > > for
> > > > > > > > data processing. While the DataSketches library has been open
> > > source,
> > > > > > we
> > > > > > > > believe putting code on GitHub can only go so far. We see the
> > > Apache
> > > > > > > > community, processes, and mission as critical for ensuring
> the
> > > > > > DataSketches
> > > > > > > > library is truly community-driven, positively impactful, and
> > > > > innovative
> > > > > > > > open source software. While Yahoo has taken a number of steps
> > to
> > > > > > advance
> > > > > > > > its various open source projects, we believe the DataSketches
> > > library
> > > > > > > > project is a great fit for the Apache Software Foundation due
> > to
> > > its
> > > > > > focus
> > > > > > > > on data processing and its relationships to existing ASF
> > > projects.
> > > > > > > >
> > > > > > > > === Risk: Cryptography ===
> > > > > > > >
> > > > > > > > DataSketches does not contain any cryptographic code and is
> > not a
> > > > > > > > cryptographic product.
> > > > > > > >
> > > > > > > > == Documentation ==
> > > > > > > >
> > > > > > > > The following documentation is relevant to this proposal.
> > > Relevant
> > > > > > portions
> > > > > > > > of the documentation will be contributed to the Apache
> > > DataSketches
> > > > > > > > project.
> > > > > > > >
> > > > > > > > * DataSketches website: https://datasketches.github.io.
> > > > > > > >
> > > > > > > > * DataSketches website repository:
> > > > > > > > https://github.com/DataSketches/DataSketches.github.io
> > > > > > > >
> > > > > > > > We will need an apache website for this documentation similar
> > to
> > > > > > > >
> > > > > > > > * https://datasketches.apache.org
> > > > > > > >
> > > > > > > > == Initial Source ==
> > > > > > > >
> > > > > > > > The initial source for DataSketches which we will submit to
> the
> > > > > Apache
> > > > > > > > Foundation will include a number of repositories which are
> > > currently
> > > > > > hosted
> > > > > > > > under the GitHub.com/datasketches organization:
> > > > > > > >
> > > > > > > > All github.com/datasketches repositories including:
> > > > > > > >
> > > > > > > > * Java
> > > > > > > >    * sketches-core: This repository has the core sketching
> > > classes,
> > > > > > which
> > > > > > > > are leveraged by some of the other repositories. This
> > repository
> > > has
> > > > > no
> > > > > > > > external dependencies outside of the DataSketches/memory
> > > repository,
> > > > > > Java
> > > > > > > > and TestNG for unit tests. This code is versioned and the
> > latest
> > > > > > release
> > > > > > > > can be obtained from Maven Central.
> > > > > > > >    * memory: Low level, high-performance memory
> data-structure
> > > > > > management
> > > > > > > > primarily for off-heap.
> > > > > > > >    * sketches-android: This is a new repository dedicated to
> > > sketches
> > > > > > > > designed to be run in a mobile client, such as a cell phone.
> It
> > > is
> > > > > > still in
> > > > > > > > development and should be considered experimental.
> > > > > > > >    * sketches-hive: This repository contains Hive UDFs and
> > UDAFs
> > > for
> > > > > > use
> > > > > > > > within Hadoop grid environments. This code has dependencies
> on
> > > > > > > > sketches-core as well as Hadoop and Hive. Users of this code
> > are
> > > > > > advised to
> > > > > > > > use Maven to bring in all the required dependencies. This
> code
> > is
> > > > > > versioned
> > > > > > > > and the latest release can be obtained from Maven Central.
> > > > > > > >    * sketches-pig: This repository contains Pig User Defined
> > > > > Functions
> > > > > > > > (UDF) for use within Hadoop grid environments. This code has
> > > > > > dependencies
> > > > > > > > on sketches-core as well as Hadoop and Pig. Users of this
> code
> > > are
> > > > > > advised
> > > > > > > > to use Maven to bring in all the required dependencies. This
> > > code is
> > > > > > > > versioned and the latest release can be obtained from Maven
> > > Central.
> > > > > > > >    * sketches-vector: This is a new repository dedicated to
> > > sketches
> > > > > > for
> > > > > > > > vector and matrix operations. It is still somewhat
> > experimental.
> > > > > > > >    * characterization: This relatively new repository is for
> > code
> > > > > that
> > > > > > we
> > > > > > > > use to characterize the accuracy and speed performance of the
> > > > > sketches
> > > > > > in
> > > > > > > > the library and is constantly being updated. Examples of the
> > job
> > > > > > command
> > > > > > > > files used for various tests can be found in the
> > > src/main/resources
> > > > > > > > > directory. Some of these tests can run for hours depending on their
> > > > > > > > > configuration.
> > > > > > > >    * experimental: This repository is an experimental staging
> > > area
> > > > > for
> > > > > > code
> > > > > > > > that will eventually end up in another repository. This code
> is
> > > not
> > > > > > > > versioned and not registered with Maven Central.
> > > > > > > >    * sketches-misc: Demos and other code not related to
> > > production
> > > > > > > > deployment
> > > > > > > >
> > > > > > > > * C++ and Python
> > > > > > > >    * sketches-core-cpp: This is the C++/Python companion to
> the
> > > Java
> > > > > > > > sketches-core. These implementations are binary compatible
> with
> > > their
> > > > > > > > counterparts in Java. In other words, a sketch created and
> > > stored in
> > > > > > C++
> > > > > > > > > can be opened and read in Java and vice versa. This site also
> > > has our
> > > > > > > > Python adaptors that basically wrap the C++ implementations,
> > > making
> > > > > the
> > > > > > > > high performance C++ implementations available from Python.
> > > > > > > >    * sketches-postgres: This site provides the
> > postgres-specific
> > > > > > adaptors
> > > > > > > > that wrap the C++ implementations making them available to
> the
> > > > > Postgres
> > > > > > > > database users.
> > > > > > > >    * characterization-cpp: This is the C++/Python companion
> to
> > > the
> > > > > Java
> > > > > > > > characterization repository.
> > > > > > > >    * experimental-cpp: This repository is an experimental
> > staging
> > > > > area
> > > > > > for
> > > > > > > > C++ code that will eventually end up in another repository.
> > > > > > > >
> > > > > > > > * Command-Line Tools
> > > > > > > >    * sketches-cmd
> > > > > > > >    * homebrew-sketches
> > > > > > > >    * homebrew-sketches-cmd
> > > > > > > >
> > > > > > > > These projects have always been Apache 2.0 licensed. We
> intend
> > to
> > > > > > bundle
> > > > > > > > all of these repositories since they are all complementary
> and
> > > should
> > > > > > be
> > > > > > > > maintained in one project. Prior to our submission, we will
> > > combine
> > > > > > all of
> > > > > > > > these projects into a new git repository.
> > > > > > > >
> > > > > > > > == Source and Intellectual Property Submission Plan ==
> > > > > > > >
> > > > > > > > > Contributors to the DataSketches project have also signed the Yahoo
> > > > > > > > > Individual Contributor License Agreement (https://yahoocla.herokuapp.com/)
> > > > > > > > > in order to contribute to the project.
> > > > > > > >
> > > > > > > > With respect to trademark rights, Yahoo does not hold a
> > > trademark on
> > > > > > the
> > > > > > > > phrase “DataSketches.” Based on feedback and guidance we
> > receive
> > > > > > during the
> > > > > > > > incubation process, we are open to renaming the project if
> > > necessary
> > > > > > for
> > > > > > > > trademark or other concerns, but we would prefer not to have
> to
> > > do
> > > > > > that.
> > > > > > > >
> > > > > > > > == External Dependencies ==
> > > > > > > >
> > > > > > > > All external dependencies are licensed under an Apache 2.0 or
> > > > > > > > > Apache-compatible license. As we grow the DataSketches community we will
> > > > > > > > > configure our build process to require and validate that all contributions
> > > > > > > > > and dependencies are licensed under the Apache 2.0 license or an
> > > > > > > > > Apache-compatible license.
> > > > > > > >
> > > > > > > > == Required Resources ==
> > > > > > > >
> > > > > > > > === Mailing Lists ===
> > > > > > > >
> > > > > > > > We currently use a mix of mailing lists. We will migrate our
> > > existing
> > > > > > > > mailing lists to the following:
> > > > > > > >
> > > > > > > > * dev@datasketches.incubator.apache.org
> > > > > > > >
> > > > > > > > * user@datasketches.incubator.apache.org
> > > > > > > >
> > > > > > > > * private@datasketches.incubator.apache.org
> > > > > > > >
> > > > > > > > * commits@datasketches.incubator.apache.org
> > > > > > > >
> > > > > > > > === Source Control ===
> > > > > > > >
> > > > > > > > The DataSketches team currently uses Git and would like to
> > > continue
> > > > > to
> > > > > > do
> > > > > > > > > so. We request a Git repository for DataSketches with mirroring to
> > > > > > > > > GitHub enabled, similar to the following:
> > > > > > > >
> > > > > > > > * https://github.com/apache/incubator-datasketches.git
> > > > > > > >
> > > > > > > > === Issue Tracking ===
> > > > > > > >
> > > > > > > > We request the creation of an Apache-hosted JIRA. The
> > > DataSketches
> > > > > > project
> > > > > > > > is currently using the public GitHub issue tracker and the
> > public
> > > > > > Google
> > > > > > > > Groups forum/sketches-user for issue tracking and
> discussions.
> > We
> > > > > will
> > > > > > > > migrate and combine from these two sources to the Apache
> JIRA.
> > > > > > > >
> > > > > > > > Proposed Jira ID: DATASKETCHES
> > > > > > > >
> > > > > > > > == Initial Committers ==
> > > > > > > >
> > > > > > > > The following list of individuals have been extremely active
> in
> > > our
> > > > > > > > community and should have write (commit) permissions to the
> > > > > repository.
> > > > > > > >
> > > > > > > > * Eshcar Hillel                      [eshcar at verizonmedia
> > dot
> > > com]
> > > > > > > >
> > > > > > > > * Kevin Lang                    [langk at verizonmedia dot
> com]
> > > > > > > >
> > > > > > > > * Roman Leventov              [roman.leventov at
> c.metamarkets
> > > dot
> > > > > com]
> > > > > > > >
> > > > > > > > * Edo Liberty                   [libertye at amazon dot com]
> > > > > > > >
> > > > > > > > * Jon Malkin                    [jmalkin at verizonmedia dot
> > com]
> > > > > > > >
> > > > > > > > * Lee Rhodes                  [lrhodes at verizonmedia dot
> > com] &
> > > > > > [leerho
> > > > > > > > at gmail dot com]
> > > > > > > >
> > > > > > > > * Alexander Saydakov         [saydakov at verizonmedia dot
> com]
> > > > > > > >
> > > > > > > > * Justin Thaler                 [justin.thaler at georgetown
> > dot
> > > edu]
> > > > > > > >
> > > > > > > > == Affiliations ==
> > > > > > > >
> > > > > > > > The initial committers are from four organizations: Yahoo,
> > > Amazon,
> > > > > > > > Georgetown University, and Metamarkets/Snap.
> > > > > > > >
> > > > > > > > === Champion ===
> > > > > > > > (Recommended to me: )
> > > > > > > >
> > > > > > > > Liang Chen, Vice President of Apache CarbonData,
> [chenliang613
> > at
> > > > > > apache
> > > > > > > > dot org]
> > > > > > > > > Jean-Baptiste Onofré, [jb at nanthrax dot net]
> > > > > > > >
> > > > > > > > === Nominated Mentors ===
> > > > > > > > (Recommended to me: )
> > > > > > > >
> > > > > > > > Liang Chen, Vice President of Apache CarbonData,
> [chenliang613
> > at
> > > > > > apache
> > > > > > > > dot org]
> > > > > > > > Jean-Baptiste Onofré, jb at nanthrax dot net
> > > > > > > > Gil Yehuda, gyehuda at verizonmedia dot com
> > > > > > > >
> > > > > > > > === Sponsoring Entity ===
> > > > > > > >
> > > > > > > > * The Apache Incubator    **** This is our 1st choice ****
> > > > > > > >
> > > > > > > > * Apache Druid. The incubating Apache Druid project might
> also
> > > be a
> > > > > > logical
> > > > > > > > sponsor. However, DataSketches has applications in many areas
> > of
> > > > > > computing
> > > > > > > > outside of Druid so our preference and recommendation is that
> > > > > > DataSketches
> > > > > > > > would ultimately be a top-level Apache project.
> > > > > > > >
> > > > > > > > ________________
> > > > > > > > [1] In 2017 Verizon acquired Yahoo and merged it with
> > previously
> > > > > > acquired
> > > > > > > > AOL. The merged entity was originally called Oath, Inc., but
> > has
> > > > > > recently
> > > > > > > > been renamed Verizon Media, Inc., a wholly-owned subsidiary
> of
> > > > > Verizon,
> > > > > > > > Inc.  Since Yahoo is the more recognized name, references in
> > this
> > > > > > document
> > > > > > > > > to Yahoo are also references to Verizon Media, Inc.
> > > > > > > >
> > > > > > > > On Fri, Feb 22, 2019 at 9:35 PM Kenneth Knowles <
> > kenn@apache.org
> > > >
> > > > > > wrote:
> > > > > > > >
> > > > > > > > > The subject line has me interested already. Follow examples
> > > like
> > > > > this
> > > > > > > > > maybe?
> > > > > > > > >
> > > > > > > > > 1.
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > >
> > > > >
> > >
> >
> https://lists.apache.org/thread.html/a5db74cc9e5ae89b3bfa5f4b07bfcc18dae84b7098232fb897cd47b7@%3Cgeneral.incubator.apache.org%3E
> > > > > > > > > 2.
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > >
> > > > >
> > >
> >
> https://lists.apache.org/thread.html/5a7f6a218b11a1cac61fbd53f4c995fd7716f8ad3751cf9f171ebd57@%3Cgeneral.incubator.apache.org%3E
> > > > > > > > >
> > > > > > > > > Kenn
> > > > > > > > >
> > > > > > > > > On Fri, Feb 22, 2019 at 8:05 PM leerho <le...@gmail.com>
> > > wrote:
> > > > > > > > >
> > > > > > > > > > I'll try again ... :)
> > > > > > > > > >
> > > > > > > > > > On Fri, Feb 22, 2019 at 8:00 PM Ted Dunning <
> > > > > ted.dunning@gmail.com
> > > > > > >
> > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > >> It didn't make it again
> > > > > > > > > >>
> > > > > > > > > >> On Fri, Feb 22, 2019, 8:35 PM leerho <le...@gmail.com>
> > > wrote:
> > > > > > > > > >>
> > > > > > > > > >> > I'm not sure the attached document made it through.
> > > > > > > > > >> >
> > > > > > > > > >> > On Fri, Feb 22, 2019 at 7:28 PM leerho <
> > leerho@gmail.com>
> > > > > > wrote:
> > > > > > > > > >> >
> > > > > > > > > >> > >
> > > > > > > > > >> > >
> > > > > > > > > >> >
> > > > > > > > > >>
> > > > > > > > > >
> > > > > > > > > >
> > > > > >
> > ---------------------------------------------------------------------
> > > > > > > > > > To unsubscribe, e-mail:
> > > general-unsubscribe@incubator.apache.org
> > > > > > > > > > For additional commands, e-mail:
> > > > > general-help@incubator.apache.org
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > > --
> > > > From my cell phone.
> > > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> > > For additional commands, e-mail: general-help@incubator.apache.org
> > >
> > >
> >
>

Re: DataSketches Proposal - Google Docs Link

Posted by leerho <le...@gmail.com>.

Kenn,
Yahoo does not allow me to create a shared link outside our company, except
to individual email addresses.  So attempting to share it to the email
general@incubator.apache.org may not work.  Nonetheless, several
individuals were able to request access using their individual email
accounts and I was able to add them.  I will try to add you using
kenn@apache.org, but if that doesn't work, I may need a gmail or equivalent
account for you.

Lee.


On Mon, Feb 25, 2019 at 2:59 PM Kenneth Knowles <ke...@apache.org> wrote:

> I could not access that document. I suggest you need to turn on link
> sharing.
>
> Kenn
>
> On Mon, Feb 25, 2019 at 12:00 PM leerho@gmail.com <le...@gmail.com>
> wrote:
>
> > Try this link:
> >
> https://docs.google.com/document/d/19JKevzFQNcaLA51LFLUlP1hzdFDW7oDJrJO8N6weDv8/edit?usp=sharing
> >
> >
> > On 2019/02/25 05:55:50, leerho <le...@gmail.com> wrote:
> > > Yes I will try that tomorrow.
> > >
> > > On Sun, Feb 24, 2019 at 7:34 PM Kenneth Knowles <ke...@apache.org>
> wrote:
> > >
> > > > Can you share the Google doc with the proposal? Per Ted's advice, we
> > can
> > > > iterate quickly there and move it to the wiki when it becomes a bit
> > more
> > > > stable.
> > > >
> > > > Kenn
> > > >
> > > > On Fri, Feb 22, 2019 at 10:21 PM leerho@gmail.com <le...@gmail.com>
> > > > wrote:
> > > >
> > > > > Thanks for the offer.  i am a neophyte at this process and email
> > app!   I
> > > > > could use a lot of help getting this off the ground!  Also, I'm not
> > sure
> > > > > that Mr. Chen and Mr. Onofré have fully accepted taking this on :)
> > > > >
> > > > > Lee.
> > > > >
> > > > > On 2019/02/23 06:03:58, Kenneth Knowles <ke...@apache.org> wrote:
> > > > > > Nice.
> > > > > >
> > > > > > I would very much like to help mentor this project, though you
> > already
> > > > > have
> > > > > > a couple good ones.
> > > > > >
> > > > > > I concur with incubator as sponsoring entity.
> > > > > >
> > > > > > Kenn (VP Apache Beam)
> > > > > >
> > > > > > On Fri, Feb 22, 2019 at 9:45 PM leerho <le...@gmail.com> wrote:
> > > > > >
> > > > > > > I didn't realize that this mail list does not accept PDF files,
> > > > > apparently
> > > > > > > only text.  So let me try one more time ... :)  Please let me
> > know if
> > > > > > > this works!
> > > > > > >
> > > > > > >
> > > > > > > = Apache DataSketches Proposal[1] =
> > > > > > >
> > > > > > > == Abstract ==
> > > > > > >
> > > > > > > DataSketches.GitHub.io is an open source, high-performance
> > library
> > > > of
> > > > > > > stochastic streaming algorithms commonly called "sketches" in
> the
> > > > data
> > > > > > > sciences. Sketches are small, stateful programs that process
> > massive
> > > > > data
> > > > > > > as a stream and can provide approximate answers, with
> > mathematical
> > > > > > > guarantees, to computationally difficult queries
> > orders-of-magnitude
> > > > > faster
> > > > > > > than traditional, exact methods.
> > > > > > >
> > > > > > > This proposal is to move DataSketches to the Apache Software
> > > > > > > Foundation(ASF) transferring ownership of its copyright
> > intellectual
> > > > > > > property to the ASF.  Thereafter, DataSketches would be
> > officially
> > > > > known as
> > > > > > > Apache DataSketches and its evolution and governance would come
> > under
> > > > > the
> > > > > > > rules and guidance of the ASF.
> > > > > > >
> > > > > > > == Introduction ==
> > > > > > >
> > > > > > > The DataSketches library contains carefully crafted
> > implementations
> > > > of
> > > > > > > sketch algorithms that meet rigorous standards of quality and
> > > > > performance
> > > > > > > and provide capabilities required for large-scale production
> > systems
> > > > > that
> > > > > > > must process and analyze massive data. The DataSketches core
> > > > > repository is
> > > > > > > written in Java with a parallel core repository written in C++
> > that
> > > > > > > includes Python wrappers. The DataSketches library also
> includes
> > > > > special
> > > > > > > repositories for extending the core library for Apache Hive and
> > > > Apache
> > > > > Pig.
> > > > > > > The sketches developed in the different languages share a
> common
> > > > binary
> > > > > > > storage format so that sketches created and stored in Java, for
> > > > > example,
> > > > > > > > can be fully used in C++, and vice versa.  Because the stored
> > sketch
> > > > > > > "images" are just a "blob" of bytes (similar to picture
> images),
> > they
> > > > > can
> > > > > > > be shared across many different systems, languages and
> platforms.
> > > > > > >
> > > > > > > The DataSketches documentation website,
> > > > https://datasketches.github.io
> > > > > ,
> > > > > > > includes general tutorials, a comprehensive research section
> with
> > > > > > > references to relevant academic papers, extensive examples for
> > using
> > > > > the
> > > > > > > core library directly as well as examples for accessing the
> > library
> > > > in
> > > > > > > Hive, Pig, and Apache Spark.
> > > > > > >
> > > > > > > The DataSketches library also includes a characterization
> > repository
> > > > > for
> > > > > > > long running test programs that are used for studying accuracy
> > and
> > > > > > > performance of these sketches over wide ranges of input
> > variables.
> > > > The
> > > > > data
> > > > > > > produced by these programs is used for generating the many
> > > > performance
> > > > > > > plots contained in the documentation website and for academic
> > > > > > > publications.
> > > > > > >
> > > > > > > The code repositories used for production are versioned and
> > published
> > > > > to
> > > > > > > Maven Central on periodic intervals as the library evolves.
> > > > > > >
> > > > > > > The DataSketches library also includes several experimental
> > > > > repositories
> > > > > > > for use-cases outside the large-scale systems environments,
> such
> > as
> > > > > > > sketches for mobile, IoT devices (Android), command-line access
> > of
> > > > the
> > > > > > > sketch library, and an experimental repository for vector-based
> > > > > sketches
> > > > > > > that performs approximate Singular Value Decomposition (SVD)
> > analysis
> > > > > that
> > > > > > > could potentially be used in Machine Learning (ML)
> applications.
> > > > > > >
> > > > > > > == Background ==
> > > > > > >
> > > > > > > > The DataSketches library was started in 2012 as an internal Yahoo
> > > > > > > > project to
> > > > > > > dramatically reduce time and resources required for distinct
> > (unique)
> > > > > > > counting.  An extensive search on the Internet at the time
> > yielded a
> > > > > number
> > > > > > > of theoretical papers on stochastic streaming algorithms with
> > > > > pseudocode
> > > > > > > examples, but we did not find any usable open-source code of
> the
> > > > > quality we
> > > > > > > felt we needed for our internal production systems.  So we
> > started a
> > > > > small
> > > > > > > project (one person) to develop our own sketches working
> directly
> > > > from
> > > > > > > published theoretical papers.
> > > > > > >
> > > > > > > The DataSketches library was designed from the start with the
> > > > > objective of
> > > > > > > making these algorithms, usually only described in theoretical
> > > > papers,
> > > > > > > easily accessible to systems developers for use in our internal
> > > > > production
> > > > > > > systems. By necessity, the code had to be of the highest
> quality
> > and
> > > > > > > thoroughly tested. The wide variety of our internal production
> > > > systems
> > > > > > > drove the requirement that the sketch implementations had to
> > have an
> > > > > > > absolute minimum of external, run-time dependencies in order to
> > > > > simplify
> > > > > > > integration and troubleshooting.
> > > > > > >
> > > > > > > Our internal experiments demonstrated dramatic positive impact
> > on the
> > > > > > > performance of our systems.  As a result, the DataSketches
> > library
> > > > > quickly
> > > > > > > evolved to include different types of sketches for different
> > types of
> > > > > > > > queries, such as frequent-items (a.k.a. heavy-hitters)
> > algorithms,
> > > > > > > quantile/histogram algorithms, and weighted and unweighted
> > sampling
> > > > > > > algorithms.
> > > > > > >
> > > > > > > We quickly discovered that developing these sketch algorithms
> to
> > be
> > > > > truly
> > > > > > > robust in production environments is quite difficult and
> requires
> > > > deep
> > > > > > > understanding of the underlying mathematics and statistics as
> > well as
> > > > > > > extensive experience in developing high quality code for 24/7
> > > > > production
> > > > > > > systems. This is a difficult combination of skills for any one
> > > > > organization
> > > > > > > to collect and maintain over time. It became clear that this
> > > > technology
> > > > > > > needed a community larger than Yahoo to evolve.  In November,
> > 2015,
> > > > > this
> > > > > > > factor, along with Yahoo’s strong experience and support of
> open
> > > > > source,
> > > > > > > led to the decision to open source this technology under an
> > Apache
> > > > 2.0
> > > > > > > license on GitHub. Since that time our community has expanded
> > > > > considerably
> > > > > > > > and the key contributors to this effort include leading
> research
> > > > > > > scientists from a number of universities as well as
> > practitioners and
> > > > > > > researchers from a number of major corporations. The core of
> this
> > > > > group is
> > > > > > > very active as we meet weekly to discuss research directions
> and
> > > > > > > engineering priorities.
> > > > > > >
> > > > > > > It is important to note that our internal systems at Yahoo use
> > the
> > > > > current
> > > > > > > public GitHub open source DataSketches library and not an
> > internal
> > > > > version
> > > > > > > of the code.
> > > > > > >
> > > > > > > The close collaboration of scientific research and engineering
> > > > > development
> > > > > > > experience with actual massive-data processing systems has also
> > > > > produced
> > > > > > > new research publications in the field of stochastic streaming
> > > > > algorithms,
> > > > > > > for example:
> > > > > > >
> > > > > > > * Daniel Anderson, Pryce Bevan, Kevin J. Lang, Edo Liberty, Lee
> > > > > Rhodes, and
> > > > > > > Justin Thaler. A high-performance algorithm for identifying
> > frequent
> > > > > items
> > > > > > > in data streams. In ACM IMC 2017.
> > > > > > >
> > > > > > > * Anirban Dasgupta, Kevin J. Lang, Lee Rhodes, and Justin
> > Thaler. A
> > > > > > > framework for estimating stream expression cardinalities. In
> > > > > > > > EDBT/ICDT Proceedings ‘16, pages 6:1–6:17, 2016.
> > > > > > >
> > > > > > > * Mina Ghashami, Edo Liberty, Jeff M. Phillips. Efficient
> > Frequent
> > > > > > > Directions Algorithm for Sparse Matrices. In ACM SIGKDD
> > Proceedings
> > > > > ‘16,
> > > > > > > > pages 845–854, 2016.
> > > > > > >
> > > > > > > * Zohar S. Karnin, Kevin J. Lang, and Edo Liberty. Optimal
> > quantile
> > > > > > > approximation in streams. In IEEE FOCS Proceedings ‘16, pages
> > 71–78,
> > > > > 2016.
> > > > > > >
> > > > > > > * Kevin J Lang. Back to the future: an even more nearly optimal
> > > > > cardinality
> > > > > > > estimation algorithm. arXiv preprint
> > > > https://arxiv.org/abs/1708.06839,
> > > > > > > 2017.
> > > > > > >
> > > > > > > * Edo Liberty. Simple and deterministic matrix sketching. In
> ACM
> > KDD
> > > > > > > > Proceedings ‘13, pages 581–588, 2013.
> > > > > > >
> > > > > > > * Edo Liberty, Michael Mitzenmacher, Justin Thaler, and
> Jonathan
> > > > > Ullman.
> > > > > > > Space lower bounds for itemset frequency sketches. In ACM PODS
> > > > > Proceedings
> > > > > > > ‘16, pages 441–454, 2016.
> > > > > > >
> > > > > > > * Michael Mitzenmacher, Thomas Steinke, and Justin Thaler.
> > > > Hierarchical
> > > > > > > heavy hitters with the space saving algorithm. In SIAM ALENEX
> > > > > Proceedings
> > > > > > > ‘12, pages 160–174, 2012.
> > > > > > >
> > > > > > > == The Rationale for Sketches ==
> > > > > > >
> > > > > > > In the analysis of big data there are often problem queries
> that
> > > > don’t
> > > > > > > scale because they require huge compute resources and time to
> > > > generate
> > > > > > > exact results. Examples include count distinct, quantiles, most
> > > > > frequent
> > > > > > > items, joins, matrix computations, and graph analysis.
> > > > > > >
> > > > > > > If we can loosen the requirement of “exact” results from our
> > queries
> > > > > and be
> > > > > > > satisfied with approximate results, within some well understood
> > > > bounds
> > > > > of
> > > > > > > error, there is an entire branch of mathematics and data
> science
> > that
> > > > > has
> > > > > > > evolved around developing algorithms that can produce
> approximate
> > > > > results
> > > > > > > with mathematically well-defined error properties.
> > > > > > >
> > > > > > > > Adding the requirements that these algorithms must be
> > small
> > > > > > > (compared to the size of the input data), sublinear (the size
> of
> > the
> > > > > sketch
> > > > > > > must grow at a slower rate than the size of the input stream),
> > > > > streaming
> > > > > > > (they can only touch each data item once), and mergeable
> > (suitable
> > > > for
> > > > > > > distributed processing), defines a class of algorithms that can
> > be
> > > > > > > described as small, stochastic, streaming, sublinear mergeable
> > > > > algorithms,
> > > > > > > commonly called sketches (they also have other names, but we
> > will use
> > > > > the
> > > > > > > term sketches from here on).
> > > > > > >
> > > > > > > To be truly streaming and be able to process data in a single
> > pass,
> > > > > > > sketches must make absolute minimum assumptions about the input
> > > > stream.
> > > > > > > This is critically important, as there is no “second chance” to
> > > > > process the
> > > > > > > data.
> > > > > > >
> > > > > > > For example, sketches should not make assumptions about the
> > order of
> > > > > stream
> > > > > > > items, the stream length, the dynamic range of values, or the
> > > > > distribution
> > > > > > > of item occurrence frequencies. Sketches should be tolerant of
> > NaNs,
> > > > > Nulls
> > > > > > > and empty objects. About the only thing that the sketch needs
> to
> > know
> > > > > about
> > > > > > > the stream is how to extract items from it and what type the
> > item is,
> > > > > e.g.,
> > > > > > > is it a numeric value or a string.
> > > > > > >
> > > > > > > As far as the sketch is concerned, the input stream is a
> > sequence of
> > > > > items
> > > > > > > in some unknown random order with unknown random values.
> > > > > > >
> > > > > > > The sketch is essentially a complex state machine and combined
> > with
> > > > the
> > > > > > > random input stream defines a stochastic process. We then apply
> > > > > > > probabilistic methods to interpret the states of the stochastic
> > > > > process in
> > > > > > > order to extract useful information about the input stream
> > itself.
> > > > The
> > > > > > > resulting information will be approximate, but we also use
> > additional
> > > > > > > probabilistic methods to extract an estimate of the likely
> > > > probability
> > > > > > > distribution of error.
> > > > > > >
> > > > > > > There is a significant scientific contribution here that is
> > defining
> > > > > the
> > > > > > > state machine, understanding the resulting stochastic process,
> > > > > developing
> > > > > > > the probabilistic methods, and proving mathematically, that it
> > all
> > > > > works!
> > > > > > > This is why the scientific contributors to this project are a
> > > > critical
> > > > > and
> > > > > > > strategic component to our success.  The development engineers
> > > > > translate
> > > > > > > the concepts of the proposed state machine and probabilistic
> > methods
> > > > > into
> > > > > > > production-quality code. Even more important, they work closely
> > with
> > > > > the
> > > > > > > scientists, feeding back system and user requirements, which
> > leads
> > > > not
> > > > > only
> > > > > > > to superior product design, but to new science as well.  A
> > number of
> > > > > > > > scientific papers our members have published (see above) are a
> > direct
> > > > > result
> > > > > > > of this close collaboration.
> > > > > > >
> > > > > > > Because sketches are small they can be processed extremely
> fast,
> > > > often
> > > > > many
> > > > > > > orders-of-magnitude faster than traditional exact computations.
> > For
> > > > > > > interactive queries there may not be other viable alternatives,
> > and
> > > > in
> > > > > the
> > > > > > > case of real-time analysis, sketches are the only known
> solution.
> > > > > > >
> > > > > > > For any system that needs to extract useful information from
> > massive
> > > > > > data,
> > > > > > > sketches are essential tools that should be tightly integrated
> > into
> > > > the
> > > > > > > system’s analysis capabilities. This technology has helped
> Yahoo
> > > > > > > successfully reduce data processing times from days to hours or
> > > > > minutes on
> > > > > > > a number of its internal platforms and has enabled subsecond
> > queries
> > > > on
> > > > > > > real-time platforms that would have been infeasible without
> > sketches.
> > > > > > > >
> > > > > > > > == The Rationale for Apache DataSketches ==
> > > > > > > >
> > > > > > > Other open source implementations of sketch algorithms can be
> > found
> > > > on
> > > > > the
> > > > > > > Internet. However, we have not yet found any open source
> > > > > implementations
> > > > > > > that are as comprehensive, engineered with the quality required
> > for
> > > > > > > production systems, and with usable and guaranteed error
> > properties.
> > > > > Large
> > > > > > > Internet companies, such as Google and Facebook, have published
> > > > papers
> > > > > on
> > > > > > > > sketching; however, their implementations of their published
> > > > > algorithms are
> > > > > > > proprietary and not available as open source.
> > > > > > >
> > > > > > > The DataSketches library already provides integrations with a
> > number
> > > > of
> > > > > > > major Apache data processing platforms such as Apache Hive,
> > Apache
> > > > Pig,
> > > > > > > Apache Spark and Apache Druid, and is also integrated with a
> > number
> > > > of
> > > > > > > other open source data processing platforms such as Splice
> > Machine,
> > > > > GCHQ
> > > > > > > Gaffer and PostgreSQL.
> > > > > > >
> > > > > > > We believe that having DataSketches as an Apache project will
> > provide
> > > > > an
> > > > > > > immediate, worthwhile, and substantial contribution to the open
> > > > source
> > > > > > > community, will have a better opportunity to provide a
> meaningful
> > > > > > > contribution to both the science and engineering of sketching
> > > > > algorithms,
> > > > > > > and integrate with other Apache projects.  In addition, this
> is a
> > > > > > > significant opportunity for Apache to be the "go-to"
> destination
> > for
> > > > > users
> > > > > > > that want to leverage this exciting technology.
> > > > > > >
> > > > > > > == Initial Goals ==
> > > > > > >
> > > > > > > We are breaking our initial goals into short-term (2-6 months)
> > and
> > > > > > > > intermediate to long-term (6 months to 2 years):
> > > > > > >
> > > > > > > Our short-term goals include:
> > > > > > >
> > > > > > > * Understanding and adapting to the Apache development process
> > and
> > > > > > > structures.
> > > > > > >
> > > > > > > > * Start refactoring the codebase and moving the code from the
> > > > > > > > various DataSketches repositories to an Apache Git repository.
> > > > > > >
> > > > > > > * Continue development of new features, functions, and fixes.
> > > > > > >
> > > > > > > * Specific sub-projects (e.g., C++ and Python) will continue to
> > be
> > > > > > > developed and expanded.
> > > > > > >
> > > > > > >
> > > > > > > The intermediate to long term goals include:
> > > > > > >
> > > > > > > * Completing the design and implementation of the C++ sketches
> to
> > > > > > > complement what is already available in Java, and the Python
> > wrappers
> > > > > of
> > > > > > > those C++ sketches.
> > > > > > >
> > > > > > > * Expanding the C++ build framework to include Windows and the
> > > > popular
> > > > > > > Linux variants.
> > > > > > >
> > > > > > > * Continued engagement with the scientific research community
> on
> > the
> > > > > > > development of new algorithms for computationally difficult
> > problems
> > > > > that
> > > > > > > heretofore have not had a sketching solution.
> > > > > > >
> > > > > > > == Current Status ==
> > > > > > >
> > > > > > > The DataSketches GitHub project has been quite successful.  As
> of
> > > > this
> > > > > > > > writing (February 2019) the number of downloads measured by the
> Nexus
> > > > > > > Repository Manager at https://oss.sonatype.org has grown by
> > nearly a
> > > > > > > factor
> > > > > > > of 10 over the past year to about 55 thousand per month. The
> > > > > > > DataSketches/sketches-core repository has about 560 stars and
> 141
> > > > > forks,
> > > > > > > which is pretty good for a highly specialized library.
> > > > > > >
> > > > > > > === Development Practices ===
> > > > > > >
> > > > > > > ==== Source Control ====
> > > > > > >
> > > > > > > All of our developers have extensive experience with Git
> version
> > > > > control
> > > > > > > and follow accepted practices for use of Pull Requests (PRs),
> > code
> > > > > reviews
> > > > > > > and commits to master, for example.
> > > > > > >
> > > > > > > ==== Testing ====
> > > > > > >
> > > > > > > > Sketches, by their nature, are probabilistic programs and don’t
> > > > > necessarily
> > > > > > > behave deterministically.  For some of the sketches we
> > intentionally
> > > > > insert
> > > > > > > random noise into the code as this gives us the mathematical
> > > > properties
> > > > > > > that we need to guarantee accuracy.  This can make the behavior
> > of
> > > > > these
> > > > > > > algorithms quite unintuitive and provides significant
> challenges
> > to
> > > > the
> > > > > > > developer who wishes to test these algorithms for correctness.
> > As a
> > > > > result,
> > > > > > > our testing strategy includes two major components: unit tests,
> > and
> > > > > > > characterization tests.
> > > > > > >
> > > > > > > ===== Unit Testing =====
> > > > > > >
> > > > > > > Our unit tests are primarily quick tests to make sure that we
> > > > exercise
> > > > > all
> > > > > > > critical paths in the code and that key branches are executed
> > > > > correctly. It
> > > > > > > is important that they execute relatively fast as they are
> > generally
> > > > > run on
> > > > > > > every code build. The sketches-core repository alone has about
> 22
> > > > > thousand
> > > > > > > statements, over 1300 unit tests and code coverage of about
> > 98.2% as
> > > > > > > measured by Atlassian/Clover.  It is our goal for all of our
> code
> > > > > > > repositories that are used in production that they have code
> > coverage
> > > > > > > greater than 90%.
> > > > > > >
> > > > > > > ===== Characterization Testing =====
> > > > > > >
> > > > > > > In order to test the probabilistic methods that are used to
> > interpret
> > > > > the
> > > > > > > stochastic behaviors of our sketches we have a separate
> > > > > characterization
> > > > > > > repository that is dedicated to this.  To measure accuracy, for
> > > > > example,
> > > > > > > requires running thousands of trials at each of many different
> > points
> > > > > along
> > > > > > > the domain axis. Each trial compares its estimated results
> > against a
> > > > > known
> > > > > > > exact result producing an error for that trial.  These error
> > > > > measurements
> > > > > > > are then fed into our Quantiles sketch to capture the actual
> > > > > distribution
> > > > > > > of error at that point along the axis. We then select quantile
> > > > contours
> > > > > > > across all the distributions at points along the axis.  These
> > > > contours
> > > > > can
> > > > > > > then be plotted to reveal the shape of the actual error
> > distribution.
> > > > > These
> > > > > > > > distributions are not at all Gaussian; in fact, they can be
> quite
> > > > > complex.
> > > > > > > Nonetheless, these distributions are then checked against our
> > > > > statistical
> > > > > > > guarantees inherent to the specific sketch algorithm and its
> > > > > parameters.
> > > > > > > There are many examples of these characterization error
> > distributions
> > > > > on
> > > > > > > our website. The runtimes of these tests can be very long and
> can
> > > > range
> > > > > > > from many minutes to hours, and some can run for days.
> > Currently, we
> > > > > have
> > > > > > > separate characterization repositories for Java and C++ /
> Python.
> > > > > > >
> > > > > > > It is our goal that we perform this characterization analysis
> > for all
> > > > > of
> > > > > > > our sketches.  By definition, the code that runs these
> > > > characterization
> > > > > > > tests is open-source so others can run these tests as well.  We
> > do
> > > > not
> > > > > have
> > > > > > > formal releases of this code (because it is not production
> code)
> > and
> > > > > it is
> > > > > > > not published to Maven Central.
> > > > > > >
> > > > > > > === Meritocracy ===
> > > > > > >
> > > > > > > DataSketches was initially developed based on requirements
> within
> > > > > Yahoo. As
> > > > > > > a project on GitHub, DataSketches has received contributions
> from
> > > > > numerous
> > > > > > > individual developers from around the world, dedicated research
> > work
> > > > > from
> > > > > > > senior scientists at Amazon and Visa, and academic researchers
> > from
> > > > > > > Georgetown University, Princeton, and MIT.
> > > > > > >
> > > > > > > As a project under incubation, we are committed to expanding
> our
> > > > > effort to
> > > > > > > build an environment which supports a meritocracy. We are
> > focused on
> > > > > > > engaging the community and other related projects for support
> and
> > > > > > > > contributions. Moreover, we are committed to ensuring that
> contributors
> > and
> > > > > > > committers to DataSketches come from a broad mix of
> organizations
> > > > > through a
> > > > > > > merit-based decision process during incubation. We believe
> > strongly
> > > > in
> > > > > the
> > > > > > > DataSketches premise that fulfills the concept of a well
> > engineered
> > > > and
> > > > > > > scientifically rigorous library that implements these powerful
> > > > > algorithms
> > > > > > > and are committed to growing an inclusive community of
> > DataSketches
> > > > > > > contributors and users.
> > > > > > >
> > > > > > > === Community ===
> > > > > > >
> > > > > > > Yahoo has a long history and active engagement in the Open
> Source
> > > > > > > community. Major projects include: Vespa.ai, Bullet, Moloch,
> > > > Panoptes,
> > > > > > > Screwdriver.cd, Athenz, HaloDB, Maha, Mendel,
> TensorFlowOnSpark,
> > > > > gifshot,
> > > > > > > fluxible, as well as the creation, contribution and incubation
> of
> > > > many
> > > > > > > Apache projects such as Apache Hadoop, Pig, Bookkeeper, Oozie,
> > > > > Zookeeper,
> > > > > > > Omid, Pulsar, Traffic Server, Storm, Druid, and many more.
> > > > > > >
> > > > > > > > Every day, DataSketches is actively used by organizations and
> > > > > > > institutions around the world for batch and stream processing
> of
> > > > data.
> > > > > We
> > > > > > > believe acceptance will allow us to consolidate existing
> > > > > > > DataSketches-related work, grow the DataSketches community, and
> > > > deepen
> > > > > > > connections between DataSketches and other open source
> projects.
> > > > > > >
> > > > > > > === Introduction to the Core Developers & Contributors ===
> > > > > > >
> > > > > > > The core developers and contributors for DataSketches are from
> > > > diverse
> > > > > > > backgrounds, but primarily are scientists that love engineering
> > and
> > > > > > > engineers that love science. A large part of the value we bring
> > comes
> > > > > from
> > > > > > > this synthesis.  These individuals have already contributed
> > > > > substantially
> > > > > > > to the code, algorithms, and/or mathematical proofs that form
> the
> > > > > basis of
> > > > > > > the library.
> > > > > > >
> > > > > > > > This core group also forms the Initial Committers with write
> > > > > permissions to
> > > > > > > > the repository. Those marked with (*) meet weekly to plan the
> > > > research
> > > > > and
> > > > > > > engineering direction of the project.
> > > > > > >
> > > > > > > ==== Scientists That Love Engineering ====
> > > > > > >
> > > > > > > * Eshcar Hillel: Senior Research Scientist, Yahoo Labs, Israel.
> > > > > Interests:
> > > > > > > distributed systems, scalable systems and platforms for big
> data
> > > > > > > > processing, concurrent algorithms and data structures.
> > > > > > >
> > > > > > > * Kevin Lang: (*) Distinguished Research Scientist, Yahoo Labs,
> > > > > Sunnyvale,
> > > > > > > California. Interests: algorithms, theoretical and applied
> > > > mathematics,
> > > > > > > encoding and compression theory, theoretical and applied
> > performance
> > > > > > > optimization.
> > > > > > >
> > > > > > > * Edo Liberty: (*) Director of Research, Head of Amazon AI
> Labs,
> > Palo
> > > > > Alto,
> > > > > > > California. Manages the algorithms group at Amazon AI. We build
> > > > > scalable
> > > > > > > machine learning systems and algorithms which are used both
> > > > internally
> > > > > and
> > > > > > > externally by customers of SageMaker, AWS's flagship machine
> > learning
> > > > > > > platform.
> > > > > > >
> > > > > > > * Jon Malkin: (*) Senior Scientist, Yahoo Labs, Sunnyvale.
> > Interests:
> > > > > > > Computational advertising, machine learning, speech
> recognition,
> > > > > > > data-driven analysis, large scale experimentation, big data,
> > > > > stream/complex
> > > > > > > > event processing.
> > > > > > >
> > > > > > > * Justin Thaler: (*) Assistant Professor, Department of
> Computer
> > > > > Science,
> > > > > > > Georgetown University, Washington D.C. Interests: algorithms
> and
> > > > > > > computational complexity, complexity theory, quantum
> algorithms,
> > > > > private
> > > > > > > data analysis, and learning theory, developing efficient
> > streaming
> > > > and
> > > > > > > > sketching algorithms.
> > > > > > >
> > > > > > > ==== Engineers That Love Science ====
> > > > > > >
> > > > > > > > * Roman Leventov: Senior Software Engineer, Metamarkets /
> Snap.
> > > > > Interests:
> > > > > > > design and implementation of data storing and data processing
> > > > > (distributed)
> > > > > > > systems, performance optimization, CPU performance, mechanical
> > > > > sympathy,
> > > > > > > JVM performance, API design, databases, (concurrent) data
> > structures,
> > > > > > > memory management, garbage collection algorithms, language
> > design and
> > > > > > > runtimes (their tradeoffs), distributed systems (cloud)
> > efficiency,
> > > > > Linux,
> > > > > > > code quality, code transformation, pure functional programming
> > > > models,
> > > > > > > Haskell.
> > > > > > >
> > > > > > > * Lee Rhodes: (*) Distinguished Architect, lead developer and
> > founder
> > > > > of
> > > > > > > the DataSketches project, Yahoo, Sunnyvale, California.
> > Interests:
> > > > > > > streaming algorithms, mathematics, computer science, high
> > quality and
> > > > > high
> > > > > > > performance code for the analysis of massive data, bridging the
> > > > divide
> > > > > > > between theory and practice.
> > > > > > >
> > > > > > > * Alexander Saydakov: (*) Senior Software Engineer, Yahoo,
> > Sunnyvale,
> > > > > > > California. Interests: applied mathematics, computer science,
> big
> > > > data,
> > > > > > > distributed systems.
> > > > > > >
> > > > > > > === Introduction to Additional Interested Contributors ===
> > > > > > >
> > > > > > > These folks have been intermittently involved and contributed,
> > but
> > > > are
> > > > > > > strong supporters of this project.
> > > > > > >
> > > > > > > * Frank Grimes: GitHub ID: frankgrimes97
> > > > > > >
> > > > > > > * Mina Ghashami: [mina.ghashami at gmail dot com] Ph.D.
> Computer
> > > > > Science,
> > > > > > > Univ of Utah. Interests: Machine Learning, Data Mining, matrix
> > > > > > > approximation, streaming algorithms, randomized linear algebra.
> > > > > > >
> > > > > > > * Christopher Musco: [christopher.musco at gmail dot com] Ph.D.
> > > > > Computer
> > > > > > > Science, Research Instructor, Princeton University. Interests:
> > > > > algorithmic
> > > > > > > foundations of data science and machine learning, efficient
> > methods
> > > > for
> > > > > > > processing and understanding large datasets, often working at
> the
> > > > > > > intersection of theoretical computer science, numerical linear
> > > > > algebra, and
> > > > > > > optimization.
> > > > > > >
> > > > > > > * Graham Cormode: [g.cormode at warwick.ac dot uk] Ph.D.
> > Computer
> > > > > Science,
> > > > > > > Professor, Warwick University, Warwick, England. Interests: all
> > > > > aspects of
> > > > > > > the "data lifecycle", from data collection and cleaning,
> through
> > > > > mining and
> > > > > > > analytics. (Professor Cormode is one of the world’s leading
> > > > scientists
> > > > > in
> > > > > > > sketching algorithms)
> > > > > > >
> > > > > > > === Alignment ===
> > > > > > >
> > > > > > > The DataSketches library already provides integrations and
> > example
> > > > > code for
> > > > > > > Apache Hive, Apache Pig, Apache Spark and is deeply integrated
> > into
> > > > > Apache
> > > > > > > Druid.
> > > > > > >
> > > > > > > == Known Risks ==
> > > > > > >
> > > > > > > The following subsections are specific risks that have been
> > > > identified
> > > > > by
> > > > > > > the ASF that need to be addressed.
> > > > > > >
> > > > > > > === Risk: Orphaned Products ===
> > > > > > >
> > > > > > > The DataSketches library is presently used by a number of
> > > > > organizations,
> > > > > > > from small startups to Fortune 100 companies, to construct
> > production
> > > > > > > pipelines that must process and analyze massive data. Yahoo
> has a
> > > > > long-term
> > > > > > > commitment to continue to advance the DataSketches library;
> > moreover,
> > > > > > > DataSketches is seeing increasing interest, development, and
> > adoption
> > > > > from
> > > > > > > many diverse organizations from around the world. Due to its
> > growing
> > > > > > > adoption, we feel it is quite unlikely that this project would
> > become
> > > > > > > orphaned.
> > > > > > >
> > > > > > > === Risk: Inexperience with Open Source ===
> > > > > > >
> > > > > > > Yahoo believes strongly in open source and the exchange of
> > > > information
> > > > > to
> > > > > > > advance new ideas and work. Examples of this commitment are
> > active
> > > > open
> > > > > > > source projects such as those mentioned above. With
> > DataSketches, we
> > > > > have
> > > > > > > been increasingly open and forward-looking; we have published a
> > > > number
> > > > > of
> > > > > > > papers about breakthrough developments in the science of
> > streaming
> > > > > > > algorithms (mentioned above) that also reference the
> DataSketches
> > > > > library.
> > > > > > > Our submission to the Apache Software Foundation is a logical
> > > > > extension of
> > > > > > > our commitment to open source software.
> > > > > > >
> > > > > > > Key committers at Yahoo with strong open source backgrounds
> > include
> > > > > Aaron
> > > > > > > Gresch, Alan Carroll, Alessandro Bellina, Anastasia Braginsky,
> > > > Andrews
> > > > > > > Sahaya Albert, Arun S A G, Atul Mohan, Brad McMillen, Bryan
> Call,
> > > > Daryn
> > > > > > > Sharp, Dav Glass, David Carlin, Derek Dagit, Eric Payne, Eshcar
> > > > Hillel,
> > > > > > > Ethan Li, Fei Deng, Francis Christopher Liu, Francisco
> > > > Perez-Sorrosal,
> > > > > Gil
> > > > > > > > Yehuda, Govind Menon, Hang Yang, Jacob Estelle, Jai Asher,
> James
> > > > > Penick,
> > > > > > > Jason Kenny, Jay Pipes, Jim Rollenhagen, Joe Francis, Jon
> Eagles,
> > > > > Kihwal
> > > > > > > Lee, Kishorkumar Patil, Koji Noguchi, Kuhu Shukla, Michael
> > Trelinski,
> > > > > > > Mithun Radhakrishnan, Nathan Roberts, Ohad Shacham, Olga L.
> > > > Natkovich,
> > > > > > > Parth Kamlesh Gandhi, Rajan Dhabalia, Rohini Palaniswamy, Ruby
> > Loo,
> > > > > Ryan
> > > > > > > Bridges, Sanket Chintapalli, Satish Subhashrao Saley, Shu Kit
> > Chan,
> > > > Sri
> > > > > > > Harsha Mekala, Susan Hinrichs, Yonatan Gottesman, and many
> more.
> > > > > > >
> > > > > > > > All of our core developers are committed to learning about the
> > Apache
> > > > > process
> > > > > > > and to give back to the community.
> > > > > > >
> > > > > > > === Risk: Homogeneous Developers ===
> > > > > > >
> > > > > > > The majority of committers in this proposal belong to Yahoo due
> > to
> > > > the
> > > > > fact
> > > > > > > that DataSketches has emerged from an internal Yahoo project.
> > This
> > > > > proposal
> > > > > > > also includes developers and contributors from other companies,
> > and
> > > > > who are
> > > > > > > actively involved with other Apache projects, such as Druid.
> We
> > > > > expect our
> > > > > > > entry into incubation will allow us to expand the number of
> > > > > individuals and
> > > > > > > organizations participating in DataSketches development.
> > > > > > >
> > > > > > > === Risk: Reliance on Salaried Developers ===
> > > > > > >
> > > > > > > Because the DataSketches library originated within Yahoo, it
> has
> > been
> > > > > > > developed primarily by salaried Yahoo developers and we expect
> > that
> > > > to
> > > > > > > continue to be the case near term. However, since we placed
> this
> > > > > library
> > > > > > > into open-source we have had a number of significant
> > contributions
> > > > from
> > > > > > > engineers and scientists from outside of Yahoo. We expect our
> > > > reliance
> > > > > on
> > > > > > > Yahoo salaried developers will decrease over time. Nonetheless,
> > Yahoo
> > > > > is
> > > > > > > committed to continue its strong support of this important
> > project.
> > > > > > >
> > > > > > > === Risk: Lack of Relationship to other Apache Products ===
> > > > > > >
> > > > > > > DataSketches already directly interoperates with or utilizes
> > several
> > > > > > > existing Apache projects.
> > > > > > >
> > > > > > > * Build
> > > > > > >    * Apache Maven
> > > > > > >
> > > > > > > * Integrations and adaptors for the following projects
> naturally
> > have
> > > > > them
> > > > > > > as dependencies
> > > > > > >    * Apache Hive
> > > > > > >    * Apache Pig
> > > > > > >    * Apache Druid
> > > > > > >    * Apache Spark
> > > > > > >
> > > > > > > * Additional dependencies for the above integrations and
> adaptors
> > > > > include
> > > > > > >    * Apache Hadoop
> > > > > > >    * Apache Commons (Math)
> > > > > > >
> > > > > > > There is no other Apache project that we are aware of that
> > duplicates
> > > > > the
> > > > > > > functionality of the DataSketches library.
> > > > > > >
> > > > > > > === Risk: An Excessive Fascination with the Apache Brand ===
> > > > > > >
> > > > > > > With this proposal we are not seeking attention or publicity. Rather,
> > > > > > > we firmly believe in the DataSketches library and concept and the
> > > > > > > ability to make the DataSketches library a powerful, yet simple-to-use
> > > > > > > toolkit for data processing. While the DataSketches library has been
> > > > > > > open source, we believe putting code on GitHub can only go so far. We
> > > > > > > see the Apache community, processes, and mission as critical for
> > > > > > > ensuring the DataSketches library is truly community-driven,
> > > > > > > positively impactful, and innovative open source software. While Yahoo
> > > > > > > has taken a number of steps to advance its various open source
> > > > > > > projects, we believe the DataSketches library project is a great fit
> > > > > > > for the Apache Software Foundation due to its focus on data processing
> > > > > > > and its relationships to existing ASF projects.
> > > > > > >
> > > > > > > === Risk: Cryptography ===
> > > > > > >
> > > > > > > DataSketches does not contain any cryptographic code and is not a
> > > > > > > cryptographic product.
> > > > > > >
> > > > > > > == Documentation ==
> > > > > > >
> > > > > > > The following documentation is relevant to this proposal. Relevant
> > > > > > > portions of the documentation will be contributed to the Apache
> > > > > > > DataSketches project.
> > > > > > >
> > > > > > > * DataSketches website: https://datasketches.github.io.
> > > > > > >
> > > > > > > * DataSketches website repository:
> > > > > > > https://github.com/DataSketches/DataSketches.github.io
> > > > > > >
> > > > > > > We will need an Apache website for this documentation similar to
> > > > > > >
> > > > > > > * https://datasketches.apache.org
> > > > > > >
> > > > > > > == Initial Source ==
> > > > > > >
> > > > > > > The initial source for DataSketches which we will submit to the Apache
> > > > > > > Foundation will include a number of repositories which are currently
> > > > > > > hosted under the GitHub.com/datasketches organization:
> > > > > > >
> > > > > > > All github.com/datasketches repositories including:
> > > > > > >
> > > > > > > * Java
> > > > > > >    * sketches-core: This repository has the core sketching classes,
> > > > > > > which are leveraged by some of the other repositories. This repository
> > > > > > > has no external dependencies outside of the DataSketches/memory
> > > > > > > repository, Java and TestNG for unit tests. This code is versioned and
> > > > > > > the latest release can be obtained from Maven Central.
> > > > > > >    * memory: Low level, high-performance memory data-structure
> > > > > > > management primarily for off-heap.
> > > > > > >    * sketches-android: This is a new repository dedicated to sketches
> > > > > > > designed to be run in a mobile client, such as a cell phone. It is
> > > > > > > still in development and should be considered experimental.
> > > > > > >    * sketches-hive: This repository contains Hive UDFs and UDAFs for
> > > > > > > use within Hadoop grid environments. This code has dependencies on
> > > > > > > sketches-core as well as Hadoop and Hive. Users of this code are
> > > > > > > advised to use Maven to bring in all the required dependencies. This
> > > > > > > code is versioned and the latest release can be obtained from Maven
> > > > > > > Central.
> > > > > > >    * sketches-pig: This repository contains Pig User Defined Functions
> > > > > > > (UDF) for use within Hadoop grid environments. This code has
> > > > > > > dependencies on sketches-core as well as Hadoop and Pig. Users of this
> > > > > > > code are advised to use Maven to bring in all the required
> > > > > > > dependencies. This code is versioned and the latest release can be
> > > > > > > obtained from Maven Central.
> > > > > > >    * sketches-vector: This is a new repository dedicated to sketches
> > > > > > > for vector and matrix operations. It is still somewhat experimental.
> > > > > > >    * characterization: This relatively new repository is for code that
> > > > > > > we use to characterize the accuracy and speed performance of the
> > > > > > > sketches in the library and is constantly being updated. Examples of
> > > > > > > the job command files used for various tests can be found in the
> > > > > > > src/main/resources directory. Some of these tests can run for hours
> > > > > > > depending on their configuration.
> > > > > > >    * experimental: This repository is an experimental staging area for
> > > > > > > code that will eventually end up in another repository. This code is
> > > > > > > not versioned and not registered with Maven Central.
> > > > > > >    * sketches-misc: Demos and other code not related to production
> > > > > > > deployment
> > > > > > >
> > > > > > > * C++ and Python
> > > > > > >    * sketches-core-cpp: This is the C++/Python companion to the Java
> > > > > > > sketches-core. These implementations are binary compatible with their
> > > > > > > counterparts in Java. In other words, a sketch created and stored in
> > > > > > > C++ can be opened and read in Java and vice versa. This site also has
> > > > > > > our Python adaptors that basically wrap the C++ implementations,
> > > > > > > making the high performance C++ implementations available from Python.
> > > > > > >    * sketches-postgres: This site provides the postgres-specific
> > > > > > > adaptors that wrap the C++ implementations making them available to
> > > > > > > the Postgres database users.
> > > > > > >    * characterization-cpp: This is the C++/Python companion to the
> > > > > > > Java characterization repository.
> > > > > > >    * experimental-cpp: This repository is an experimental staging area
> > > > > > > for C++ code that will eventually end up in another repository.
> > > > > > >
> > > > > > > * Command-Line Tools
> > > > > > >    * sketches-cmd
> > > > > > >    * homebrew-sketches
> > > > > > >    * homebrew-sketches-cmd
> > > > > > >
> > > > > > > These projects have always been Apache 2.0 licensed. We intend to
> > > > > > > bundle all of these repositories since they are all complementary and
> > > > > > > should be maintained in one project. Prior to our submission, we will
> > > > > > > combine all of these projects into a new git repository.
> > > > > > >
> > > > > > > == Source and Intellectual Property Submission Plan ==
> > > > > > >
> > > > > > > Contributors to the DataSketches project have also signed the Yahoo
> > > > > > > Individual Contributor License Agreement
> > > > > > > (https://yahoocla.herokuapp.com/) in order to contribute to the
> > > > > > > project.
> > > > > > >
> > > > > > > With respect to trademark rights, Yahoo does not hold a trademark on
> > > > > > > the phrase “DataSketches.” Based on feedback and guidance we receive
> > > > > > > during the incubation process, we are open to renaming the project if
> > > > > > > necessary for trademark or other concerns, but we would prefer not to
> > > > > > > have to do that.
> > > > > > >
> > > > > > > == External Dependencies ==
> > > > > > >
> > > > > > > All external dependencies are licensed under an Apache 2.0 or
> > > > > > > Apache-compatible license. As we grow the DataSketches community we
> > > > > > > will configure our build process to require and validate that all
> > > > > > > contributions and dependencies are licensed under the Apache 2.0
> > > > > > > license or an Apache-compatible license.
> > > > > > >
> > > > > > > == Required Resources ==
> > > > > > >
> > > > > > > === Mailing Lists ===
> > > > > > >
> > > > > > > We currently use a mix of mailing lists. We will migrate our existing
> > > > > > > mailing lists to the following:
> > > > > > >
> > > > > > > * dev@datasketches.incubator.apache.org
> > > > > > >
> > > > > > > * user@datasketches.incubator.apache.org
> > > > > > >
> > > > > > > * private@datasketches.incubator.apache.org
> > > > > > >
> > > > > > > * commits@datasketches.incubator.apache.org
> > > > > > >
> > > > > > > === Source Control ===
> > > > > > >
> > > > > > > The DataSketches team currently uses Git and would like to continue to
> > > > > > > do so. We request a Git repository for DataSketches with mirroring to
> > > > > > > GitHub enabled, similar to the following:
> > > > > > >
> > > > > > > * https://github.com/apache/incubator-datasketches.git
> > > > > > >
> > > > > > > === Issue Tracking ===
> > > > > > >
> > > > > > > We request the creation of an Apache-hosted JIRA. The DataSketches
> > > > > > > project is currently using the public GitHub issue tracker and the
> > > > > > > public Google Groups forum/sketches-user for issue tracking and
> > > > > > > discussions. We will migrate and combine the content of these two
> > > > > > > sources into the Apache JIRA.
> > > > > > >
> > > > > > > Proposed Jira ID: DATASKETCHES
> > > > > > >
> > > > > > > == Initial Committers ==
> > > > > > >
> > > > > > > The following individuals have been extremely active in our community
> > > > > > > and should have write (commit) permissions to the repository.
> > > > > > >
> > > > > > > * Eshcar Hillel [eshcar at verizonmedia dot com]
> > > > > > >
> > > > > > > * Kevin Lang [langk at verizonmedia dot com]
> > > > > > >
> > > > > > > * Roman Leventov [roman.leventov at c.metamarkets dot com]
> > > > > > >
> > > > > > > * Edo Liberty [libertye at amazon dot com]
> > > > > > >
> > > > > > > * Jon Malkin [jmalkin at verizonmedia dot com]
> > > > > > >
> > > > > > > * Lee Rhodes [lrhodes at verizonmedia dot com] &
> > > > > > > [leerho at gmail dot com]
> > > > > > >
> > > > > > > * Alexander Saydakov [saydakov at verizonmedia dot com]
> > > > > > >
> > > > > > > * Justin Thaler [justin.thaler at georgetown dot edu]
> > > > > > >
> > > > > > > == Affiliations ==
> > > > > > >
> > > > > > > The initial committers are from four organizations: Yahoo, Amazon,
> > > > > > > Georgetown University, and Metamarkets/Snap.
> > > > > > >
> > > > > > > === Champion ===
> > > > > > > (Recommended to me: )
> > > > > > >
> > > > > > > Liang Chen, Vice President of Apache CarbonData,
> > > > > > > [chenliang613 at apache dot org]
> > > > > > > Jean-Baptiste Onofré, [jb at nanthrax dot net]
> > > > > > >
> > > > > > > === Nominated Mentors ===
> > > > > > > (Recommended to me: )
> > > > > > >
> > > > > > > Liang Chen, Vice President of Apache CarbonData,
> > > > > > > [chenliang613 at apache dot org]
> > > > > > > Jean-Baptiste Onofré, jb at nanthrax dot net
> > > > > > > Gil Yehuda, gyehuda at verizonmedia dot com
> > > > > > >
> > > > > > > === Sponsoring Entity ===
> > > > > > >
> > > > > > > * The Apache Incubator    **** This is our 1st choice ****
> > > > > > >
> > > > > > > * Apache Druid. The incubating Apache Druid project might also be a
> > > > > > > logical sponsor. However, DataSketches has applications in many areas
> > > > > > > of computing outside of Druid, so our preference and recommendation is
> > > > > > > that DataSketches would ultimately be a top-level Apache project.
> > > > > > >
> > > > > > > ________________
> > > > > > > [1] In 2017 Verizon acquired Yahoo and merged it with previously
> > > > > > > acquired AOL. The merged entity was originally called Oath, Inc., but
> > > > > > > has recently been renamed Verizon Media, Inc., a wholly-owned
> > > > > > > subsidiary of Verizon, Inc.  Since Yahoo is the more recognized name,
> > > > > > > references in this document to Yahoo are also references to Verizon
> > > > > > > Media, Inc.
> > > > > > >
> > > > > > > On Fri, Feb 22, 2019 at 9:35 PM Kenneth Knowles <kenn@apache.org> wrote:
> > > > > > >
> > > > > > > > The subject line has me interested already. Follow examples like this
> > > > > > > > maybe?
> > > > > > > >
> > > > > > > > 1. https://lists.apache.org/thread.html/a5db74cc9e5ae89b3bfa5f4b07bfcc18dae84b7098232fb897cd47b7@%3Cgeneral.incubator.apache.org%3E
> > > > > > > > 2. https://lists.apache.org/thread.html/5a7f6a218b11a1cac61fbd53f4c995fd7716f8ad3751cf9f171ebd57@%3Cgeneral.incubator.apache.org%3E
> > > > > > > >
> > > > > > > > Kenn
> > > > > > > >
> > > > > > > > On Fri, Feb 22, 2019 at 8:05 PM leerho <le...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > I'll try again ... :)
> > > > > > > > >
> > > > > > > > > On Fri, Feb 22, 2019 at 8:00 PM Ted Dunning <ted.dunning@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > >> It didn't make it again
> > > > > > > > >>
> > > > > > > > >> On Fri, Feb 22, 2019, 8:35 PM leerho <le...@gmail.com>
> > wrote:
> > > > > > > > >>
> > > > > > > > >> > I'm not sure the attached document made it through.
> > > > > > > > >> >
> > > > > > > > >> > On Fri, Feb 22, 2019 at 7:28 PM leerho <
> leerho@gmail.com>
> > > > > wrote:
> > > > > > > > >> >
> > > > > > > > >> > >
> > > > > > > > >> > >
> > > > > > > > >> >
> > > > > > > > >>
> > > > > > > > >
> > > > > > > > >
> > > > >
> ---------------------------------------------------------------------
> > > > > > > > > To unsubscribe, e-mail:
> > general-unsubscribe@incubator.apache.org
> > > > > > > > > For additional commands, e-mail:
> > > > general-help@incubator.apache.org
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > > --
> > > From my cell phone.
> > >
> >
> >
> >
>

Re: DataSketches Proposal - Google Docs Link

Posted by Kenneth Knowles <ke...@apache.org>.

I could not access that document. I suggest you need to turn on link
sharing.

Kenn

On Mon, Feb 25, 2019 at 12:00 PM leerho@gmail.com <le...@gmail.com> wrote:

> Try this link:
> https://docs.google.com/document/d/19JKevzFQNcaLA51LFLUlP1hzdFDW7oDJrJO8N6weDv8/edit?usp=sharing
>
>
> On 2019/02/25 05:55:50, leerho <le...@gmail.com> wrote:
> > Yes I will try that tomorrow.
> >
> > On Sun, Feb 24, 2019 at 7:34 PM Kenneth Knowles <ke...@apache.org> wrote:
> >
> > > Can you share the Google doc with the proposal? Per Ted's advice, we can
> > > iterate quickly there and move it to the wiki when it becomes a bit more
> > > stable.
> > >
> > > Kenn
> > >
> > > On Fri, Feb 22, 2019 at 10:21 PM leerho@gmail.com <le...@gmail.com>
> > > wrote:
> > >
> > > > Thanks for the offer.  I am a neophyte at this process and email app!   I
> > > > could use a lot of help getting this off the ground!  Also, I'm not sure
> > > > that Mr. Chen and Mr. Onofré have fully accepted taking this on :)
> > > >
> > > > Lee.
> > > >
> > > > On 2019/02/23 06:03:58, Kenneth Knowles <ke...@apache.org> wrote:
> > > > > Nice.
> > > > >
> > > > > I would very much like to help mentor this project, though you already have
> > > > > a couple good ones.
> > > > >
> > > > > I concur with incubator as sponsoring entity.
> > > > >
> > > > > Kenn (VP Apache Beam)
> > > > >
> > > > > On Fri, Feb 22, 2019 at 9:45 PM leerho <le...@gmail.com> wrote:
> > > > >
> > > > > > I didn't realize that this mail list does not accept PDF files, apparently
> > > > > > only text.  So let me try one more time ... :)  Please let me know if
> > > > > > this works!
> > > > > >
> > > > > >
> > > > > > = Apache DataSketches Proposal[1] =
> > > > > >
> > > > > > == Abstract ==
> > > > > >
> > > > > > DataSketches.GitHub.io is an open source, high-performance library of
> > > > > > stochastic streaming algorithms commonly called "sketches" in the data
> > > > > > sciences. Sketches are small, stateful programs that process massive data
> > > > > > as a stream and can provide approximate answers, with mathematical
> > > > > > guarantees, to computationally difficult queries orders-of-magnitude faster
> > > > > > than traditional, exact methods.
> > > > > >
> > > > > > This proposal is to move DataSketches to the Apache Software
> > > > > > Foundation (ASF), transferring ownership of its copyright intellectual
> > > > > > property to the ASF.  Thereafter, DataSketches would be officially known as
> > > > > > Apache DataSketches and its evolution and governance would come under the
> > > > > > rules and guidance of the ASF.
> > > > > >
> > > > > > == Introduction ==
> > > > > >
> > > > > > The DataSketches library contains carefully crafted implementations of
> > > > > > sketch algorithms that meet rigorous standards of quality and performance
> > > > > > and provide capabilities required for large-scale production systems that
> > > > > > must process and analyze massive data. The DataSketches core repository is
> > > > > > written in Java with a parallel core repository written in C++ that
> > > > > > includes Python wrappers. The DataSketches library also includes special
> > > > > > repositories for extending the core library for Apache Hive and Apache Pig.
> > > > > > The sketches developed in the different languages share a common binary
> > > > > > storage format so that sketches created and stored in Java, for example,
> > > > > > can be fully used in C++, and vice versa.  Because the stored sketch
> > > > > > "images" are just a "blob" of bytes (similar to picture images), they can
> > > > > > be shared across many different systems, languages and platforms.
> > > > > >
> > > > > > The DataSketches documentation website, https://datasketches.github.io,
> > > > > > includes general tutorials, a comprehensive research section with
> > > > > > references to relevant academic papers, extensive examples for using the
> > > > > > core library directly as well as examples for accessing the library in
> > > > > > Hive, Pig, and Apache Spark.
> > > > > >
> > > > > > The DataSketches library also includes a characterization repository for
> > > > > > long-running test programs that are used for studying accuracy and
> > > > > > performance of these sketches over wide ranges of input variables. The data
> > > > > > produced by these programs is used for generating the many performance
> > > > > > plots contained in the documentation website and for academic
> > > > > > publications.
> > > > > >
> > > > > > The code repositories used for production are versioned and published to
> > > > > > Maven Central at periodic intervals as the library evolves.
> > > > > >
> > > > > > The DataSketches library also includes several experimental repositories
> > > > > > for use-cases outside the large-scale systems environments, such as
> > > > > > sketches for mobile, IoT devices (Android), command-line access of the
> > > > > > sketch library, and an experimental repository for vector-based sketches
> > > > > > that performs approximate Singular Value Decomposition (SVD) analysis that
> > > > > > could potentially be used in Machine Learning (ML) applications.
> > > > > >
> > > > > > == Background ==
> > > > > >
> > > > > > The DataSketches library was started in 2012 as an internal Yahoo project
> > > > > > to dramatically reduce the time and resources required for distinct
> > > > > > (unique) counting.  An extensive search on the Internet at the time yielded
> > > > > > a number of theoretical papers on stochastic streaming algorithms with
> > > > > > pseudocode examples, but we did not find any usable open-source code of the
> > > > > > quality we felt we needed for our internal production systems.  So we
> > > > > > started a small project (one person) to develop our own sketches working
> > > > > > directly from published theoretical papers.
> > > > > >
> > > > > > The DataSketches library was designed from the start with the objective of
> > > > > > making these algorithms, usually only described in theoretical papers,
> > > > > > easily accessible to systems developers for use in our internal production
> > > > > > systems. By necessity, the code had to be of the highest quality and
> > > > > > thoroughly tested. The wide variety of our internal production systems
> > > > > > drove the requirement that the sketch implementations had to have an
> > > > > > absolute minimum of external, run-time dependencies in order to simplify
> > > > > > integration and troubleshooting.
> > > > > >
> > > > > > Our internal experiments demonstrated a dramatic positive impact on the
> > > > > > performance of our systems.  As a result, the DataSketches library quickly
> > > > > > evolved to include different types of sketches for different types of
> > > > > > queries, such as frequent-items (a.k.a. heavy-hitters) algorithms,
> > > > > > quantile/histogram algorithms, and weighted and unweighted sampling
> > > > > > algorithms.
> > > > > >
> > > > > > We quickly discovered that developing these sketch algorithms to be truly
> > > > > > robust in production environments is quite difficult and requires deep
> > > > > > understanding of the underlying mathematics and statistics as well as
> > > > > > extensive experience in developing high quality code for 24/7 production
> > > > > > systems. This is a difficult combination of skills for any one organization
> > > > > > to collect and maintain over time. It became clear that this technology
> > > > > > needed a community larger than Yahoo to evolve.  In November 2015, this
> > > > > > factor, along with Yahoo’s strong experience and support of open source,
> > > > > > led to the decision to open source this technology under an Apache 2.0
> > > > > > license on GitHub. Since that time our community has expanded considerably
> > > > > > and the key contributors to this effort include leading research
> > > > > > scientists from a number of universities as well as practitioners and
> > > > > > researchers from a number of major corporations. The core of this group is
> > > > > > very active as we meet weekly to discuss research directions and
> > > > > > engineering priorities.
> > > > > >
> > > > > > It is important to note that our internal systems at Yahoo use the current
> > > > > > public GitHub open source DataSketches library and not an internal version
> > > > > > of the code.
> > > > > >
> > > > > > The close collaboration of scientific research and engineering development
> > > > > > experience with actual massive-data processing systems has also produced
> > > > > > new research publications in the field of stochastic streaming algorithms,
> > > > > > for example:
> > > > > >
> > > > > > * Daniel Anderson, Pryce Bevan, Kevin J. Lang, Edo Liberty, Lee Rhodes, and
> > > > > > Justin Thaler. A high-performance algorithm for identifying frequent items
> > > > > > in data streams. In ACM IMC 2017.
> > > > > >
> > > > > > * Anirban Dasgupta, Kevin J. Lang, Lee Rhodes, and Justin Thaler. A
> > > > > > framework for estimating stream expression cardinalities. In EDBT/ICDT
> > > > > > Proceedings '16, pages 6:1–6:17, 2016.
> > > > > >
> > > > > > * Mina Ghashami, Edo Liberty, Jeff M. Phillips. Efficient Frequent
> > > > > > Directions Algorithm for Sparse Matrices. In ACM SIGKDD Proceedings '16,
> > > > > > pages 845–854, 2016.
> > > > > >
> > > > > > * Zohar S. Karnin, Kevin J. Lang, and Edo Liberty. Optimal quantile
> > > > > > approximation in streams. In IEEE FOCS Proceedings '16, pages 71–78, 2016.
> > > > > >
> > > > > > * Kevin J Lang. Back to the future: an even more nearly optimal cardinality
> > > > > > estimation algorithm. arXiv preprint https://arxiv.org/abs/1708.06839,
> > > > > > 2017.
> > > > > >
> > > > > > * Edo Liberty. Simple and deterministic matrix sketching. In ACM KDD
> > > > > > Proceedings '13, pages 581–588, 2013.
> > > > > >
> > > > > > * Edo Liberty, Michael Mitzenmacher, Justin Thaler, and Jonathan Ullman.
> > > > > > Space lower bounds for itemset frequency sketches. In ACM PODS Proceedings
> > > > > > '16, pages 441–454, 2016.
> > > > > >
> > > > > > * Michael Mitzenmacher, Thomas Steinke, and Justin Thaler. Hierarchical
> > > > > > heavy hitters with the space saving algorithm. In SIAM ALENEX Proceedings
> > > > > > '12, pages 160–174, 2012.
> > > > > >
> > > > > > == The Rationale for Sketches ==
> > > > > >
> > > > > > In the analysis of big data there are often problem queries that don’t
> > > > > > scale because they require huge compute resources and time to generate
> > > > > > exact results. Examples include count distinct, quantiles, most frequent
> > > > > > items, joins, matrix computations, and graph analysis.
> > > > > >
> > > > > > If we can loosen the requirement of “exact” results from our queries and be
> > > > > > satisfied with approximate results, within some well understood bounds of
> > > > > > error, there is an entire branch of mathematics and data science that has
> > > > > > evolved around developing algorithms that can produce approximate results
> > > > > > with mathematically well-defined error properties.
> > > > > >
> > > > > > Adding the requirements that these algorithms be small (compared to the
> > > > > > size of the input data), sublinear (the size of the sketch must grow at a
> > > > > > slower rate than the size of the input stream), streaming (they can only
> > > > > > touch each data item once), and mergeable (suitable for distributed
> > > > > > processing) defines a class of algorithms that can be described as small,
> > > > > > stochastic, streaming, sublinear, mergeable algorithms, commonly called
> > > > > > sketches (they also have other names, but we will use the term sketches
> > > > > > from here on).
> > > > > >
> > > > > > To be truly streaming and be able to process data in a single pass,
> > > > > > sketches must make an absolute minimum of assumptions about the input
> > > > > > stream. This is critically important, as there is no “second chance” to
> > > > > > process the data.
> > > > > >
> > > > > > For example, sketches should not make assumptions about the order of stream
> > > > > > items, the stream length, the dynamic range of values, or the distribution
> > > > > > of item occurrence frequencies. Sketches should be tolerant of NaNs, Nulls
> > > > > > and empty objects. About the only thing that the sketch needs to know about
> > > > > > the stream is how to extract items from it and what type the item is, e.g.,
> > > > > > is it a numeric value or a string.
> > > > > >
> > > > > > As far as the sketch is concerned, the input stream is a sequence of items
> > > > > > in some unknown random order with unknown random values.
> > > > > >
> > > > > > The sketch is essentially a complex state machine and, combined with the
> > > > > > random input stream, defines a stochastic process. We then apply
> > > > > > probabilistic methods to interpret the states of the stochastic process in
> > > > > > order to extract useful information about the input stream itself. The
> > > > > > resulting information will be approximate, but we also use additional
> > > > > > probabilistic methods to extract an estimate of the likely probability
> > > > > > distribution of error.
> > > > > >
> > > > > > There is a significant scientific contribution here in defining the
> > > > > > state machine, understanding the resulting stochastic process, developing
> > > > > > the probabilistic methods, and proving mathematically that it all works!
> > > > > > This is why the scientific contributors to this project are a critical and
> > > > > > strategic component of our success.  The development engineers translate
> > > > > > the concepts of the proposed state machine and probabilistic methods into
> > > > > > production-quality code. Even more important, they work closely with the
> > > > > > scientists, feeding back system and user requirements, which leads not only
> > > > > > to superior product design, but to new science as well.  A number of the
> > > > > > scientific papers our members have published (see above) are a direct
> > > > > > result of this close collaboration.
> > > > > >
> > > > > > Because sketches are small they can be processed extremely fast, often many
> > > > > > orders-of-magnitude faster than traditional exact computations. For
> > > > > > interactive queries there may not be other viable alternatives, and in the
> > > > > > case of real-time analysis, sketches are the only known solution.
> > > > > >
> > > > > > For any system that needs to extract useful information from massive data,
> > > > > > sketches are essential tools that should be tightly integrated into the
> > > > > > system’s analysis capabilities. This technology has helped Yahoo
> > > > > > successfully reduce data processing times from days to hours or minutes on
> > > > > > a number of its internal platforms and has enabled subsecond queries on
> > > > > > real-time platforms that would have been infeasible without sketches.
> > > > > >
> > > > > > == The Rationale for Apache DataSketches ==
> > > > > >
> > > > > > Other open source implementations of sketch algorithms can be found on the
> > > > > > Internet. However, we have not yet found any open source implementations
> > > > > > that are as comprehensive, engineered with the quality required for
> > > > > > production systems, and with usable and guaranteed error properties. Large
> > > > > > Internet companies, such as Google and Facebook, have published papers on
> > > > > > sketching; however, their implementations of their published algorithms are
> > > > > > proprietary and not available as open source.
> > > > > >
> > > > > > The DataSketches library already provides integrations with a number of
> > > > > > major Apache data processing platforms such as Apache Hive, Apache Pig,
> > > > > > Apache Spark and Apache Druid, and is also integrated with a number of
> > > > > > other open source data processing platforms such as Splice Machine, GCHQ
> > > > > > Gaffer and PostgreSQL.
> > > > > >
> > > > > > We believe that having DataSketches as an Apache project will provide an
> > > > > > immediate, worthwhile, and substantial contribution to the open source
> > > > > > community, and will have a better opportunity to provide a meaningful
> > > > > > contribution to both the science and engineering of sketching algorithms
> > > > > > and to integrate with other Apache projects.  In addition, this is a
> > > > > > significant opportunity for Apache to be the "go-to" destination for users
> > > > > > that want to leverage this exciting technology.
> > > > > >
> > > > > > == Initial Goals ==
> > > > > >
> > > > > > We are breaking our initial goals into short-term (2-6 months)
> and
> > > > > > > intermediate to long-term (6 months to 2 years):
> > > > > >
> > > > > > Our short-term goals include:
> > > > > >
> > > > > > * Understanding and adapting to the Apache development process
> and
> > > > > > structures.
> > > > > >
> > > > > > > * Start refactoring the codebase and moving the code from the various
> > > > > > > DataSketches repositories to the Apache Git repository.
> > > > > >
> > > > > > * Continue development of new features, functions, and fixes.
> > > > > >
> > > > > > * Specific sub-projects (e.g., C++ and Python) will continue to
> be
> > > > > > developed and expanded.
> > > > > >
> > > > > >
> > > > > > > The intermediate to long-term goals include:
> > > > > >
> > > > > > * Completing the design and implementation of the C++ sketches to
> > > > > > complement what is already available in Java, and the Python
> wrappers
> > > > of
> > > > > > those C++ sketches.
> > > > > >
> > > > > > * Expanding the C++ build framework to include Windows and the
> > > popular
> > > > > > Linux variants.
> > > > > >
> > > > > > * Continued engagement with the scientific research community on
> the
> > > > > > development of new algorithms for computationally difficult
> problems
> > > > that
> > > > > > heretofore have not had a sketching solution.
> > > > > >
> > > > > > == Current Status ==
> > > > > >
> > > > > > The DataSketches GitHub project has been quite successful.  As of
> > > this
> > > > > > > writing (February 2019), the number of downloads measured by the Nexus
> > > > > > Repository Manager at https://oss.sonatype.org has grown by
> nearly a
> > > > > > factor
> > > > > > of 10 over the past year to about 55 thousand per month. The
> > > > > > DataSketches/sketches-core repository has about 560 stars and 141
> > > > forks,
> > > > > > which is pretty good for a highly specialized library.
> > > > > >
> > > > > > === Development Practices ===
> > > > > >
> > > > > > ==== Source Control ====
> > > > > >
> > > > > > All of our developers have extensive experience with Git version
> > > > control
> > > > > > and follow accepted practices for use of Pull Requests (PRs),
> code
> > > > reviews
> > > > > > and commits to master, for example.
> > > > > >
> > > > > > ==== Testing ====
> > > > > >
> > > > > > > Sketches, by their nature, are probabilistic programs and don’t
> > > > necessarily
> > > > > > behave deterministically.  For some of the sketches we
> intentionally
> > > > insert
> > > > > > random noise into the code as this gives us the mathematical
> > > properties
> > > > > > that we need to guarantee accuracy.  This can make the behavior
> of
> > > > these
> > > > > > algorithms quite unintuitive and provides significant challenges
> to
> > > the
> > > > > > developer who wishes to test these algorithms for correctness.
> As a
> > > > result,
> > > > > > our testing strategy includes two major components: unit tests,
> and
> > > > > > characterization tests.
> > > > > >
> > > > > > ===== Unit Testing =====
> > > > > >
> > > > > > Our unit tests are primarily quick tests to make sure that we
> > > exercise
> > > > all
> > > > > > critical paths in the code and that key branches are executed
> > > > correctly. It
> > > > > > is important that they execute relatively fast as they are
> generally
> > > > run on
> > > > > > every code build. The sketches-core repository alone has about 22
> > > > thousand
> > > > > > statements, over 1300 unit tests and code coverage of about
> 98.2% as
> > > > > > measured by Atlassian/Clover.  It is our goal for all of our code
> > > > > > repositories that are used in production that they have code
> coverage
> > > > > > greater than 90%.
> > > > > >
> > > > > > ===== Characterization Testing =====
> > > > > >
> > > > > > In order to test the probabilistic methods that are used to
> interpret
> > > > the
> > > > > > stochastic behaviors of our sketches we have a separate
> > > > characterization
> > > > > > repository that is dedicated to this.  To measure accuracy, for
> > > > example,
> > > > > > requires running thousands of trials at each of many different
> points
> > > > along
> > > > > > the domain axis. Each trial compares its estimated results
> against a
> > > > known
> > > > > > exact result producing an error for that trial.  These error
> > > > measurements
> > > > > > are then fed into our Quantiles sketch to capture the actual
> > > > distribution
> > > > > > of error at that point along the axis. We then select quantile
> > > contours
> > > > > > across all the distributions at points along the axis.  These
> > > contours
> > > > can
> > > > > > then be plotted to reveal the shape of the actual error
> distribution.
> > > > These
> > > > > > > distributions are not at all Gaussian; in fact, they can be quite
> > > > complex.
> > > > > > Nonetheless, these distributions are then checked against our
> > > > statistical
> > > > > > guarantees inherent to the specific sketch algorithm and its
> > > > parameters.
> > > > > > There are many examples of these characterization error
> distributions
> > > > on
> > > > > > our website. The runtimes of these tests can be very long and can
> > > range
> > > > > > from many minutes to hours, and some can run for days.
> Currently, we
> > > > have
> > > > > > separate characterization repositories for Java and C++ / Python.
> > > > > >
> > > > > > It is our goal that we perform this characterization analysis
> for all
> > > > of
> > > > > > our sketches.  By definition, the code that runs these
> > > characterization
> > > > > > tests is open-source so others can run these tests as well.  We
> do
> > > not
> > > > have
> > > > > > formal releases of this code (because it is not production code)
> and
> > > > it is
> > > > > > not published to Maven Central.
> > > > > >
> > > > > > === Meritocracy ===
> > > > > >
> > > > > > DataSketches was initially developed based on requirements within
> > > > Yahoo. As
> > > > > > a project on GitHub, DataSketches has received contributions from
> > > > numerous
> > > > > > individual developers from around the world, dedicated research
> work
> > > > from
> > > > > > senior scientists at Amazon and Visa, and academic researchers
> from
> > > > > > Georgetown University, Princeton, and MIT.
> > > > > >
> > > > > > As a project under incubation, we are committed to expanding our
> > > > effort to
> > > > > > build an environment which supports a meritocracy. We are
> focused on
> > > > > > engaging the community and other related projects for support and
> > > > > > > contributions. Moreover, we are committed to ensuring that contributors
> and
> > > > > > committers to DataSketches come from a broad mix of organizations
> > > > through a
> > > > > > merit-based decision process during incubation. We believe
> strongly
> > > in
> > > > the
> > > > > > DataSketches premise that fulfills the concept of a well
> engineered
> > > and
> > > > > > scientifically rigorous library that implements these powerful
> > > > algorithms
> > > > > > and are committed to growing an inclusive community of
> DataSketches
> > > > > > contributors and users.
> > > > > >
> > > > > > === Community ===
> > > > > >
> > > > > > Yahoo has a long history and active engagement in the Open Source
> > > > > > community. Major projects include: Vespa.ai, Bullet, Moloch,
> > > Panoptes,
> > > > > > Screwdriver.cd, Athenz, HaloDB, Maha, Mendel, TensorFlowOnSpark,
> > > > gifshot,
> > > > > > fluxible, as well as the creation, contribution and incubation of
> > > many
> > > > > > Apache projects such as Apache Hadoop, Pig, Bookkeeper, Oozie,
> > > > Zookeeper,
> > > > > > Omid, Pulsar, Traffic Server, Storm, Druid, and many more.
> > > > > >
> > > > > > > Every day, DataSketches is actively used by organizations and
> > > > > > institutions around the world for batch and stream processing of
> > > data.
> > > > We
> > > > > > believe acceptance will allow us to consolidate existing
> > > > > > DataSketches-related work, grow the DataSketches community, and
> > > deepen
> > > > > > connections between DataSketches and other open source projects.
> > > > > >
> > > > > > === Introduction to the Core Developers & Contributors ===
> > > > > >
> > > > > > The core developers and contributors for DataSketches are from
> > > diverse
> > > > > > backgrounds, but primarily are scientists that love engineering
> and
> > > > > > engineers that love science. A large part of the value we bring
> comes
> > > > from
> > > > > > this synthesis.  These individuals have already contributed
> > > > substantially
> > > > > > to the code, algorithms, and/or mathematical proofs that form the
> > > > basis of
> > > > > > the library.
> > > > > >
> > > > > > > This core group also forms the Initial Committers with write
> > > > permissions to
> > > > > > > the repository. Those marked with (*) meet weekly to plan the
> > > research
> > > > and
> > > > > > engineering direction of the project.
> > > > > >
> > > > > > ==== Scientists That Love Engineering ====
> > > > > >
> > > > > > * Eshcar Hillel: Senior Research Scientist, Yahoo Labs, Israel.
> > > > Interests:
> > > > > > distributed systems, scalable systems and platforms for big data
> > > > > > > processing, concurrent algorithms and data structures.
> > > > > >
> > > > > > * Kevin Lang: (*) Distinguished Research Scientist, Yahoo Labs,
> > > > Sunnyvale,
> > > > > > California. Interests: algorithms, theoretical and applied
> > > mathematics,
> > > > > > encoding and compression theory, theoretical and applied
> performance
> > > > > > optimization.
> > > > > >
> > > > > > * Edo Liberty: (*) Director of Research, Head of Amazon AI Labs,
> Palo
> > > > Alto,
> > > > > > California. Manages the algorithms group at Amazon AI. We build
> > > > scalable
> > > > > > machine learning systems and algorithms which are used both
> > > internally
> > > > and
> > > > > > externally by customers of SageMaker, AWS's flagship machine
> learning
> > > > > > platform.
> > > > > >
> > > > > > * Jon Malkin: (*) Senior Scientist, Yahoo Labs, Sunnyvale.
> Interests:
> > > > > > Computational advertising, machine learning, speech recognition,
> > > > > > data-driven analysis, large scale experimentation, big data,
> > > > stream/complex
> > > > > > event processing
> > > > > >
> > > > > > * Justin Thaler: (*) Assistant Professor, Department of Computer
> > > > Science,
> > > > > > Georgetown University, Washington D.C. Interests: algorithms and
> > > > > > computational complexity, complexity theory, quantum algorithms,
> > > > private
> > > > > > data analysis, and learning theory, developing efficient
> streaming
> > > and
> > > > > > sketching algorithms
> > > > > >
> > > > > > ==== Engineers That Love Science ====
> > > > > >
> > > > > > * Roman Leventov: Senior Software Engineer,  Metamarkets / Snap.
> > > > Interests:
> > > > > > design and implementation of data storing and data processing
> > > > (distributed)
> > > > > > systems, performance optimization, CPU performance, mechanical
> > > > sympathy,
> > > > > > JVM performance, API design, databases, (concurrent) data
> structures,
> > > > > > memory management, garbage collection algorithms, language
> design and
> > > > > > runtimes (their tradeoffs), distributed systems (cloud)
> efficiency,
> > > > Linux,
> > > > > > code quality, code transformation, pure functional programming
> > > models,
> > > > > > Haskell.
> > > > > >
> > > > > > * Lee Rhodes: (*) Distinguished Architect, lead developer and
> founder
> > > > of
> > > > > > the DataSketches project, Yahoo, Sunnyvale, California.
> Interests:
> > > > > > streaming algorithms, mathematics, computer science, high
> quality and
> > > > high
> > > > > > performance code for the analysis of massive data, bridging the
> > > divide
> > > > > > between theory and practice.
> > > > > >
> > > > > > * Alexander Saydakov: (*) Senior Software Engineer, Yahoo,
> Sunnyvale,
> > > > > > California. Interests: applied mathematics, computer science, big
> > > data,
> > > > > > distributed systems.
> > > > > >
> > > > > > === Introduction to Additional Interested Contributors ===
> > > > > >
> > > > > > These folks have been intermittently involved and contributed,
> but
> > > are
> > > > > > strong supporters of this project.
> > > > > >
> > > > > > * Frank Grimes: GitHub ID: frankgrimes97
> > > > > >
> > > > > > * Mina Ghashami: [mina.ghashami at gmail dot com] Ph.D. Computer
> > > > Science,
> > > > > > Univ of Utah. Interests: Machine Learning, Data Mining, matrix
> > > > > > approximation, streaming algorithms, randomized linear algebra.
> > > > > >
> > > > > > * Christopher Musco: [christopher.musco at gmail dot com] Ph.D.
> > > > Computer
> > > > > > Science, Research Instructor, Princeton University. Interests:
> > > > algorithmic
> > > > > > foundations of data science and machine learning, efficient
> methods
> > > for
> > > > > > processing and understanding large datasets, often working at the
> > > > > > intersection of theoretical computer science, numerical linear
> > > > algebra, and
> > > > > > optimization.
> > > > > >
> > > > > > * Graham Cormode: [g.cormode at warwick.ac dot uk] Ph.D.
> Computer
> > > > Science,
> > > > > > Professor, Warwick University, Warwick, England. Interests: all
> > > > aspects of
> > > > > > the "data lifecycle", from data collection and cleaning, through
> > > > mining and
> > > > > > analytics. (Professor Cormode is one of the world’s leading
> > > scientists
> > > > in
> > > > > > sketching algorithms)
> > > > > >
> > > > > > === Alignment ===
> > > > > >
> > > > > > The DataSketches library already provides integrations and
> example
> > > > code for
> > > > > > Apache Hive, Apache Pig, Apache Spark and is deeply integrated
> into
> > > > Apache
> > > > > > Druid.
> > > > > >
> > > > > > == Known Risks ==
> > > > > >
> > > > > > The following subsections are specific risks that have been
> > > identified
> > > > by
> > > > > > the ASF that need to be addressed.
> > > > > >
> > > > > > === Risk: Orphaned Products ===
> > > > > >
> > > > > > The DataSketches library is presently used by a number of
> > > > organizations,
> > > > > > from small startups to Fortune 100 companies, to construct
> production
> > > > > > pipelines that must process and analyze massive data. Yahoo has a
> > > > long-term
> > > > > > commitment to continue to advance the DataSketches library;
> moreover,
> > > > > > DataSketches is seeing increasing interest, development, and
> adoption
> > > > from
> > > > > > many diverse organizations from around the world. Due to its
> growing
> > > > > > adoption, we feel it is quite unlikely that this project would
> become
> > > > > > orphaned.
> > > > > >
> > > > > > === Risk: Inexperience with Open Source ===
> > > > > >
> > > > > > Yahoo believes strongly in open source and the exchange of
> > > information
> > > > to
> > > > > > advance new ideas and work. Examples of this commitment are
> active
> > > open
> > > > > > source projects such as those mentioned above. With
> DataSketches, we
> > > > have
> > > > > > been increasingly open and forward-looking; we have published a
> > > number
> > > > of
> > > > > > papers about breakthrough developments in the science of
> streaming
> > > > > > algorithms (mentioned above) that also reference the DataSketches
> > > > library.
> > > > > > Our submission to the Apache Software Foundation is a logical
> > > > extension of
> > > > > > our commitment to open source software.
> > > > > >
> > > > > > Key committers at Yahoo with strong open source backgrounds
> include
> > > > Aaron
> > > > > > Gresch, Alan Carroll, Alessandro Bellina, Anastasia Braginsky,
> > > Andrews
> > > > > > Sahaya Albert, Arun S A G, Atul Mohan, Brad McMillen, Bryan Call,
> > > Daryn
> > > > > > Sharp, Dav Glass, David Carlin, Derek Dagit, Eric Payne, Eshcar
> > > Hillel,
> > > > > > Ethan Li, Fei Deng, Francis Christopher Liu, Francisco
> > > Perez-Sorrosal,
> > > > Gil
> > > > > > > Yehuda, Govind Menon, Hang Yang, Jacob Estelle, Jai Asher, James
> > > > Penick,
> > > > > > Jason Kenny, Jay Pipes, Jim Rollenhagen, Joe Francis, Jon Eagles,
> > > > Kihwal
> > > > > > Lee, Kishorkumar Patil, Koji Noguchi, Kuhu Shukla, Michael
> Trelinski,
> > > > > > Mithun Radhakrishnan, Nathan Roberts, Ohad Shacham, Olga L.
> > > Natkovich,
> > > > > > Parth Kamlesh Gandhi, Rajan Dhabalia, Rohini Palaniswamy, Ruby
> Loo,
> > > > Ryan
> > > > > > Bridges, Sanket Chintapalli, Satish Subhashrao Saley, Shu Kit
> Chan,
> > > Sri
> > > > > > Harsha Mekala, Susan Hinrichs, Yonatan Gottesman, and many more.
> > > > > >
> > > > > > > All of our core developers are committed to learning about the
> Apache
> > > > process
> > > > > > and to give back to the community.
> > > > > >
> > > > > > === Risk: Homogeneous Developers ===
> > > > > >
> > > > > > The majority of committers in this proposal belong to Yahoo due
> to
> > > the
> > > > fact
> > > > > > that DataSketches has emerged from an internal Yahoo project.
> This
> > > > proposal
> > > > > > also includes developers and contributors from other companies,
> and
> > > > who are
> > > > > > actively involved with other Apache projects, such as Druid.  We
> > > > expect our
> > > > > > entry into incubation will allow us to expand the number of
> > > > individuals and
> > > > > > organizations participating in DataSketches development.
> > > > > >
> > > > > > === Risk: Reliance on Salaried Developers ===
> > > > > >
> > > > > > Because the DataSketches library originated within Yahoo, it has
> been
> > > > > > developed primarily by salaried Yahoo developers and we expect
> that
> > > to
> > > > > > continue to be the case near term. However, since we placed this
> > > > library
> > > > > > into open-source we have had a number of significant
> contributions
> > > from
> > > > > > engineers and scientists from outside of Yahoo. We expect our
> > > reliance
> > > > on
> > > > > > Yahoo salaried developers will decrease over time. Nonetheless,
> Yahoo
> > > > is
> > > > > > committed to continue its strong support of this important
> project.
> > > > > >
> > > > > > === Risk: Lack of Relationship to other Apache Products ===
> > > > > >
> > > > > > DataSketches already directly interoperates with or utilizes
> several
> > > > > > existing Apache projects.
> > > > > >
> > > > > > * Build
> > > > > >    * Apache Maven
> > > > > >
> > > > > > * Integrations and adaptors for the following projects naturally
> have
> > > > them
> > > > > > as dependencies
> > > > > >    * Apache Hive
> > > > > >    * Apache Pig
> > > > > >    * Apache Druid
> > > > > >    * Apache Spark
> > > > > >
> > > > > > * Additional dependencies for the above integrations and adaptors
> > > > include
> > > > > >    * Apache Hadoop
> > > > > >    * Apache Commons (Math)
> > > > > >
> > > > > > There is no other Apache project that we are aware of that
> duplicates
> > > > the
> > > > > > functionality of the DataSketches library.
> > > > > >
> > > > > > === Risk: An Excessive Fascination with the Apache Brand ===
> > > > > >
> > > > > > With this proposal we are not seeking attention or publicity.
> Rather,
> > > > we
> > > > > > firmly believe in the DataSketches library and concept and the
> > > ability
> > > > to
> > > > > > make the DataSketches library a powerful, yet simple-to-use
> toolkit
> > > for
> > > > > > data processing. While the DataSketches library has been open
> source,
> > > > we
> > > > > > believe putting code on GitHub can only go so far. We see the
> Apache
> > > > > > community, processes, and mission as critical for ensuring the
> > > > DataSketches
> > > > > > library is truly community-driven, positively impactful, and
> > > innovative
> > > > > > open source software. While Yahoo has taken a number of steps to
> > > > advance
> > > > > > its various open source projects, we believe the DataSketches
> library
> > > > > > project is a great fit for the Apache Software Foundation due to
> its
> > > > focus
> > > > > > on data processing and its relationships to existing ASF
> projects.
> > > > > >
> > > > > > === Risk: Cryptography ===
> > > > > >
> > > > > > DataSketches does not contain any cryptographic code and is not a
> > > > > > cryptographic product.
> > > > > >
> > > > > > == Documentation ==
> > > > > >
> > > > > > The following documentation is relevant to this proposal.
> Relevant
> > > > portions
> > > > > > of the documentation will be contributed to the Apache
> DataSketches
> > > > > > project.
> > > > > >
> > > > > > * DataSketches website: https://datasketches.github.io.
> > > > > >
> > > > > > * DataSketches website repository:
> > > > > > https://github.com/DataSketches/DataSketches.github.io
> > > > > >
> > > > > > > We will need an Apache website for this documentation similar to
> > > > > >
> > > > > > * https://datasketches.apache.org
> > > > > >
> > > > > > == Initial Source ==
> > > > > >
> > > > > > The initial source for DataSketches which we will submit to the
> > > Apache
> > > > > > Foundation will include a number of repositories which are
> currently
> > > > hosted
> > > > > > under the GitHub.com/datasketches organization:
> > > > > >
> > > > > > All github.com/datasketches repositories including:
> > > > > >
> > > > > > * Java
> > > > > >    * sketches-core: This repository has the core sketching
> classes,
> > > > which
> > > > > > are leveraged by some of the other repositories. This repository
> has
> > > no
> > > > > > external dependencies outside of the DataSketches/memory
> repository,
> > > > Java
> > > > > > and TestNG for unit tests. This code is versioned and the latest
> > > > release
> > > > > > can be obtained from Maven Central.
> > > > > >    * memory: Low level, high-performance memory data-structure
> > > > management
> > > > > > primarily for off-heap.
> > > > > >    * sketches-android: This is a new repository dedicated to
> sketches
> > > > > > designed to be run in a mobile client, such as a cell phone. It
> is
> > > > still in
> > > > > > development and should be considered experimental.
> > > > > >    * sketches-hive: This repository contains Hive UDFs and UDAFs
> for
> > > > use
> > > > > > within Hadoop grid environments. This code has dependencies on
> > > > > > sketches-core as well as Hadoop and Hive. Users of this code are
> > > > advised to
> > > > > > use Maven to bring in all the required dependencies. This code is
> > > > versioned
> > > > > > and the latest release can be obtained from Maven Central.
> > > > > >    * sketches-pig: This repository contains Pig User Defined
> > > Functions
> > > > > > (UDF) for use within Hadoop grid environments. This code has
> > > > dependencies
> > > > > > on sketches-core as well as Hadoop and Pig. Users of this code
> are
> > > > advised
> > > > > > to use Maven to bring in all the required dependencies. This
> code is
> > > > > > versioned and the latest release can be obtained from Maven
> Central.
> > > > > >    * sketches-vector: This is a new repository dedicated to
> sketches
> > > > for
> > > > > > vector and matrix operations. It is still somewhat experimental.
> > > > > >    * characterization: This relatively new repository is for code
> > > that
> > > > we
> > > > > > use to characterize the accuracy and speed performance of the
> > > sketches
> > > > in
> > > > > > the library and is constantly being updated. Examples of the job
> > > > command
> > > > > > files used for various tests can be found in the
> src/main/resources
> > > > > > > directory. Some of these tests can run for hours depending on their
> > > > > > configuration.
> > > > > >    * experimental: This repository is an experimental staging
> area
> > > for
> > > > code
> > > > > > that will eventually end up in another repository. This code is
> not
> > > > > > versioned and not registered with Maven Central.
> > > > > >    * sketches-misc: Demos and other code not related to
> production
> > > > > > deployment
> > > > > >
> > > > > > * C++ and Python
> > > > > >    * sketches-core-cpp: This is the C++/Python companion to the
> Java
> > > > > > sketches-core. These implementations are binary compatible with
> their
> > > > > > counterparts in Java. In other words, a sketch created and
> stored in
> > > > C++
> > > > > > > can be opened and read in Java and vice versa. This site also
> has our
> > > > > > Python adaptors that basically wrap the C++ implementations,
> making
> > > the
> > > > > > high performance C++ implementations available from Python.
> > > > > >    * sketches-postgres: This site provides the postgres-specific
> > > > adaptors
> > > > > > that wrap the C++ implementations making them available to the
> > > Postgres
> > > > > > database users.
> > > > > >    * characterization-cpp: This is the C++/Python companion to
> the
> > > Java
> > > > > > characterization repository.
> > > > > >    * experimental-cpp: This repository is an experimental staging
> > > area
> > > > for
> > > > > > C++ code that will eventually end up in another repository.
> > > > > >
> > > > > > * Command-Line Tools
> > > > > >    * sketches-cmd
> > > > > >    * homebrew-sketches
> > > > > >    * homebrew-sketches-cmd
> > > > > >
> > > > > > These projects have always been Apache 2.0 licensed. We intend to
> > > > bundle
> > > > > > all of these repositories since they are all complementary and
> should
> > > > be
> > > > > > maintained in one project. Prior to our submission, we will
> combine
> > > > all of
> > > > > > these projects into a new git repository.
> > > > > >
> > > > > > == Source and Intellectual Property Submission Plan ==
> > > > > >
> > > > > > Contributors to the DataSketches project have also signed the
> Yahoo
> > > > > > > Individual Contributor License Agreement
> > > > > > > (https://yahoocla.herokuapp.com/) in order to contribute to the project.
> > > > > >
> > > > > > With respect to trademark rights, Yahoo does not hold a
> trademark on
> > > > the
> > > > > > phrase “DataSketches.” Based on feedback and guidance we receive
> > > > during the
> > > > > > incubation process, we are open to renaming the project if
> necessary
> > > > for
> > > > > > trademark or other concerns, but we would prefer not to have to
> do
> > > > that.
> > > > > >
> > > > > > == External Dependencies ==
> > > > > >
> > > > > > All external dependencies are licensed under an Apache 2.0 or
> > > > > > Apache-compatible license. As we grow the DataSketches community
> we
> > > > will
> > > > > > > configure our build process to require and validate that all
> contributions
> > > > and
> > > > > > dependencies are licensed under the Apache 2.0 license or are
> under
> > > an
> > > > > > Apache-compatible license.
> > > > > >
> > > > > > == Required Resources ==
> > > > > >
> > > > > > === Mailing Lists ===
> > > > > >
> > > > > > We currently use a mix of mailing lists. We will migrate our
> existing
> > > > > > mailing lists to the following:
> > > > > >
> > > > > > * dev@datasketches.incubator.apache.org
> > > > > >
> > > > > > * user@datasketches.incubator.apache.org
> > > > > >
> > > > > > * private@datasketches.incubator.apache.org
> > > > > >
> > > > > > * commits@datasketches.incubator.apache.org
> > > > > >
> > > > > > === Source Control ===
> > > > > >
> > > > > > The DataSketches team currently uses Git and would like to
> continue
> > > to
> > > > do
> > > > > > so. We request a Git repository for DataSketches with mirroring
> to
> > > > GitHub
> > > > > > > enabled, similar to the following:
> > > > > >
> > > > > > * https://github.com/apache/incubator-datasketches.git
> > > > > >
> > > > > > === Issue Tracking ===
> > > > > >
> > > > > > We request the creation of an Apache-hosted JIRA. The
> DataSketches
> > > > project
> > > > > > is currently using the public GitHub issue tracker and the public
> > > > Google
> > > > > > Groups forum/sketches-user for issue tracking and discussions. We
> > > will
> > > > > > migrate and combine from these two sources to the Apache JIRA.
> > > > > >
> > > > > > Proposed Jira ID: DATASKETCHES
> > > > > >
> > > > > > == Initial Committers ==
> > > > > >
> > > > > > The following list of individuals have been extremely active in
> our
> > > > > > community and should have write (commit) permissions to the
> > > repository.
> > > > > >
> > > > > > * Eshcar Hillel                      [eshcar at verizonmedia dot
> com]
> > > > > >
> > > > > > * Kevin Lang                    [langk at verizonmedia dot com]
> > > > > >
> > > > > > * Roman Leventov              [roman.leventov at c.metamarkets
> dot
> > > com]
> > > > > >
> > > > > > * Edo Liberty                   [libertye at amazon dot com]
> > > > > >
> > > > > > * Jon Malkin                    [jmalkin at verizonmedia dot com]
> > > > > >
> > > > > > * Lee Rhodes                  [lrhodes at verizonmedia dot com] &
> > > > [leerho
> > > > > > at gmail dot com]
> > > > > >
> > > > > > * Alexander Saydakov         [saydakov at verizonmedia dot com]
> > > > > >
> > > > > > * Justin Thaler                 [justin.thaler at georgetown dot
> edu]
> > > > > >
> > > > > > == Affiliations ==
> > > > > >
> > > > > > The initial committers are from four organizations: Yahoo,
> Amazon,
> > > > > > Georgetown University, and Metamarkets/Snap.
> > > > > >
> > > > > > === Champion ===
> > > > > > (Recommended to me: )
> > > > > >
> > > > > > Liang Chen, Vice President of Apache CarbonData, [chenliang613 at
> > > > apache
> > > > > > dot org]
> > > > > > > Jean-Baptiste Onofré, [jb at nanthrax dot net]
> > > > > >
> > > > > > === Nominated Mentors ===
> > > > > > (Recommended to me: )
> > > > > >
> > > > > > Liang Chen, Vice President of Apache CarbonData, [chenliang613 at
> > > > apache
> > > > > > dot org]
> > > > > > Jean-Baptiste Onofré, jb at nanthrax dot net
> > > > > > Gil Yehuda, gyehuda at verizonmedia dot com
> > > > > >
> > > > > > === Sponsoring Entity ===
> > > > > >
> > > > > > * The Apache Incubator    **** This is our 1st choice ****
> > > > > >
> > > > > > * Apache Druid. The incubating Apache Druid project might also
> be a
> > > > logical
> > > > > > sponsor. However, DataSketches has applications in many areas of
> > > > computing
> > > > > > outside of Druid so our preference and recommendation is that
> > > > DataSketches
> > > > > > would ultimately be a top-level Apache project.
> > > > > >
> > > > > > ________________
> > > > > > [1] In 2017 Verizon acquired Yahoo and merged it with previously
> > > > acquired
> > > > > > AOL. The merged entity was originally called Oath, Inc., but has
> > > > recently
> > > > > > been renamed Verizon Media, Inc., a wholly-owned subsidiary of
> > > Verizon,
> > > > > > Inc.  Since Yahoo is the more recognized name, references in this
> > > > document
> > > > > > > to Yahoo are also references to Verizon Media, Inc.
> > > > > >
> > > > > > On Fri, Feb 22, 2019 at 9:35 PM Kenneth Knowles <kenn@apache.org
> >
> > > > wrote:
> > > > > >
> > > > > > > The subject line has me interested already. Follow examples
> like
> > > this
> > > > > > > maybe?
> > > > > > >
> > > > > > > 1.
> > > > > > >
> > > > > > >
> > > > > >
> > > >
> > >
> https://lists.apache.org/thread.html/a5db74cc9e5ae89b3bfa5f4b07bfcc18dae84b7098232fb897cd47b7@%3Cgeneral.incubator.apache.org%3E
> > > > > > > 2.
> > > > > > >
> > > > > > >
> > > > > >
> > > >
> > >
> https://lists.apache.org/thread.html/5a7f6a218b11a1cac61fbd53f4c995fd7716f8ad3751cf9f171ebd57@%3Cgeneral.incubator.apache.org%3E
> > > > > > >
> > > > > > > Kenn
> > > > > > >
> > > > > > > On Fri, Feb 22, 2019 at 8:05 PM leerho <le...@gmail.com>
> wrote:
> > > > > > >
> > > > > > > > I'll try again ... :)
> > > > > > > >
> > > > > > > > On Fri, Feb 22, 2019 at 8:00 PM Ted Dunning <
> > > ted.dunning@gmail.com
> > > > >
> > > > > > > wrote:
> > > > > > > >
> > > > > > > >> It didn't make it again
> > > > > > > >>
> > > > > > > >> On Fri, Feb 22, 2019, 8:35 PM leerho <le...@gmail.com>
> wrote:
> > > > > > > >>
> > > > > > > >> > I'm not sure the attached document made it through.
> > > > > > > >> >
> > > > > > > >> > On Fri, Feb 22, 2019 at 7:28 PM leerho <le...@gmail.com>
> > > > wrote:
> > > > > > > >> >
> > > > > > > >> > >
> > > > > > > >> > >
> > > > > > > >> >
> > > > > > > >>
> > > > > > > >
> > > > > > > >
> > > > ---------------------------------------------------------------------
> > > > > > > > To unsubscribe, e-mail:
> general-unsubscribe@incubator.apache.org
> > > > > > > > For additional commands, e-mail:
> > > general-help@incubator.apache.org
> > > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > >
> > --
> > From my cell phone.
> >
>
>
>

DataSketches Proposal - Google Docs Link

Posted by le...@gmail.com, le...@gmail.com.

Try this link: https://docs.google.com/document/d/19JKevzFQNcaLA51LFLUlP1hzdFDW7oDJrJO8N6weDv8/edit?usp=sharing


On 2019/02/25 05:55:50, leerho <le...@gmail.com> wrote: 
> Yes I will try that tomorrow.
> 
> On Sun, Feb 24, 2019 at 7:34 PM Kenneth Knowles <ke...@apache.org> wrote:
> 
> > Can you share the Google doc with the proposal? Per Ted's advice, we can
> > iterate quickly there and move it to the wiki when it becomes a bit more
> > stable.
> >
> > Kenn
> >
> > On Fri, Feb 22, 2019 at 10:21 PM leerho@gmail.com <le...@gmail.com>
> > wrote:
> >
> > > Thanks for the offer.  I am a neophyte at this process and email app!   I
> > > could use a lot of help getting this off the ground!  Also, I'm not sure
> > > that Mr. Chen and Mr. Onofré have fully accepted taking this on :)
> > >
> > > Lee.
> > >
> > > On 2019/02/23 06:03:58, Kenneth Knowles <ke...@apache.org> wrote:
> > > > Nice.
> > > >
> > > > I would very much like to help mentor this project, though you already
> > > have
> > > > a couple good ones.
> > > >
> > > > I concur with incubator as sponsoring entity.
> > > >
> > > > Kenn (VP Apache Beam)
> > > >
> > > > On Fri, Feb 22, 2019 at 9:45 PM leerho <le...@gmail.com> wrote:
> > > >
> > > > > I didn't realize that this mail list does not accept PDF files,
> > > apparently
> > > > > only text.  So let me try one more time ... :)  Please let me know if
> > > > > this works!
> > > > >
> > > > >
> > > > > = Apache DataSketches Proposal[1] =
> > > > >
> > > > > == Abstract ==
> > > > >
> > > > > DataSketches.GitHub.io is an open source, high-performance library
> > of
> > > > > stochastic streaming algorithms commonly called "sketches" in the
> > data
> > > > > sciences. Sketches are small, stateful programs that process massive
> > > data
> > > > > as a stream and can provide approximate answers, with mathematical
> > > > > guarantees, to computationally difficult queries orders-of-magnitude
> > > faster
> > > > > than traditional, exact methods.
> > > > >
> > > > > This proposal is to move DataSketches to the Apache Software
> > > > > Foundation(ASF) transferring ownership of its copyright intellectual
> > > > > property to the ASF.  Thereafter, DataSketches would be officially
> > > known as
> > > > > Apache DataSketches and its evolution and governance would come under
> > > the
> > > > > rules and guidance of the ASF.
> > > > >
> > > > > == Introduction ==
> > > > >
> > > > > The DataSketches library contains carefully crafted implementations
> > of
> > > > > sketch algorithms that meet rigorous standards of quality and
> > > performance
> > > > > and provide capabilities required for large-scale production systems
> > > that
> > > > > must process and analyze massive data. The DataSketches core
> > > repository is
> > > > > written in Java with a parallel core repository written in C++ that
> > > > > includes Python wrappers. The DataSketches library also includes
> > > special
> > > > > repositories for extending the core library for Apache Hive and
> > Apache
> > > Pig.
> > > > > The sketches developed in the different languages share a common
> > binary
> > > > > storage format so that sketches created and stored in Java, for
> > > example,
> > > > > > can be fully used in C++, and vice versa.  Because the stored sketch
> > > > > "images" are just a "blob" of bytes (similar to picture images), they
> > > can
> > > > > be shared across many different systems, languages and platforms.
> > > > >
> > > > > The DataSketches documentation website,
> > > > > > https://datasketches.github.io,
> > > > > includes general tutorials, a comprehensive research section with
> > > > > references to relevant academic papers, extensive examples for using
> > > the
> > > > > core library directly as well as examples for accessing the library
> > in
> > > > > Hive, Pig, and Apache Spark.
> > > > >
> > > > > The DataSketches library also includes a characterization repository
> > > for
> > > > > long running test programs that are used for studying accuracy and
> > > > > performance of these sketches over wide ranges of input variables.
> > The
> > > data
> > > > > produced by these programs is used for generating the many
> > performance
> > > > > plots contained in the documentation website and for academic
> > > > > publications.
> > > > >
> > > > > The code repositories used for production are versioned and published
> > > to
> > > > > > Maven Central at periodic intervals as the library evolves.
> > > > >
> > > > > The DataSketches library also includes several experimental
> > > repositories
> > > > > for use-cases outside the large-scale systems environments, such as
> > > > > sketches for mobile, IoT devices (Android), command-line access of
> > the
> > > > > sketch library, and an experimental repository for vector-based
> > > sketches
> > > > > that performs approximate Singular Value Decomposition (SVD) analysis
> > > that
> > > > > could potentially be used in Machine Learning (ML) applications.
> > > > >
> > > > > == Background ==
> > > > >
> > > > > > The DataSketches library was started in 2012 as an internal Yahoo
> > project
> > > to
> > > > > dramatically reduce time and resources required for distinct (unique)
> > > > > counting.  An extensive search on the Internet at the time yielded a
> > > number
> > > > > of theoretical papers on stochastic streaming algorithms with
> > > pseudocode
> > > > > examples, but we did not find any usable open-source code of the
> > > quality we
> > > > > felt we needed for our internal production systems.  So we started a
> > > small
> > > > > project (one person) to develop our own sketches working directly
> > from
> > > > > published theoretical papers.
> > > > >
> > > > > The DataSketches library was designed from the start with the
> > > objective of
> > > > > making these algorithms, usually only described in theoretical
> > papers,
> > > > > easily accessible to systems developers for use in our internal
> > > production
> > > > > systems. By necessity, the code had to be of the highest quality and
> > > > > thoroughly tested. The wide variety of our internal production
> > systems
> > > > > drove the requirement that the sketch implementations had to have an
> > > > > absolute minimum of external, run-time dependencies in order to
> > > simplify
> > > > > integration and troubleshooting.
> > > > >
> > > > > Our internal experiments demonstrated dramatic positive impact on the
> > > > > performance of our systems.  As a result, the DataSketches library
> > > quickly
> > > > > evolved to include different types of sketches for different types of
> > > > > > queries, such as frequent-items (a.k.a. heavy-hitters) algorithms,
> > > > > quantile/histogram algorithms, and weighted and unweighted sampling
> > > > > algorithms.
> > > > >
> > > > > We quickly discovered that developing these sketch algorithms to be
> > > truly
> > > > > robust in production environments is quite difficult and requires
> > deep
> > > > > understanding of the underlying mathematics and statistics as well as
> > > > > extensive experience in developing high quality code for 24/7
> > > production
> > > > > systems. This is a difficult combination of skills for any one
> > > organization
> > > > > to collect and maintain over time. It became clear that this
> > technology
> > > > > needed a community larger than Yahoo to evolve.  In November, 2015,
> > > this
> > > > > factor, along with Yahoo’s strong experience and support of open
> > > source,
> > > > > led to the decision to open source this technology under an Apache
> > 2.0
> > > > > license on GitHub. Since that time our community has expanded
> > > considerably
> > > > > > and the key contributors to this effort include leading research
> > > > > scientists from a number of universities as well as practitioners and
> > > > > researchers from a number of major corporations. The core of this
> > > group is
> > > > > very active as we meet weekly to discuss research directions and
> > > > > engineering priorities.
> > > > >
> > > > > It is important to note that our internal systems at Yahoo use the
> > > current
> > > > > public GitHub open source DataSketches library and not an internal
> > > version
> > > > > of the code.
> > > > >
> > > > > The close collaboration of scientific research and engineering
> > > development
> > > > > experience with actual massive-data processing systems has also
> > > produced
> > > > > new research publications in the field of stochastic streaming
> > > algorithms,
> > > > > for example:
> > > > >
> > > > > * Daniel Anderson, Pryce Bevan, Kevin J. Lang, Edo Liberty, Lee
> > > Rhodes, and
> > > > > Justin Thaler. A high-performance algorithm for identifying frequent
> > > items
> > > > > in data streams. In ACM IMC 2017.
> > > > >
> > > > > * Anirban Dasgupta, Kevin J. Lang, Lee Rhodes, and Justin Thaler. A
> > > > > framework for estimating stream expression cardinalities. In
> > > EDBT/ICDT
> > > > > > Proceedings ‘16, pages 6:1–6:17, 2016.
> > > > >
> > > > > * Mina Ghashami, Edo Liberty, Jeff M. Phillips. Efficient Frequent
> > > > > Directions Algorithm for Sparse Matrices. In ACM SIGKDD Proceedings
> > > ‘16,
> > > > > pages 845-854, 2016.
> > > > >
> > > > > * Zohar S. Karnin, Kevin J. Lang, and Edo Liberty. Optimal quantile
> > > > > approximation in streams. In IEEE FOCS Proceedings ‘16, pages 71–78,
> > > 2016.
> > > > >
> > > > > * Kevin J Lang. Back to the future: an even more nearly optimal
> > > cardinality
> > > > > estimation algorithm. arXiv preprint
> > https://arxiv.org/abs/1708.06839,
> > > > > 2017.
> > > > >
> > > > > * Edo Liberty. Simple and deterministic matrix sketching. In ACM KDD
> > > > > > Proceedings ‘13, pages 581–588, 2013.
> > > > >
> > > > > * Edo Liberty, Michael Mitzenmacher, Justin Thaler, and Jonathan
> > > Ullman.
> > > > > Space lower bounds for itemset frequency sketches. In ACM PODS
> > > Proceedings
> > > > > ‘16, pages 441–454, 2016.
> > > > >
> > > > > * Michael Mitzenmacher, Thomas Steinke, and Justin Thaler.
> > Hierarchical
> > > > > heavy hitters with the space saving algorithm. In SIAM ALENEX
> > > Proceedings
> > > > > ‘12, pages 160–174, 2012.
> > > > >
> > > > > == The Rationale for Sketches ==
> > > > >
> > > > > In the analysis of big data there are often problem queries that
> > don’t
> > > > > scale because they require huge compute resources and time to
> > generate
> > > > > exact results. Examples include count distinct, quantiles, most
> > > frequent
> > > > > items, joins, matrix computations, and graph analysis.
> > > > >
> > > > > If we can loosen the requirement of “exact” results from our queries
> > > and be
> > > > > satisfied with approximate results, within some well understood
> > bounds
> > > of
> > > > > error, there is an entire branch of mathematics and data science that
> > > has
> > > > > evolved around developing algorithms that can produce approximate
> > > results
> > > > > with mathematically well-defined error properties.
> > > > >
> > > > > > Adding the requirements that these algorithms must be small
> > > > > (compared to the size of the input data), sublinear (the size of the
> > > sketch
> > > > > must grow at a slower rate than the size of the input stream),
> > > streaming
> > > > > (they can only touch each data item once), and mergeable (suitable
> > for
> > > > > distributed processing), defines a class of algorithms that can be
> > > > > described as small, stochastic, streaming, sublinear mergeable
> > > algorithms,
> > > > > commonly called sketches (they also have other names, but we will use
> > > the
> > > > > term sketches from here on).
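
To make the four properties above concrete (small, sublinear, streaming, mergeable), here is a minimal Python sketch of a K-Minimum-Values (KMV) distinct-count estimator. It is a toy written for illustration only and is not code from the DataSketches library; the names KmvSketch and hash_to_unit and the default k=256 are assumptions made for this example.

```python
import hashlib


def hash_to_unit(item: str) -> float:
    """Map an item to a stable, approximately uniform point on the unit interval."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


class KmvSketch:
    """Toy K-Minimum-Values distinct-count sketch (illustrative, hypothetical names)."""

    def __init__(self, k: int = 256) -> None:
        self.k = k
        self.mins = set()  # at most k hash values: the smallest seen so far

    def update(self, item: str) -> None:
        """Streaming: each item is touched exactly once."""
        self._offer(hash_to_unit(item))

    def _offer(self, h: float) -> None:
        if h in self.mins:
            return  # duplicates never change the state
        if len(self.mins) < self.k:
            self.mins.add(h)
            return
        largest = max(self.mins)
        if h < largest:
            self.mins.remove(largest)
            self.mins.add(h)

    def merge(self, other: "KmvSketch") -> "KmvSketch":
        """Mergeable: combine two sketches built with the same k."""
        merged = KmvSketch(self.k)
        for h in self.mins | other.mins:
            merged._offer(h)
        return merged

    def estimate(self) -> float:
        if len(self.mins) < self.k:
            return float(len(self.mins))  # fewer than k distinct items: exact
        return (self.k - 1) / max(self.mins)


if __name__ == "__main__":
    left, right = KmvSketch(), KmvSketch()
    for i in range(100_000):
        # Two partitions of one stream, as in distributed processing.
        (left if i % 2 == 0 else right).update(f"user-{i}")
    print("estimated distinct count:", round(left.merge(right).estimate()))
```

The state never exceeds k hash values no matter how long the stream is, and merging two partial sketches gives the same state as feeding the whole stream to one sketch. A production sketch adds much more (compact serialization, error bounds, set operations), which this toy deliberately leaves out.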
> > > > >
> > > > > To be truly streaming and be able to process data in a single pass,
> > > > > sketches must make absolute minimum assumptions about the input
> > stream.
> > > > > This is critically important, as there is no “second chance” to
> > > process the
> > > > > data.
> > > > >
> > > > > For example, sketches should not make assumptions about the order of
> > > stream
> > > > > items, the stream length, the dynamic range of values, or the
> > > distribution
> > > > > of item occurrence frequencies. Sketches should be tolerant of NaNs,
> > > Nulls
> > > > > and empty objects. About the only thing that the sketch needs to know
> > > about
> > > > > the stream is how to extract items from it and what type the item is,
> > > e.g.,
> > > > > is it a numeric value or a string.
> > > > >
> > > > > As far as the sketch is concerned, the input stream is a sequence of
> > > items
> > > > > in some unknown random order with unknown random values.
> > > > >
> > > > > > The sketch is essentially a complex state machine and, combined with
> > > the
> > > > > > random input stream, defines a stochastic process. We then apply
> > > > > probabilistic methods to interpret the states of the stochastic
> > > process in
> > > > > order to extract useful information about the input stream itself.
> > The
> > > > > resulting information will be approximate, but we also use additional
> > > > > probabilistic methods to extract an estimate of the likely
> > probability
> > > > > distribution of error.
> > > > >
> > > > > > There is a significant scientific contribution here in defining
> > > the
> > > > > state machine, understanding the resulting stochastic process,
> > > developing
> > > > > > the probabilistic methods, and proving mathematically that it all
> > > works!
> > > > > This is why the scientific contributors to this project are a
> > critical
> > > and
> > > > > strategic component to our success.  The development engineers
> > > translate
> > > > > the concepts of the proposed state machine and probabilistic methods
> > > into
> > > > > production-quality code. Even more important, they work closely with
> > > the
> > > > > scientists, feeding back system and user requirements, which leads
> > not
> > > only
> > > > > to superior product design, but to new science as well.  A number of
> > > > > > scientific papers our members have published (see above) are a direct
> > > result
> > > > > of this close collaboration.
> > > > >
> > > > > Because sketches are small they can be processed extremely fast,
> > often
> > > many
> > > > > orders-of-magnitude faster than traditional exact computations. For
> > > > > interactive queries there may not be other viable alternatives, and
> > in
> > > the
> > > > > case of real-time analysis, sketches are the only known solution.
> > > > >
> > > > > For any system that needs to extract useful information from massive
> > > > data,
> > > > > sketches are essential tools that should be tightly integrated into
> > the
> > > > > system’s analysis capabilities. This technology has helped Yahoo
> > > > > successfully reduce data processing times from days to hours or
> > > minutes on
> > > > > a number of its internal platforms and has enabled subsecond queries
> > on
> > > > > real-time platforms that would have been infeasible without sketches.
> > > > > >
> > > > > > == The Rationale for Apache DataSketches ==
> > > > > >
> > > > > Other open source implementations of sketch algorithms can be found
> > on
> > > the
> > > > > Internet. However, we have not yet found any open source
> > > implementations
> > > > > that are as comprehensive, engineered with the quality required for
> > > > > production systems, and with usable and guaranteed error properties.
> > > Large
> > > > > Internet companies, such as Google and Facebook, have published
> > papers
> > > on
> > > > > > sketching; however, their implementations of their published
> > > algorithms are
> > > > > proprietary and not available as open source.
> > > > >
> > > > > The DataSketches library already provides integrations with a number
> > of
> > > > > major Apache data processing platforms such as Apache Hive, Apache
> > Pig,
> > > > > Apache Spark and Apache Druid, and is also integrated with a number
> > of
> > > > > other open source data processing platforms such as Splice Machine,
> > > GCHQ
> > > > > Gaffer and PostgreSQL.
> > > > >
> > > > > We believe that having DataSketches as an Apache project will provide
> > > an
> > > > > immediate, worthwhile, and substantial contribution to the open
> > source
> > > > > community, will have a better opportunity to provide a meaningful
> > > > > contribution to both the science and engineering of sketching
> > > algorithms,
> > > > > and integrate with other Apache projects.  In addition, this is a
> > > > > significant opportunity for Apache to be the "go-to" destination for
> > > users
> > > > > that want to leverage this exciting technology.
> > > > >
> > > > > == Initial Goals ==
> > > > >
> > > > > We are breaking our initial goals into short-term (2-6 months) and
> > > > > > intermediate to long-term (6 months to 2 years):
> > > > >
> > > > > Our short-term goals include:
> > > > >
> > > > > * Understanding and adapting to the Apache development process and
> > > > > structures.
> > > > >
> > > > > > * Start refactoring the codebase and move the code of the various
> > > > > > DataSketches repositories to an Apache Git repository.
> > > > >
> > > > > * Continue development of new features, functions, and fixes.
> > > > >
> > > > > * Specific sub-projects (e.g., C++ and Python) will continue to be
> > > > > developed and expanded.
> > > > >
> > > > >
> > > > > The intermediate to long term goals include:
> > > > >
> > > > > * Completing the design and implementation of the C++ sketches to
> > > > > complement what is already available in Java, and the Python wrappers
> > > of
> > > > > those C++ sketches.
> > > > >
> > > > > * Expanding the C++ build framework to include Windows and the
> > popular
> > > > > Linux variants.
> > > > >
> > > > > * Continued engagement with the scientific research community on the
> > > > > development of new algorithms for computationally difficult problems
> > > that
> > > > > heretofore have not had a sketching solution.
> > > > >
> > > > > == Current Status ==
> > > > >
> > > > > The DataSketches GitHub project has been quite successful.  As of
> > this
> > > > > writing (Feb, 2019) the number of downloads measured by the Nexus
> > > > > Repository Manager at https://oss.sonatype.org has grown by nearly a
> > > > > factor
> > > > > of 10 over the past year to about 55 thousand per month. The
> > > > > DataSketches/sketches-core repository has about 560 stars and 141
> > > forks,
> > > > > which is pretty good for a highly specialized library.
> > > > >
> > > > > === Development Practices ===
> > > > >
> > > > > ==== Source Control ====
> > > > >
> > > > > All of our developers have extensive experience with Git version
> > > control
> > > > > and follow accepted practices for use of Pull Requests (PRs), code
> > > reviews
> > > > > and commits to master, for example.
> > > > >
> > > > > ==== Testing ====
> > > > >
> > > > > > Sketches, by their nature, are probabilistic programs and don’t
> > > necessarily
> > > > > behave deterministically.  For some of the sketches we intentionally
> > > insert
> > > > > random noise into the code as this gives us the mathematical
> > properties
> > > > > that we need to guarantee accuracy.  This can make the behavior of
> > > these
> > > > > algorithms quite unintuitive and provides significant challenges to
> > the
> > > > > developer who wishes to test these algorithms for correctness. As a
> > > result,
> > > > > our testing strategy includes two major components: unit tests, and
> > > > > characterization tests.
> > > > >
> > > > > ===== Unit Testing =====
> > > > >
> > > > > Our unit tests are primarily quick tests to make sure that we
> > exercise
> > > all
> > > > > critical paths in the code and that key branches are executed
> > > correctly. It
> > > > > is important that they execute relatively fast as they are generally
> > > run on
> > > > > every code build. The sketches-core repository alone has about 22
> > > thousand
> > > > > statements, over 1300 unit tests and code coverage of about 98.2% as
> > > > > measured by Atlassian/Clover.  It is our goal for all of our code
> > > > > repositories that are used in production that they have code coverage
> > > > > greater than 90%.
> > > > >
> > > > > ===== Characterization Testing =====
> > > > >
> > > > > In order to test the probabilistic methods that are used to interpret
> > > the
> > > > > stochastic behaviors of our sketches we have a separate
> > > characterization
> > > > > repository that is dedicated to this.  To measure accuracy, for
> > > example,
> > > > > requires running thousands of trials at each of many different points
> > > along
> > > > > the domain axis. Each trial compares its estimated results against a
> > > known
> > > > > exact result producing an error for that trial.  These error
> > > measurements
> > > > > are then fed into our Quantiles sketch to capture the actual
> > > distribution
> > > > > of error at that point along the axis. We then select quantile
> > contours
> > > > > across all the distributions at points along the axis.  These
> > contours
> > > can
> > > > > then be plotted to reveal the shape of the actual error distribution.
> > > These
> > > > > distributions are not at all Gaussian, in fact they can be quite
> > > complex.
> > > > > Nonetheless, these distributions are then checked against our
> > > statistical
> > > > > guarantees inherent to the specific sketch algorithm and its
> > > parameters.
> > > > > There are many examples of these characterization error distributions
> > > on
> > > > > our website. The runtimes of these tests can be very long and can
> > range
> > > > > from many minutes to hours, and some can run for days.  Currently, we
> > > have
> > > > > separate characterization repositories for Java and C++ / Python.
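
The characterization procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's characterization code: it uses a toy K-Minimum-Values estimator as the sketch under test, random floats as a stand-in for hashed distinct items, and exact sorting in place of the Quantiles sketch. The names kmv_estimate and error_quantiles and all parameter values are hypothetical.

```python
import heapq
import random


def kmv_estimate(hashes, k):
    """Estimate a distinct count from uniform hash values using the k-th minimum."""
    smallest = heapq.nsmallest(k, hashes)
    if len(smallest) < k:
        return float(len(smallest))  # fewer than k items: the count is exact
    return (k - 1) / smallest[-1]


def error_quantiles(n, k, trials, fractions, seed=1):
    """Run many trials at stream length n and return quantiles of relative error."""
    rng = random.Random(seed)
    errors = []
    for _ in range(trials):
        # Stand-in for hashing n distinct items: n independent uniform values.
        hashes = [rng.random() for _ in range(n)]
        # Compare the estimate against the known exact result, n.
        errors.append(kmv_estimate(hashes, k) / n - 1.0)
    errors.sort()  # exact sort here; the text above uses a Quantiles sketch instead
    return [errors[min(trials - 1, int(f * trials))] for f in fractions]


if __name__ == "__main__":
    fractions = (0.05, 0.5, 0.95)
    # Each n is one point along the domain axis; each row is one set of contours.
    for n in (1_000, 4_000, 16_000):
        low, median, high = error_quantiles(n, k=64, trials=200, fractions=fractions)
        print(f"n={n:>6}  q05={low:+.3f}  median={median:+.3f}  q95={high:+.3f}")
```

Plotting the three columns against n gives the kind of quantile contours described above. The real tests use far more trials and many more points along the axis, which is why they can run for hours.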
> > > > >
> > > > > It is our goal that we perform this characterization analysis for all
> > > of
> > > > > our sketches.  By definition, the code that runs these
> > characterization
> > > > > tests is open-source so others can run these tests as well.  We do
> > not
> > > have
> > > > > formal releases of this code (because it is not production code) and
> > > it is
> > > > > not published to Maven Central.
> > > > >
> > > > > === Meritocracy ===
> > > > >
> > > > > DataSketches was initially developed based on requirements within
> > > Yahoo. As
> > > > > a project on GitHub, DataSketches has received contributions from
> > > numerous
> > > > > individual developers from around the world, dedicated research work
> > > from
> > > > > senior scientists at Amazon and Visa, and academic researchers from
> > > > > Georgetown University, Princeton, and MIT.
> > > > >
> > > > > As a project under incubation, we are committed to expanding our
> > > effort to
> > > > > build an environment which supports a meritocracy. We are focused on
> > > > > engaging the community and other related projects for support and
> > > > > > contributions. Moreover, we are committed to ensuring that contributors and
> > > > > committers to DataSketches come from a broad mix of organizations
> > > through a
> > > > > merit-based decision process during incubation. We believe strongly
> > in
> > > the
> > > > > DataSketches premise that fulfills the concept of a well engineered
> > and
> > > > > scientifically rigorous library that implements these powerful
> > > algorithms
> > > > > and are committed to growing an inclusive community of DataSketches
> > > > > contributors and users.
> > > > >
> > > > > === Community ===
> > > > >
> > > > > Yahoo has a long history and active engagement in the Open Source
> > > > > community. Major projects include: Vespa.ai, Bullet, Moloch,
> > Panoptes,
> > > > > Screwdriver.cd, Athenz, HaloDB, Maha, Mendel, TensorFlowOnSpark,
> > > gifshot,
> > > > > fluxible, as well as the creation, contribution and incubation of
> > many
> > > > > Apache projects such as Apache Hadoop, Pig, Bookkeeper, Oozie,
> > > Zookeeper,
> > > > > Omid, Pulsar, Traffic Server, Storm, Druid, and many more.
> > > > >
> > > > > > Every day, DataSketches is actively used by organizations and
> > > > > institutions around the world for batch and stream processing of
> > data.
> > > We
> > > > > believe acceptance will allow us to consolidate existing
> > > > > DataSketches-related work, grow the DataSketches community, and
> > deepen
> > > > > connections between DataSketches and other open source projects.
> > > > >
> > > > > === Introduction to the Core Developers & Contributors ===
> > > > >
> > > > > The core developers and contributors for DataSketches are from
> > diverse
> > > > > backgrounds, but primarily are scientists that love engineering and
> > > > > engineers that love science. A large part of the value we bring comes
> > > from
> > > > > this synthesis.  These individuals have already contributed
> > > substantially
> > > > > to the code, algorithms, and/or mathematical proofs that form the
> > > basis of
> > > > > the library.
> > > > >
> > > > > > This core group also forms the Initial Committers with write
> > > > permissions to
> > > > > > the repository. Those marked with (*) meet weekly to plan the
> > research
> > > and
> > > > > engineering direction of the project.
> > > > >
> > > > > ==== Scientists That Love Engineering ====
> > > > >
> > > > > * Eshcar Hillel: Senior Research Scientist, Yahoo Labs, Israel.
> > > Interests:
> > > > > distributed systems, scalable systems and platforms for big data
> > > > > processing, concurrent algorithms and data structures,
> > > > >
> > > > > * Kevin Lang: (*) Distinguished Research Scientist, Yahoo Labs,
> > > Sunnyvale,
> > > > > California. Interests: algorithms, theoretical and applied
> > mathematics,
> > > > > encoding and compression theory, theoretical and applied performance
> > > > > optimization.
> > > > >
> > > > > * Edo Liberty: (*) Director of Research, Head of Amazon AI Labs, Palo
> > > Alto,
> > > > > California. Manages the algorithms group at Amazon AI. We build
> > > scalable
> > > > > machine learning systems and algorithms which are used both
> > internally
> > > and
> > > > > externally by customers of SageMaker, AWS's flagship machine learning
> > > > > platform.
> > > > >
> > > > > * Jon Malkin: (*) Senior Scientist, Yahoo Labs, Sunnyvale. Interests:
> > > > > Computational advertising, machine learning, speech recognition,
> > > > > data-driven analysis, large scale experimentation, big data,
> > > stream/complex
> > > > > > event processing.
> > > > >
> > > > > * Justin Thaler: (*) Assistant Professor, Department of Computer
> > > Science,
> > > > > Georgetown University, Washington D.C. Interests: algorithms and
> > > > > computational complexity, complexity theory, quantum algorithms,
> > > private
> > > > > data analysis, and learning theory, developing efficient streaming
> > and
> > > > > > sketching algorithms.
> > > > >
> > > > > ==== Engineers That Love Science ====
> > > > >
> > > > > * Roman Leventov: Senior Software Engineer,  Metamarkets / Snap.
> > > Interests:
> > > > > design and implementation of data storing and data processing
> > > (distributed)
> > > > > systems, performance optimization, CPU performance, mechanical
> > > sympathy,
> > > > > JVM performance, API design, databases, (concurrent) data structures,
> > > > > memory management, garbage collection algorithms, language design and
> > > > > runtimes (their tradeoffs), distributed systems (cloud) efficiency,
> > > Linux,
> > > > > code quality, code transformation, pure functional programming
> > models,
> > > > > Haskell.
> > > > >
> > > > > * Lee Rhodes: (*) Distinguished Architect, lead developer and founder
> > > of
> > > > > the DataSketches project, Yahoo, Sunnyvale, California.  Interests:
> > > > > streaming algorithms, mathematics, computer science, high quality and
> > > high
> > > > > performance code for the analysis of massive data, bridging the
> > divide
> > > > > between theory and practice.
> > > > >
> > > > > * Alexander Saydakov: (*) Senior Software Engineer, Yahoo, Sunnyvale,
> > > > > California. Interests: applied mathematics, computer science, big
> > data,
> > > > > distributed systems.
> > > > >
> > > > > === Introduction to Additional Interested Contributors ===
> > > > >
> > > > > These folks have been intermittently involved and contributed, but
> > are
> > > > > strong supporters of this project.
> > > > >
> > > > > * Frank Grimes: GitHub ID: frankgrimes97
> > > > >
> > > > > * Mina Ghashami: [mina.ghashami at gmail dot com] Ph.D. Computer
> > > Science,
> > > > > Univ of Utah. Interests: Machine Learning, Data Mining, matrix
> > > > > approximation, streaming algorithms, randomized linear algebra.
> > > > >
> > > > > * Christopher Musco: [christopher.musco at gmail dot com] Ph.D.
> > > Computer
> > > > > Science, Research Instructor, Princeton University. Interests:
> > > algorithmic
> > > > > foundations of data science and machine learning, efficient methods
> > for
> > > > > processing and understanding large datasets, often working at the
> > > > > intersection of theoretical computer science, numerical linear
> > > algebra, and
> > > > > optimization.
> > > > >
> > > > > * Graham Cormode: [g.cormode at warwick.ac dot uk] Ph.D. Computer
> > > Science,
> > > > > Professor, Warwick University, Warwick, England. Interests: all
> > > aspects of
> > > > > the "data lifecycle", from data collection and cleaning, through
> > > mining and
> > > > > analytics. (Professor Cormode is one of the world’s leading
> > scientists
> > > in
> > > > > sketching algorithms)
> > > > >
> > > > > === Alignment ===
> > > > >
> > > > > The DataSketches library already provides integrations and example
> > > code for
> > > > > Apache Hive, Apache Pig, Apache Spark and is deeply integrated into
> > > Apache
> > > > > Druid.
> > > > >
> > > > > == Known Risks ==
> > > > >
> > > > > The following subsections are specific risks that have been
> > identified
> > > by
> > > > > the ASF that need to be addressed.
> > > > >
> > > > > === Risk: Orphaned Products ===
> > > > >
> > > > > The DataSketches library is presently used by a number of
> > > organizations,
> > > > > from small startups to Fortune 100 companies, to construct production
> > > > > pipelines that must process and analyze massive data. Yahoo has a
> > > long-term
> > > > > commitment to continue to advance the DataSketches library; moreover,
> > > > > DataSketches is seeing increasing interest, development, and adoption
> > > from
> > > > > many diverse organizations from around the world. Due to its growing
> > > > > adoption, we feel it is quite unlikely that this project would become
> > > > > orphaned.
> > > > >
> > > > > === Risk: Inexperience with Open Source ===
> > > > >
> > > > > Yahoo believes strongly in open source and the exchange of
> > information
> > > to
> > > > > advance new ideas and work. Examples of this commitment are active
> > open
> > > > > source projects such as those mentioned above. With DataSketches, we
> > > have
> > > > > been increasingly open and forward-looking; we have published a
> > number
> > > of
> > > > > papers about breakthrough developments in the science of streaming
> > > > > algorithms (mentioned above) that also reference the DataSketches
> > > library.
> > > > > Our submission to the Apache Software Foundation is a logical
> > > extension of
> > > > > our commitment to open source software.
> > > > >
> > > > > Key committers at Yahoo with strong open source backgrounds include
> > > Aaron
> > > > > Gresch, Alan Carroll, Alessandro Bellina, Anastasia Braginsky,
> > Andrews
> > > > > Sahaya Albert, Arun S A G, Atul Mohan, Brad McMillen, Bryan Call,
> > Daryn
> > > > > Sharp, Dav Glass, David Carlin, Derek Dagit, Eric Payne, Eshcar
> > Hillel,
> > > > > Ethan Li, Fei Deng, Francis Christopher Liu, Francisco
> > Perez-Sorrosal,
> > > Gil
> > > > > > Yehuda, Govind Menon, Hang Yang, Jacob Estelle, Jai Asher, James
> > > Penick,
> > > > > Jason Kenny, Jay Pipes, Jim Rollenhagen, Joe Francis, Jon Eagles,
> > > Kihwal
> > > > > Lee, Kishorkumar Patil, Koji Noguchi, Kuhu Shukla, Michael Trelinski,
> > > > > Mithun Radhakrishnan, Nathan Roberts, Ohad Shacham, Olga L.
> > Natkovich,
> > > > > Parth Kamlesh Gandhi, Rajan Dhabalia, Rohini Palaniswamy, Ruby Loo,
> > > Ryan
> > > > > Bridges, Sanket Chintapalli, Satish Subhashrao Saley, Shu Kit Chan,
> > Sri
> > > > > Harsha Mekala, Susan Hinrichs, Yonatan Gottesman, and many more.
> > > > >
> > > > > > All of our core developers are committed to learning about the Apache
> > > process
> > > > > and to give back to the community.
> > > > >
> > > > > === Risk: Homogeneous Developers ===
> > > > >
> > > > > The majority of committers in this proposal belong to Yahoo due to
> > the
> > > fact
> > > > > that DataSketches has emerged from an internal Yahoo project. This
> > > proposal
> > > > > also includes developers and contributors from other companies, and
> > > who are
> > > > > actively involved with other Apache projects, such as Druid.  We
> > > expect our
> > > > > entry into incubation will allow us to expand the number of
> > > individuals and
> > > > > organizations participating in DataSketches development.
> > > > >
> > > > > === Risk: Reliance on Salaried Developers ===
> > > > >
> > > > > Because the DataSketches library originated within Yahoo, it has been
> > > > > developed primarily by salaried Yahoo developers and we expect that
> > to
> > > > > continue to be the case near term. However, since we placed this
> > > library
> > > > > into open-source we have had a number of significant contributions
> > from
> > > > > engineers and scientists from outside of Yahoo. We expect our
> > reliance
> > > on
> > > > > Yahoo salaried developers will decrease over time. Nonetheless, Yahoo
> > > is
> > > > > committed to continue its strong support of this important project.
> > > > >
> > > > > === Risk: Lack of Relationship to other Apache Products ===
> > > > >
> > > > > DataSketches already directly interoperates with or utilizes several
> > > > > existing Apache projects.
> > > > >
> > > > > * Build
> > > > >    * Apache Maven
> > > > >
> > > > > * Integrations and adaptors for the following projects naturally have
> > > them
> > > > > as dependencies
> > > > >    * Apache Hive
> > > > >    * Apache Pig
> > > > >    * Apache Druid
> > > > >    * Apache Spark
> > > > >
> > > > > * Additional dependencies for the above integrations and adaptors
> > > include
> > > > >    * Apache Hadoop
> > > > >    * Apache Commons (Math)
> > > > >
> > > > > There is no other Apache project that we are aware of that duplicates
> > > the
> > > > > functionality of the DataSketches library.
> > > > >
> > > > > === Risk: An Excessive Fascination with the Apache Brand ===
> > > > >
> > > > > With this proposal we are not seeking attention or publicity. Rather,
> > > we
> > > > > firmly believe in the DataSketches library and concept and the
> > ability
> > > to
> > > > > make the DataSketches library a powerful, yet simple-to-use toolkit
> > for
> > > > > data processing. While the DataSketches library has been open source,
> > > we
> > > > > believe putting code on GitHub can only go so far. We see the Apache
> > > > > community, processes, and mission as critical for ensuring the
> > > DataSketches
> > > > > library is truly community-driven, positively impactful, and
> > innovative
> > > > > open source software. While Yahoo has taken a number of steps to
> > > advance
> > > > > its various open source projects, we believe the DataSketches library
> > > > > project is a great fit for the Apache Software Foundation due to its
> > > focus
> > > > > on data processing and its relationships to existing ASF projects.
> > > > >
> > > > > === Risk: Cryptography ===
> > > > >
> > > > > DataSketches does not contain any cryptographic code and is not a
> > > > > cryptographic product.
> > > > >
> > > > > == Documentation ==
> > > > >
> > > > > The following documentation is relevant to this proposal. Relevant
> > > portions
> > > > > of the documentation will be contributed to the Apache DataSketches
> > > > > project.
> > > > >
> > > > > * DataSketches website: https://datasketches.github.io.
> > > > >
> > > > > * DataSketches website repository:
> > > > > https://github.com/DataSketches/DataSketches.github.io
> > > > >
> > > > > We will need an apache website for this documentation similar to
> > > > >
> > > > > * https://datasketches.apache.org
> > > > >
> > > > > == Initial Source ==
> > > > >
> > > > > The initial source for DataSketches which we will submit to the
> > Apache
> > > > > Foundation will include a number of repositories which are currently
> > > hosted
> > > > > under the GitHub.com/datasketches organization:
> > > > >
> > > > > All github.com/datasketches repositories including:
> > > > >
> > > > > * Java
> > > > >    * sketches-core: This repository has the core sketching classes,
> > > which
> > > > > are leveraged by some of the other repositories. This repository has
> > no
> > > > > external dependencies outside of the DataSketches/memory repository,
> > > Java
> > > > > and TestNG for unit tests. This code is versioned and the latest
> > > release
> > > > > can be obtained from Maven Central.
> > > > >    * memory: Low level, high-performance memory data-structure
> > > management
> > > > > primarily for off-heap.
> > > > >    * sketches-android: This is a new repository dedicated to sketches
> > > > > designed to be run in a mobile client, such as a cell phone. It is
> > > still in
> > > > > development and should be considered experimental.
> > > > >    * sketches-hive: This repository contains Hive UDFs and UDAFs for
> > > use
> > > > > within Hadoop grid environments. This code has dependencies on
> > > > > sketches-core as well as Hadoop and Hive. Users of this code are
> > > advised to
> > > > > use Maven to bring in all the required dependencies. This code is
> > > versioned
> > > > > and the latest release can be obtained from Maven Central.
> > > > >    * sketches-pig: This repository contains Pig User Defined
> > Functions
> > > > > (UDF) for use within Hadoop grid environments. This code has
> > > dependencies
> > > > > on sketches-core as well as Hadoop and Pig. Users of this code are
> > > advised
> > > > > to use Maven to bring in all the required dependencies. This code is
> > > > > versioned and the latest release can be obtained from Maven Central.
> > > > >    * sketches-vector: This is a new repository dedicated to sketches
> > > for
> > > > > vector and matrix operations. It is still somewhat experimental.
> > > > >    * characterization: This relatively new repository is for code
> > that
> > > we
> > > > > use to characterize the accuracy and speed performance of the
> > sketches
> > > in
> > > > > the library and is constantly being updated. Examples of the job
> > > command
> > > > > files used for various tests can be found in the src/main/resources
> > > > > > directory. Some of these tests can run for hours depending on their
> > > > > configuration.
> > > > >    * experimental: This repository is an experimental staging area
> > for
> > > code
> > > > > that will eventually end up in another repository. This code is not
> > > > > versioned and not registered with Maven Central.
> > > > >    * sketches-misc: Demos and other code not related to production
> > > > > deployment
> > > > >
> > > > > * C++ and Python
> > > > >    * sketches-core-cpp: This is the C++/Python companion to the Java
> > > > > sketches-core. These implementations are binary compatible with their
> > > > > counterparts in Java. In other words, a sketch created and stored in
> > > C++
> > > > > > can be opened and read in Java and vice versa. This site also has our
> > > > > Python adaptors that basically wrap the C++ implementations, making
> > the
> > > > > high performance C++ implementations available from Python.
> > > > >    * sketches-postgres: This site provides the postgres-specific
> > > adaptors
> > > > > that wrap the C++ implementations making them available to the
> > Postgres
> > > > > database users.
> > > > >    * characterization-cpp: This is the C++/Python companion to the
> > Java
> > > > > characterization repository.
> > > > >    * experimental-cpp: This repository is an experimental staging
> > area
> > > for
> > > > > C++ code that will eventually end up in another repository.
> > > > >
> > > > > * Command-Line Tools
> > > > >    * sketches-cmd
> > > > >    * homebrew-sketches
> > > > >    * homebrew-sketches-cmd
> > > > >
> > > > > These projects have always been Apache 2.0 licensed. We intend to
> > > bundle
> > > > > all of these repositories since they are all complementary and should
> > > be
> > > > > maintained in one project. Prior to our submission, we will combine
> > > all of
> > > > > these projects into a new git repository.
> > > > >
> > > > > == Source and Intellectual Property Submission Plan ==
> > > > >
> > > > > Contributors to the DataSketches project have also signed the Yahoo
> > > > > > Individual Contributor License Agreement
> > > > > > (https://yahoocla.herokuapp.com/) in order to contribute to the
> > > > > > project.
> > > > >
> > > > > With respect to trademark rights, Yahoo does not hold a trademark on
> > > the
> > > > > phrase “DataSketches.” Based on feedback and guidance we receive
> > > during the
> > > > > incubation process, we are open to renaming the project if necessary
> > > for
> > > > > trademark or other concerns, but we would prefer not to have to do
> > > that.
> > > > >
> > > > > == External Dependencies ==
> > > > >
> > > > > All external dependencies are licensed under an Apache 2.0 or
> > > > > Apache-compatible license. As we grow the DataSketches community we
> > > will
> > > > > > configure our build process to require and validate that all contributions
> > > and
> > > > > dependencies are licensed under the Apache 2.0 license or are under
> > an
> > > > > Apache-compatible license.
> > > > >
> > > > > == Required Resources ==
> > > > >
> > > > > === Mailing Lists ===
> > > > >
> > > > > We currently use a mix of mailing lists. We will migrate our existing
> > > > > mailing lists to the following:
> > > > >
> > > > > * dev@datasketches.incubator.apache.org
> > > > >
> > > > > * user@datasketches.incubator.apache.org
> > > > >
> > > > > * private@datasketches.incubator.apache.org
> > > > >
> > > > > * commits@datasketches.incubator.apache.org
> > > > >
> > > > > === Source Control ===
> > > > >
> > > > > The DataSketches team currently uses Git and would like to continue
> > to
> > > do
> > > > > so. We request a Git repository for DataSketches with mirroring to
> > > GitHub
> > > > > > enabled, similar to the following:
> > > > >
> > > > > * https://github.com/apache/incubator-datasketches.git
> > > > >
> > > > > === Issue Tracking ===
> > > > >
> > > > > We request the creation of an Apache-hosted JIRA. The DataSketches
> > > project
> > > > > is currently using the public GitHub issue tracker and the public
> > > Google
> > > > > Groups forum/sketches-user for issue tracking and discussions. We
> > will
> > > > > migrate and combine from these two sources to the Apache JIRA.
> > > > >
> > > > > Proposed Jira ID: DATASKETCHES
> > > > >
> > > > > == Initial Committers ==
> > > > >
> > > > > The following list of individuals have been extremely active in our
> > > > > community and should have write (commit) permissions to the
> > repository.
> > > > >
> > > > > * Eshcar Hillel                      [eshcar at verizonmedia dot com]
> > > > >
> > > > > * Kevin Lang                    [langk at verizonmedia dot com]
> > > > >
> > > > > * Roman Leventov              [roman.leventov at c.metamarkets dot
> > com]
> > > > >
> > > > > * Edo Liberty                   [libertye at amazon dot com]
> > > > >
> > > > > * Jon Malkin                    [jmalkin at verizonmedia dot com]
> > > > >
> > > > > * Lee Rhodes                  [lrhodes at verizonmedia dot com] &
> > > [leerho
> > > > > at gmail dot com]
> > > > >
> > > > > * Alexander Saydakov         [saydakov at verizonmedia dot com]
> > > > >
> > > > > * Justin Thaler                 [justin.thaler at georgetown dot edu]
> > > > >
> > > > > == Affiliations ==
> > > > >
> > > > > The initial committers are from four organizations: Yahoo, Amazon,
> > > > > Georgetown University, and Metamarkets/Snap.
> > > > >
> > > > > === Champion ===
> > > > > (Recommended to me: )
> > > > >
> > > > > Liang Chen, Vice President of Apache CarbonData, [chenliang613 at
> > > apache
> > > > > dot org]
> > > > > > Jean-Baptiste Onofré, [jb at nanthrax dot net]
> > > > >
> > > > > === Nominated Mentors ===
> > > > > (Recommended to me: )
> > > > >
> > > > > Liang Chen, Vice President of Apache CarbonData, [chenliang613 at
> > > apache
> > > > > dot org]
> > > > > Jean-Baptiste Onofré, jb at nanthrax dot net
> > > > > Gil Yehuda, gyehuda at verizonmedia dot com
> > > > >
> > > > > === Sponsoring Entity ===
> > > > >
> > > > > * The Apache Incubator    **** This is our 1st choice ****
> > > > >
> > > > > * Apache Druid. The incubating Apache Druid project might also be a
> > > logical
> > > > > sponsor. However, DataSketches has applications in many areas of
> > > computing
> > > > > outside of Druid so our preference and recommendation is that
> > > DataSketches
> > > > > would ultimately be a top-level Apache project.
> > > > >
> > > > > ________________
> > > > > [1] In 2017 Verizon acquired Yahoo and merged it with previously
> > > acquired
> > > > > AOL. The merged entity was originally called Oath, Inc., but has
> > > recently
> > > > > been renamed Verizon Media, Inc., a wholly-owned subsidiary of
> > Verizon,
> > > > > Inc.  Since Yahoo is the more recognized name, references in this
> > > document
> > > > > to Yahoo, are also a reference to Verizon Media, Inc.
> > > > >
> > > > > On Fri, Feb 22, 2019 at 9:35 PM Kenneth Knowles <ke...@apache.org>
> > > wrote:
> > > > >
> > > > > > The subject line has me interested already. Follow examples like
> > this
> > > > > > maybe?
> > > > > >
> > > > > > 1.
> > > > > >
> > > > > >
> > > > >
> > >
> > https://lists.apache.org/thread.html/a5db74cc9e5ae89b3bfa5f4b07bfcc18dae84b7098232fb897cd47b7@%3Cgeneral.incubator.apache.org%3E
> > > > > > 2.
> > > > > >
> > > > > >
> > > > >
> > >
> > https://lists.apache.org/thread.html/5a7f6a218b11a1cac61fbd53f4c995fd7716f8ad3751cf9f171ebd57@%3Cgeneral.incubator.apache.org%3E
> > > > > >
> > > > > > Kenn
> > > > > >
> > > > > > On Fri, Feb 22, 2019 at 8:05 PM leerho <le...@gmail.com> wrote:
> > > > > >
> > > > > > > I'll try again ... :)
> > > > > > >
> > > > > > > On Fri, Feb 22, 2019 at 8:00 PM Ted Dunning <
> > ted.dunning@gmail.com
> > > >
> > > > > > wrote:
> > > > > > >
> > > > > > >> It didn't make it again
> > > > > > >>
> > > > > > >> On Fri, Feb 22, 2019, 8:35 PM leerho <le...@gmail.com> wrote:
> > > > > > >>
> > > > > > >> > I'm not sure the attached document made it through.
> > > > > > >> >
> > > > > > >> > On Fri, Feb 22, 2019 at 7:28 PM leerho <le...@gmail.com>
> > > wrote:
> > > > > > >> >
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> >
> > > > > > >>
> > > > > > >
> > > > > > >
> > > ---------------------------------------------------------------------
> > > > > > > To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> > > > > > > For additional commands, e-mail:
> > general-help@incubator.apache.org
> > > > > >
> > > > >
> > > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> > > For additional commands, e-mail: general-help@incubator.apache.org
> > >
> > >
> >
> -- 
> From my cell phone.
> 


DataSketches Proposal WRT Commons-Math

Posted by le...@gmail.com, le...@gmail.com.

As your suggestion may come up again, I have addressed it by adding a small section to the proposal itself (in the Google Doc).

On 2019/02/25 17:36:31, Ted Dunning <te...@gmail.com> wrote: 
> There is also the general question of whether it is better to be a
> top-level project or to become a contribution to commons math.
> 
> 
> 
> On Sun, Feb 24, 2019 at 10:56 PM leerho <le...@gmail.com> wrote:
> 
> > Yes I will try that tomorrow.
> >
> > On Sun, Feb 24, 2019 at 7:34 PM Kenneth Knowles <ke...@apache.org> wrote:
> >
> > > Can you share the Google doc with the proposal? Per Ted's advice, we can
> > > iterate quickly there and move it to the wiki when it becomes a bit more
> > > stable.
> > >
> > > Kenn
> > >
> > > On Fri, Feb 22, 2019 at 10:21 PM leerho@gmail.com <le...@gmail.com>
> > > wrote:
> > >
> > > > Thanks for the offer.  I am a neophyte at this process and email app!
> >  I
> > > > could use a lot of help getting this off the ground!  Also, I'm not
> > sure
> > > > that Mr. Chen and Mr. Onofré have fully accepted taking this on :)
> > > >
> > > > Lee.
> > > >
> > > > On 2019/02/23 06:03:58, Kenneth Knowles <ke...@apache.org> wrote:
> > > > > Nice.
> > > > >
> > > > > I would very much like to help mentor this project, though you
> > already
> > > > have
> > > > > a couple good ones.
> > > > >
> > > > > I concur with incubator as sponsoring entity.
> > > > >
> > > > > Kenn (VP Apache Beam)
> > > > >
> > > > > On Fri, Feb 22, 2019 at 9:45 PM leerho <le...@gmail.com> wrote:
> > > > >
> > > > > > I didn't realize that this mail list does not accept PDF files,
> > > > apparently
> > > > > > only text.  So let me try one more time ... :)  Please let me know
> > if
> > > > > > this works!
> > > > > >
> > > > > >
> > > > > > = Apache DataSketches Proposal[1] =
> > > > > >
> > > > > > == Abstract ==
> > > > > >
> > > > > > DataSketches.GitHub.io is an open source, high-performance library
> > > of
> > > > > > stochastic streaming algorithms commonly called "sketches" in the
> > > data
> > > > > > sciences. Sketches are small, stateful programs that process
> > massive
> > > > data
> > > > > > as a stream and can provide approximate answers, with mathematical
> > > > > > guarantees, to computationally difficult queries
> > orders-of-magnitude
> > > > faster
> > > > > > than traditional, exact methods.
> > > > > >
> > > > > > This proposal is to move DataSketches to the Apache Software
> > > > > > > Foundation (ASF), transferring ownership of its copyright
> > intellectual
> > > > > > property to the ASF.  Thereafter, DataSketches would be officially
> > > > known as
> > > > > > Apache DataSketches and its evolution and governance would come
> > under
> > > > the
> > > > > > rules and guidance of the ASF.
> > > > > >
> > > > > > == Introduction ==
> > > > > >
> > > > > > The DataSketches library contains carefully crafted implementations
> > > of
> > > > > > sketch algorithms that meet rigorous standards of quality and
> > > > performance
> > > > > > and provide capabilities required for large-scale production
> > systems
> > > > that
> > > > > > must process and analyze massive data. The DataSketches core
> > > > repository is
> > > > > > written in Java with a parallel core repository written in C++ that
> > > > > > includes Python wrappers. The DataSketches library also includes
> > > > special
> > > > > > repositories for extending the core library for Apache Hive and
> > > Apache
> > > > Pig.
> > > > > > The sketches developed in the different languages share a common
> > > binary
> > > > > > storage format so that sketches created and stored in Java, for
> > > > example,
> > > > > > > can be fully used in C++, and vice versa.  Because the stored
> > sketch
> > > > > > "images" are just a "blob" of bytes (similar to picture images),
> > they
> > > > can
> > > > > > be shared across many different systems, languages and platforms.
> > > > > >
> > > > > > The DataSketches documentation website,
> > > https://datasketches.github.io
> > > > ,
> > > > > > includes general tutorials, a comprehensive research section with
> > > > > > references to relevant academic papers, extensive examples for
> > using
> > > > the
> > > > > > core library directly as well as examples for accessing the library
> > > in
> > > > > > Hive, Pig, and Apache Spark.
> > > > > >
> > > > > > The DataSketches library also includes a characterization
> > repository
> > > > for
> > > > > > long running test programs that are used for studying accuracy and
> > > > > > performance of these sketches over wide ranges of input variables.
> > > The
> > > > data
> > > > > > produced by these programs is used for generating the many
> > > performance
> > > > > > plots contained in the documentation website and for academic
> > > > > > publications.
> > > > > >
> > > > > > The code repositories used for production are versioned and
> > published
> > > > to
> > > > > > > Maven Central at periodic intervals as the library evolves.
> > > > > >
> > > > > > The DataSketches library also includes several experimental
> > > > repositories
> > > > > > for use-cases outside the large-scale systems environments, such as
> > > > > > sketches for mobile, IoT devices (Android), command-line access of
> > > the
> > > > > > sketch library, and an experimental repository for vector-based
> > > > sketches
> > > > > > that performs approximate Singular Value Decomposition (SVD)
> > analysis
> > > > that
> > > > > > could potentially be used in Machine Learning (ML) applications.
> > > > > >
> > > > > > == Background ==
> > > > > >
> > > > > > > The DataSketches library was started in 2012 as an internal Yahoo
> > > project
> > > > to
> > > > > > dramatically reduce time and resources required for distinct
> > (unique)
> > > > > > counting.  An extensive search on the Internet at the time yielded
> > a
> > > > number
> > > > > > of theoretical papers on stochastic streaming algorithms with
> > > > pseudocode
> > > > > > examples, but we did not find any usable open-source code of the
> > > > quality we
> > > > > > felt we needed for our internal production systems.  So we started
> > a
> > > > small
> > > > > > project (one person) to develop our own sketches working directly
> > > from
> > > > > > published theoretical papers.
> > > > > >
> > > > > > The DataSketches library was designed from the start with the
> > > > objective of
> > > > > > making these algorithms, usually only described in theoretical
> > > papers,
> > > > > > easily accessible to systems developers for use in our internal
> > > > production
> > > > > > systems. By necessity, the code had to be of the highest quality
> > and
> > > > > > thoroughly tested. The wide variety of our internal production
> > > systems
> > > > > > drove the requirement that the sketch implementations had to have
> > an
> > > > > > absolute minimum of external, run-time dependencies in order to
> > > > simplify
> > > > > > integration and troubleshooting.
> > > > > >
> > > > > > Our internal experiments demonstrated dramatic positive impact on
> > the
> > > > > > performance of our systems.  As a result, the DataSketches library
> > > > quickly
> > > > > > evolved to include different types of sketches for different types
> > of
> > > > > > > queries, such as frequent-items (a.k.a. heavy-hitters) algorithms,
> > > > > > quantile/histogram algorithms, and weighted and unweighted sampling
> > > > > > algorithms.
> > > > > >
> > > > > > We quickly discovered that developing these sketch algorithms to be
> > > > truly
> > > > > > robust in production environments is quite difficult and requires
> > > deep
> > > > > > understanding of the underlying mathematics and statistics as well
> > as
> > > > > > extensive experience in developing high quality code for 24/7
> > > > production
> > > > > > systems. This is a difficult combination of skills for any one
> > > > organization
> > > > > > to collect and maintain over time. It became clear that this
> > > technology
> > > > > > needed a community larger than Yahoo to evolve.  In November, 2015,
> > > > this
> > > > > > factor, along with Yahoo’s strong experience and support of open
> > > > source,
> > > > > > led to the decision to open source this technology under an Apache
> > > 2.0
> > > > > > license on GitHub. Since that time our community has expanded
> > > > considerably
> > > > > > > and the key contributors to this effort include leading research
> > > > > > scientists from a number of universities as well as practitioners
> > and
> > > > > > researchers from a number of major corporations. The core of this
> > > > group is
> > > > > > very active as we meet weekly to discuss research directions and
> > > > > > engineering priorities.
> > > > > >
> > > > > > It is important to note that our internal systems at Yahoo use the
> > > > current
> > > > > > public GitHub open source DataSketches library and not an internal
> > > > version
> > > > > > of the code.
> > > > > >
> > > > > > The close collaboration of scientific research and engineering
> > > > development
> > > > > > experience with actual massive-data processing systems has also
> > > > produced
> > > > > > new research publications in the field of stochastic streaming
> > > > algorithms,
> > > > > > for example:
> > > > > >
> > > > > > * Daniel Anderson, Pryce Bevan, Kevin J. Lang, Edo Liberty, Lee
> > > > Rhodes, and
> > > > > > Justin Thaler. A high-performance algorithm for identifying
> > frequent
> > > > items
> > > > > > in data streams. In ACM IMC 2017.
> > > > > >
> > > > > > * Anirban Dasgupta, Kevin J. Lang, Lee Rhodes, and Justin Thaler. A
> > > > > > framework for estimating stream expression cardinalities. In
> > > *EDBT/ICDT
> > > > > > Proceedings ‘16 *, pages 6:1–6:17, 2016.
> > > > > >
> > > > > > * Mina Ghashami, Edo Liberty, Jeff M. Phillips. Efficient Frequent
> > > > > > Directions Algorithm for Sparse Matrices. In ACM SIGKDD Proceedings
> > > > ‘16,
> > > > > > pages 845-854, 2016.
> > > > > >
> > > > > > * Zohar S. Karnin, Kevin J. Lang, and Edo Liberty. Optimal quantile
> > > > > > approximation in streams. In IEEE FOCS Proceedings ‘16, pages
> > 71–78,
> > > > 2016.
> > > > > >
> > > > > > * Kevin J Lang. Back to the future: an even more nearly optimal
> > > > cardinality
> > > > > > estimation algorithm. arXiv preprint
> > > https://arxiv.org/abs/1708.06839,
> > > > > > 2017.
> > > > > >
> > > > > > * Edo Liberty. Simple and deterministic matrix sketching. In ACM
> > KDD
> > > > > > Proceedings ‘13, pages 581– 588, 2013.
> > > > > >
> > > > > > * Edo Liberty, Michael Mitzenmacher, Justin Thaler, and Jonathan
> > > > Ullman.
> > > > > > Space lower bounds for itemset frequency sketches. In ACM PODS
> > > > Proceedings
> > > > > > ‘16, pages 441–454, 2016.
> > > > > >
> > > > > > * Michael Mitzenmacher, Thomas Steinke, and Justin Thaler.
> > > Hierarchical
> > > > > > heavy hitters with the space saving algorithm. In SIAM ALENEX
> > > > Proceedings
> > > > > > ‘12, pages 160–174, 2012.
> > > > > >
> > > > > > == The Rationale for Sketches ==
> > > > > >
> > > > > > In the analysis of big data there are often problem queries that
> > > don’t
> > > > > > scale because they require huge compute resources and time to
> > > generate
> > > > > > exact results. Examples include count distinct, quantiles, most
> > > > frequent
> > > > > > items, joins, matrix computations, and graph analysis.
> > > > > >
> > > > > > If we can loosen the requirement of “exact” results from our
> > queries
> > > > and be
> > > > > > satisfied with approximate results, within some well understood
> > > bounds
> > > > of
> > > > > > error, there is an entire branch of mathematics and data science
> > that
> > > > has
> > > > > > evolved around developing algorithms that can produce approximate
> > > > results
> > > > > > with mathematically well-defined error properties.
> > > > > >
> > > > > > With the additional requirements that these algorithms must be
> > small
> > > > > > (compared to the size of the input data), sublinear (the size of
> > the
> > > > sketch
> > > > > > must grow at a slower rate than the size of the input stream),
> > > > streaming
> > > > > > (they can only touch each data item once), and mergeable (suitable
> > > for
> > > > > > > distributed processing), we obtain a class of algorithms that can be
> > > > > > described as small, stochastic, streaming, sublinear mergeable
> > > > algorithms,
> > > > > > commonly called sketches (they also have other names, but we will
> > use
> > > > the
> > > > > > term sketches from here on).
> > > > > >
> > > > > > To be truly streaming and be able to process data in a single pass,
> > > > > > sketches must make absolute minimum assumptions about the input
> > > stream.
> > > > > > This is critically important, as there is no “second chance” to
> > > > process the
> > > > > > data.
> > > > > >
> > > > > > For example, sketches should not make assumptions about the order
> > of
> > > > stream
> > > > > > items, the stream length, the dynamic range of values, or the
> > > > distribution
> > > > > > of item occurrence frequencies. Sketches should be tolerant of
> > NaNs,
> > > > Nulls
> > > > > > and empty objects. About the only thing that the sketch needs to
> > know
> > > > about
> > > > > > the stream is how to extract items from it and what type the item
> > is,
> > > > e.g.,
> > > > > > is it a numeric value or a string.
> > > > > >
> > > > > > As far as the sketch is concerned, the input stream is a sequence
> > of
> > > > items
> > > > > > in some unknown random order with unknown random values.
> > > > > >
> > > > > > The sketch is essentially a complex state machine and combined with
> > > the
> > > > > > random input stream defines a stochastic process. We then apply
> > > > > > probabilistic methods to interpret the states of the stochastic
> > > > process in
> > > > > > order to extract useful information about the input stream itself.
> > > The
> > > > > > resulting information will be approximate, but we also use
> > additional
> > > > > > probabilistic methods to extract an estimate of the likely
> > > probability
> > > > > > distribution of error.
> > > > > >
> > > > > > There is a significant scientific contribution here that is
> > defining
> > > > the
> > > > > > state machine, understanding the resulting stochastic process,
> > > > developing
> > > > > > the probabilistic methods, and proving mathematically, that it all
> > > > works!
> > > > > > This is why the scientific contributors to this project are a
> > > critical
> > > > and
> > > > > > strategic component to our success.  The development engineers
> > > > translate
> > > > > > the concepts of the proposed state machine and probabilistic
> > methods
> > > > into
> > > > > > production-quality code. Even more important, they work closely
> > with
> > > > the
> > > > > > scientists, feeding back system and user requirements, which leads
> > > not
> > > > only
> > > > > > to superior product design, but to new science as well.  A number
> > of
> > > > > > > scientific papers our members have published (see above) are a
> > direct
> > > > result
> > > > > > of this close collaboration.
> > > > > >
> > > > > > Because sketches are small they can be processed extremely fast,
> > > often
> > > > many
> > > > > > orders-of-magnitude faster than traditional exact computations. For
> > > > > > interactive queries there may not be other viable alternatives, and
> > > in
> > > > the
> > > > > > case of real-time analysis, sketches are the only known solution.
> > > > > >
> > > > > > For any system that needs to extract useful information from
> > massive
> > > > data,
> > > > > > sketches are essential tools that should be tightly integrated into
> > > the
> > > > > > system’s analysis capabilities. This technology has helped Yahoo
> > > > > > successfully reduce data processing times from days to hours or
> > > > minutes on
> > > > > > a number of its internal platforms and has enabled subsecond
> > queries
> > > on
> > > > > > real-time platforms that would have been infeasible without
> > sketches.
> > > > > > >
> > > > > > > == The Rationale for Apache DataSketches ==
> > > > > > >
> > > > > > Other open source implementations of sketch algorithms can be found
> > > on
> > > > the
> > > > > > Internet. However, we have not yet found any open source
> > > > implementations
> > > > > > that are as comprehensive, engineered with the quality required for
> > > > > > production systems, and with usable and guaranteed error
> > properties.
> > > > Large
> > > > > > Internet companies, such as Google and Facebook, have published
> > > papers
> > > > on
> > > > > > > sketching; however, their implementations of their published
> > > > algorithms are
> > > > > > proprietary and not available as open source.
> > > > > >
> > > > > > The DataSketches library already provides integrations with a
> > number
> > > of
> > > > > > major Apache data processing platforms such as Apache Hive, Apache
> > > Pig,
> > > > > > Apache Spark and Apache Druid, and is also integrated with a number
> > > of
> > > > > > other open source data processing platforms such as Splice Machine,
> > > > GCHQ
> > > > > > Gaffer and PostgreSQL.
> > > > > >
> > > > > > We believe that having DataSketches as an Apache project will
> > provide
> > > > an
> > > > > > immediate, worthwhile, and substantial contribution to the open
> > > source
> > > > > > community, will have a better opportunity to provide a meaningful
> > > > > > contribution to both the science and engineering of sketching
> > > > algorithms,
> > > > > > and integrate with other Apache projects.  In addition, this is a
> > > > > > significant opportunity for Apache to be the "go-to" destination
> > for
> > > > users
> > > > > > that want to leverage this exciting technology.
> > > > > >
> > > > > > == Initial Goals ==
> > > > > >
> > > > > > We are breaking our initial goals into short-term (2-6 months) and
> > > > > > > intermediate to long-term (6 months to 2 years):
> > > > > >
> > > > > > Our short-term goals include:
> > > > > >
> > > > > > * Understanding and adapting to the Apache development process and
> > > > > > structures.
> > > > > >
> > > > > > > * Start refactoring the codebase and move the code from the various
> > > > > > > DataSketches repositories to an Apache Git repository.
> > > > > >
> > > > > > * Continue development of new features, functions, and fixes.
> > > > > >
> > > > > > * Specific sub-projects (e.g., C++ and Python) will continue to be
> > > > > > developed and expanded.
> > > > > >
> > > > > >
> > > > > > The intermediate to long term goals include:
> > > > > >
> > > > > > * Completing the design and implementation of the C++ sketches to
> > > > > > complement what is already available in Java, and the Python
> > wrappers
> > > > of
> > > > > > those C++ sketches.
> > > > > >
> > > > > > * Expanding the C++ build framework to include Windows and the
> > > popular
> > > > > > Linux variants.
> > > > > >
> > > > > > * Continued engagement with the scientific research community on
> > the
> > > > > > development of new algorithms for computationally difficult
> > problems
> > > > that
> > > > > > heretofore have not had a sketching solution.
> > > > > >
> > > > > > == Current Status ==
> > > > > >
> > > > > > The DataSketches GitHub project has been quite successful.  As of
> > > this
> > > > > > > writing (February 2019) the number of downloads measured by the Nexus
> > > > > > Repository Manager at https://oss.sonatype.org has grown by
> > nearly a
> > > > > > factor
> > > > > > of 10 over the past year to about 55 thousand per month. The
> > > > > > DataSketches/sketches-core repository has about 560 stars and 141
> > > > forks,
> > > > > > which is pretty good for a highly specialized library.
> > > > > >
> > > > > > === Development Practices ===
> > > > > >
> > > > > > ==== Source Control ====
> > > > > >
> > > > > > All of our developers have extensive experience with Git version
> > > > control
> > > > > > and follow accepted practices for use of Pull Requests (PRs), code
> > > > reviews
> > > > > > and commits to master, for example.
> > > > > >
> > > > > > ==== Testing ====
> > > > > >
> > > > > > > Sketches, by their nature, are probabilistic programs and don’t
> > > > necessarily
> > > > > > behave deterministically.  For some of the sketches we
> > intentionally
> > > > insert
> > > > > > random noise into the code as this gives us the mathematical
> > > properties
> > > > > > that we need to guarantee accuracy.  This can make the behavior of
> > > > these
> > > > > > algorithms quite unintuitive and provides significant challenges to
> > > the
> > > > > > developer who wishes to test these algorithms for correctness. As a
> > > > result,
> > > > > > our testing strategy includes two major components: unit tests, and
> > > > > > characterization tests.
> > > > > >
> > > > > > ===== Unit Testing =====
> > > > > >
> > > > > > Our unit tests are primarily quick tests to make sure that we
> > > exercise
> > > > all
> > > > > > critical paths in the code and that key branches are executed
> > > > correctly. It
> > > > > > is important that they execute relatively fast as they are
> > generally
> > > > run on
> > > > > > every code build. The sketches-core repository alone has about 22
> > > > thousand
> > > > > > statements, over 1300 unit tests and code coverage of about 98.2%
> > as
> > > > > > > measured by Atlassian/Clover.  Our goal is for every code repository
> > > > > > > used in production to have code coverage greater than 90%.
> > > > > >
> > > > > > ===== Characterization Testing =====
> > > > > >
> > > > > > In order to test the probabilistic methods that are used to
> > interpret
> > > > the
> > > > > > stochastic behaviors of our sketches we have a separate
> > > > characterization
> > > > > > > repository that is dedicated to this.  Measuring accuracy, for
> > > > example,
> > > > > > requires running thousands of trials at each of many different
> > points
> > > > along
> > > > > > the domain axis. Each trial compares its estimated results against
> > a
> > > > known
> > > > > > exact result producing an error for that trial.  These error
> > > > measurements
> > > > > > are then fed into our Quantiles sketch to capture the actual
> > > > distribution
> > > > > > of error at that point along the axis. We then select quantile
> > > contours
> > > > > > across all the distributions at points along the axis.  These
> > > contours
> > > > can
> > > > > > then be plotted to reveal the shape of the actual error
> > distribution.
> > > > These
> > > > > > > distributions are not at all Gaussian; in fact, they can be quite
> > > > complex.
> > > > > > Nonetheless, these distributions are then checked against our
> > > > statistical
> > > > > > guarantees inherent to the specific sketch algorithm and its
> > > > parameters.
> > > > > > There are many examples of these characterization error
> > distributions
> > > > on
> > > > > > our website. The runtimes of these tests can be very long and can
> > > range
> > > > > > from many minutes to hours, and some can run for days.  Currently,
> > we
> > > > have
> > > > > > separate characterization repositories for Java and C++ / Python.
> > > > > >
> > > > > > It is our goal that we perform this characterization analysis for
> > all
> > > > of
> > > > > > our sketches.  By definition, the code that runs these
> > > characterization
> > > > > > tests is open-source so others can run these tests as well.  We do
> > > not
> > > > have
> > > > > > formal releases of this code (because it is not production code)
> > and
> > > > it is
> > > > > > not published to Maven Central.
> > > > > >
> > > > > > === Meritocracy ===
> > > > > >
> > > > > > DataSketches was initially developed based on requirements within
> > > > Yahoo. As
> > > > > > a project on GitHub, DataSketches has received contributions from
> > > > numerous
> > > > > > individual developers from around the world, dedicated research
> > work
> > > > from
> > > > > > senior scientists at Amazon and Visa, and academic researchers from
> > > > > > Georgetown University, Princeton, and MIT.
> > > > > >
> > > > > > As a project under incubation, we are committed to expanding our
> > > > effort to
> > > > > > build an environment which supports a meritocracy. We are focused
> > on
> > > > > > engaging the community and other related projects for support and
> > > > > > > contributions. Moreover, we are committed to ensuring that contributors
> > and
> > > > > > committers to DataSketches come from a broad mix of organizations
> > > > through a
> > > > > > merit-based decision process during incubation. We believe strongly
> > > in
> > > > the
> > > > > > DataSketches premise that fulfills the concept of a well engineered
> > > and
> > > > > > scientifically rigorous library that implements these powerful
> > > > algorithms
> > > > > > and are committed to growing an inclusive community of DataSketches
> > > > > > contributors and users.
> > > > > >
> > > > > > === Community ===
> > > > > >
> > > > > > > Yahoo has a long history of active engagement in the Open Source
> > > > > > community. Major projects include: Vespa.ai, Bullet, Moloch,
> > > Panoptes,
> > > > > > Screwdriver.cd, Athenz, HaloDB, Maha, Mendel, TensorFlowOnSpark,
> > > > gifshot,
> > > > > > fluxible, as well as the creation, contribution and incubation of
> > > many
> > > > > > Apache projects such as Apache Hadoop, Pig, Bookkeeper, Oozie,
> > > > Zookeeper,
> > > > > > Omid, Pulsar, Traffic Server, Storm, Druid, and many more.
> > > > > >
> > > > > > > Every day, DataSketches is actively used by organizations and
> > > > > > institutions around the world for batch and stream processing of
> > > data.
> > > > We
> > > > > > believe acceptance will allow us to consolidate existing
> > > > > > DataSketches-related work, grow the DataSketches community, and
> > > deepen
> > > > > > connections between DataSketches and other open source projects.
> > > > > >
> > > > > > === Introduction to the Core Developers & Contributors ===
> > > > > >
> > > > > > The core developers and contributors for DataSketches are from
> > > diverse
> > > > > > backgrounds, but primarily are scientists that love engineering and
> > > > > > engineers that love science. A large part of the value we bring
> > comes
> > > > from
> > > > > > this synthesis.  These individuals have already contributed
> > > > substantially
> > > > > > to the code, algorithms, and/or mathematical proofs that form the
> > > > basis of
> > > > > > the library.
> > > > > >
> > > > > > > This core group also forms the Initial Committers with write
> > > > > > > permissions to the repository. Those marked with (*) meet weekly to
> > > > > > > plan the research and engineering direction of the project.
> > > > > >
> > > > > > ==== Scientists That Love Engineering ====
> > > > > >
> > > > > > * Eshcar Hillel: Senior Research Scientist, Yahoo Labs, Israel.
> > > > Interests:
> > > > > > distributed systems, scalable systems and platforms for big data
> > > > > > > processing, concurrent algorithms and data structures.
> > > > > >
> > > > > > * Kevin Lang: (*) Distinguished Research Scientist, Yahoo Labs,
> > > > Sunnyvale,
> > > > > > California. Interests: algorithms, theoretical and applied
> > > mathematics,
> > > > > > encoding and compression theory, theoretical and applied
> > performance
> > > > > > optimization.
> > > > > >
> > > > > > * Edo Liberty: (*) Director of Research, Head of Amazon AI Labs,
> > Palo
> > > > Alto,
> > > > > > California. Manages the algorithms group at Amazon AI. We build
> > > > scalable
> > > > > > machine learning systems and algorithms which are used both
> > > internally
> > > > and
> > > > > > externally by customers of SageMaker, AWS's flagship machine
> > learning
> > > > > > platform.
> > > > > >
> > > > > > * Jon Malkin: (*) Senior Scientist, Yahoo Labs, Sunnyvale.
> > Interests:
> > > > > > Computational advertising, machine learning, speech recognition,
> > > > > > data-driven analysis, large scale experimentation, big data,
> > > > stream/complex
> > > > > > > event processing.
> > > > > >
> > > > > > * Justin Thaler: (*) Assistant Professor, Department of Computer
> > > > Science,
> > > > > > Georgetown University, Washington D.C. Interests: algorithms and
> > > > > > computational complexity, complexity theory, quantum algorithms,
> > > > private
> > > > > > data analysis, and learning theory, developing efficient streaming
> > > and
> > > > > > > sketching algorithms.
> > > > > >
> > > > > > ==== Engineers That Love Science ====
> > > > > >
> > > > > > * Roman Leventov: Senior Software Engineer,  Metamarkets / Snap.
> > > > Interests:
> > > > > > design and implementation of data storing and data processing
> > > > (distributed)
> > > > > > systems, performance optimization, CPU performance, mechanical
> > > > sympathy,
> > > > > > JVM performance, API design, databases, (concurrent) data
> > structures,
> > > > > > memory management, garbage collection algorithms, language design
> > and
> > > > > > runtimes (their tradeoffs), distributed systems (cloud) efficiency,
> > > > Linux,
> > > > > > code quality, code transformation, pure functional programming
> > > models,
> > > > > > Haskell.
> > > > > >
> > > > > > * Lee Rhodes: (*) Distinguished Architect, lead developer and
> > founder
> > > > of
> > > > > > the DataSketches project, Yahoo, Sunnyvale, California.  Interests:
> > > > > > streaming algorithms, mathematics, computer science, high quality
> > and
> > > > high
> > > > > > performance code for the analysis of massive data, bridging the
> > > divide
> > > > > > between theory and practice.
> > > > > >
> > > > > > * Alexander Saydakov: (*) Senior Software Engineer, Yahoo,
> > Sunnyvale,
> > > > > > California. Interests: applied mathematics, computer science, big
> > > data,
> > > > > > distributed systems.
> > > > > >
> > > > > > === Introduction to Additional Interested Contributors ===
> > > > > >
> > > > > > > These folks have been intermittently involved and have contributed,
> > > > > > > and they are strong supporters of this project.
> > > > > >
> > > > > > * Frank Grimes: GitHub ID: frankgrimes97
> > > > > >
> > > > > > * Mina Ghashami: [mina.ghashami at gmail dot com] Ph.D. Computer
> > > > Science,
> > > > > > Univ of Utah. Interests: Machine Learning, Data Mining, matrix
> > > > > > approximation, streaming algorithms, randomized linear algebra.
> > > > > >
> > > > > > * Christopher Musco: [christopher.musco at gmail dot com] Ph.D.
> > > > Computer
> > > > > > Science, Research Instructor, Princeton University. Interests:
> > > > algorithmic
> > > > > > foundations of data science and machine learning, efficient methods
> > > for
> > > > > > processing and understanding large datasets, often working at the
> > > > > > intersection of theoretical computer science, numerical linear
> > > > algebra, and
> > > > > > optimization.
> > > > > >
> > > > > > * Graham Cormode: [g.cormode at warwick.ac dot uk] Ph.D. Computer
> > > > Science,
> > > > > > Professor, Warwick University, Warwick, England. Interests: all
> > > > aspects of
> > > > > > the "data lifecycle", from data collection and cleaning, through
> > > > mining and
> > > > > > analytics. (Professor Cormode is one of the world’s leading
> > > scientists
> > > > in
> > > > > > sketching algorithms)
> > > > > >
> > > > > > === Alignment ===
> > > > > >
> > > > > > The DataSketches library already provides integrations and example
> > > > code for
> > > > > > Apache Hive, Apache Pig, Apache Spark and is deeply integrated into
> > > > Apache
> > > > > > Druid.
> > > > > >
> > > > > > == Known Risks ==
> > > > > >
> > > > > > The following subsections are specific risks that have been
> > > identified
> > > > by
> > > > > > the ASF that need to be addressed.
> > > > > >
> > > > > > === Risk: Orphaned Products ===
> > > > > >
> > > > > > The DataSketches library is presently used by a number of
> > > > organizations,
> > > > > > from small startups to Fortune 100 companies, to construct
> > production
> > > > > > pipelines that must process and analyze massive data. Yahoo has a
> > > > long-term
> > > > > > commitment to continue to advance the DataSketches library;
> > moreover,
> > > > > > DataSketches is seeing increasing interest, development, and
> > adoption
> > > > from
> > > > > > many diverse organizations from around the world. Due to its
> > growing
> > > > > > adoption, we feel it is quite unlikely that this project would
> > become
> > > > > > orphaned.
> > > > > >
> > > > > > === Risk: Inexperience with Open Source ===
> > > > > >
> > > > > > Yahoo believes strongly in open source and the exchange of
> > > information
> > > > to
> > > > > > advance new ideas and work. Examples of this commitment are active
> > > open
> > > > > > source projects such as those mentioned above. With DataSketches,
> > we
> > > > have
> > > > > > been increasingly open and forward-looking; we have published a
> > > number
> > > > of
> > > > > > papers about breakthrough developments in the science of streaming
> > > > > > algorithms (mentioned above) that also reference the DataSketches
> > > > library.
> > > > > > Our submission to the Apache Software Foundation is a logical
> > > > extension of
> > > > > > our commitment to open source software.
> > > > > >
> > > > > > Key committers at Yahoo with strong open source backgrounds include
> > > > Aaron
> > > > > > Gresch, Alan Carroll, Alessandro Bellina, Anastasia Braginsky,
> > > Andrews
> > > > > > Sahaya Albert, Arun S A G, Atul Mohan, Brad McMillen, Bryan Call,
> > > Daryn
> > > > > > Sharp, Dav Glass, David Carlin, Derek Dagit, Eric Payne, Eshcar
> > > Hillel,
> > > > > > Ethan Li, Fei Deng, Francis Christopher Liu, Francisco
> > > Perez-Sorrosal,
> > > > Gil
> > > > > > > Yehuda, Govind Menon, Hang Yang, Jacob Estelle, Jai Asher, James
> > > > Penick,
> > > > > > Jason Kenny, Jay Pipes, Jim Rollenhagen, Joe Francis, Jon Eagles,
> > > > Kihwal
> > > > > > Lee, Kishorkumar Patil, Koji Noguchi, Kuhu Shukla, Michael
> > Trelinski,
> > > > > > Mithun Radhakrishnan, Nathan Roberts, Ohad Shacham, Olga L.
> > > Natkovich,
> > > > > > Parth Kamlesh Gandhi, Rajan Dhabalia, Rohini Palaniswamy, Ruby Loo,
> > > > Ryan
> > > > > > Bridges, Sanket Chintapalli, Satish Subhashrao Saley, Shu Kit Chan,
> > > Sri
> > > > > > Harsha Mekala, Susan Hinrichs, Yonatan Gottesman, and many more.
> > > > > >
> > > > > > > All of our core developers are committed to learning about the Apache
> > > > process
> > > > > > and to give back to the community.
> > > > > >
> > > > > > === Risk: Homogeneous Developers ===
> > > > > >
> > > > > > The majority of committers in this proposal belong to Yahoo due to
> > > the
> > > > fact
> > > > > > that DataSketches has emerged from an internal Yahoo project. This
> > > > proposal
> > > > > > also includes developers and contributors from other companies, and
> > > > who are
> > > > > > actively involved with other Apache projects, such as Druid.  We
> > > > expect our
> > > > > > entry into incubation will allow us to expand the number of
> > > > individuals and
> > > > > > organizations participating in DataSketches development.
> > > > > >
> > > > > > === Risk: Reliance on Salaried Developers ===
> > > > > >
> > > > > > Because the DataSketches library originated within Yahoo, it has
> > been
> > > > > > developed primarily by salaried Yahoo developers and we expect that
> > > to
> > > > > > continue to be the case near term. However, since we placed this
> > > > library
> > > > > > into open-source we have had a number of significant contributions
> > > from
> > > > > > engineers and scientists from outside of Yahoo. We expect our
> > > reliance
> > > > on
> > > > > > Yahoo salaried developers will decrease over time. Nonetheless,
> > Yahoo
> > > > is
> > > > > > committed to continue its strong support of this important project.
> > > > > >
> > > > > > === Risk: Lack of Relationship to other Apache Products ===
> > > > > >
> > > > > > DataSketches already directly interoperates with or utilizes
> > several
> > > > > > existing Apache projects.
> > > > > >
> > > > > > * Build
> > > > > >    * Apache Maven
> > > > > >
> > > > > > * Integrations and adaptors for the following projects naturally
> > have
> > > > them
> > > > > > as dependencies
> > > > > >    * Apache Hive
> > > > > >    * Apache Pig
> > > > > >    * Apache Druid
> > > > > >    * Apache Spark
> > > > > >
> > > > > > * Additional dependencies for the above integrations and adaptors
> > > > include
> > > > > >    * Apache Hadoop
> > > > > >    * Apache Commons (Math)
> > > > > >
> > > > > > There is no other Apache project that we are aware of that
> > duplicates
> > > > the
> > > > > > functionality of the DataSketches library.
> > > > > >
> > > > > > === Risk: An Excessive Fascination with the Apache Brand ===
> > > > > >
> > > > > > With this proposal we are not seeking attention or publicity.
> > Rather,
> > > > we
> > > > > > firmly believe in the DataSketches library and concept and the
> > > ability
> > > > to
> > > > > > make the DataSketches library a powerful, yet simple-to-use toolkit
> > > for
> > > > > > data processing. While the DataSketches library has been open
> > source,
> > > > we
> > > > > > believe putting code on GitHub can only go so far. We see the
> > Apache
> > > > > > community, processes, and mission as critical for ensuring the
> > > > DataSketches
> > > > > > library is truly community-driven, positively impactful, and
> > > innovative
> > > > > > open source software. While Yahoo has taken a number of steps to
> > > > advance
> > > > > > its various open source projects, we believe the DataSketches
> > library
> > > > > > project is a great fit for the Apache Software Foundation due to
> > its
> > > > focus
> > > > > > on data processing and its relationships to existing ASF projects.
> > > > > >
> > > > > > === Risk: Cryptography ===
> > > > > >
> > > > > > DataSketches does not contain any cryptographic code and is not a
> > > > > > cryptographic product.
> > > > > >
> > > > > > == Documentation ==
> > > > > >
> > > > > > The following documentation is relevant to this proposal. Relevant
> > > > portions
> > > > > > of the documentation will be contributed to the Apache DataSketches
> > > > > > project.
> > > > > >
> > > > > > * DataSketches website: https://datasketches.github.io.
> > > > > >
> > > > > > * DataSketches website repository:
> > > > > > https://github.com/DataSketches/DataSketches.github.io
> > > > > >
> > > > > > > We will need an Apache website for this documentation similar to
> > > > > >
> > > > > > * https://datasketches.apache.org
> > > > > >
> > > > > > == Initial Source ==
> > > > > >
> > > > > > The initial source for DataSketches which we will submit to the
> > > Apache
> > > > > > Foundation will include a number of repositories which are
> > currently
> > > > hosted
> > > > > > under the GitHub.com/datasketches organization:
> > > > > >
> > > > > > All github.com/datasketches repositories including:
> > > > > >
> > > > > > * Java
> > > > > >    * sketches-core: This repository has the core sketching classes,
> > > > which
> > > > > > are leveraged by some of the other repositories. This repository
> > has
> > > no
> > > > > > external dependencies outside of the DataSketches/memory
> > repository,
> > > > Java
> > > > > > and TestNG for unit tests. This code is versioned and the latest
> > > > release
> > > > > > can be obtained from Maven Central.
> > > > > >    * memory: Low level, high-performance memory data-structure
> > > > management
> > > > > > primarily for off-heap.
> > > > > >    * sketches-android: This is a new repository dedicated to
> > sketches
> > > > > > designed to be run in a mobile client, such as a cell phone. It is
> > > > still in
> > > > > > development and should be considered experimental.
> > > > > >    * sketches-hive: This repository contains Hive UDFs and UDAFs
> > for
> > > > use
> > > > > > within Hadoop grid environments. This code has dependencies on
> > > > > > sketches-core as well as Hadoop and Hive. Users of this code are
> > > > advised to
> > > > > > use Maven to bring in all the required dependencies. This code is
> > > > versioned
> > > > > > and the latest release can be obtained from Maven Central.
> > > > > >    * sketches-pig: This repository contains Pig User Defined
> > > Functions
> > > > > > (UDF) for use within Hadoop grid environments. This code has
> > > > dependencies
> > > > > > on sketches-core as well as Hadoop and Pig. Users of this code are
> > > > advised
> > > > > > to use Maven to bring in all the required dependencies. This code
> > is
> > > > > > versioned and the latest release can be obtained from Maven
> > Central.
> > > > > >    * sketches-vector: This is a new repository dedicated to
> > sketches
> > > > for
> > > > > > vector and matrix operations. It is still somewhat experimental.
> > > > > >    * characterization: This relatively new repository is for code
> > > that
> > > > we
> > > > > > use to characterize the accuracy and speed performance of the
> > > sketches
> > > > in
> > > > > > the library and is constantly being updated. Examples of the job
> > > > command
> > > > > > files used for various tests can be found in the src/main/resources
> > > > > > > directory. Some of these tests can run for hours depending on their
> > > > > > configuration.
> > > > > >    * experimental: This repository is an experimental staging area
> > > for
> > > > code
> > > > > > that will eventually end up in another repository. This code is not
> > > > > > versioned and not registered with Maven Central.
> > > > > >    * sketches-misc: Demos and other code not related to production
> > > > > > deployment
> > > > > >
> > > > > > * C++ and Python
> > > > > >    * sketches-core-cpp: This is the C++/Python companion to the
> > Java
> > > > > > sketches-core. These implementations are binary compatible with
> > their
> > > > > > counterparts in Java. In other words, a sketch created and stored
> > in
> > > > C++
> > > > > > > can be opened and read in Java and vice versa. This site also has
> > our
> > > > > > Python adaptors that basically wrap the C++ implementations, making
> > > the
> > > > > > high performance C++ implementations available from Python.
> > > > > >    * sketches-postgres: This site provides the postgres-specific
> > > > adaptors
> > > > > > that wrap the C++ implementations making them available to the
> > > Postgres
> > > > > > database users.
> > > > > >    * characterization-cpp: This is the C++/Python companion to the
> > > Java
> > > > > > characterization repository.
> > > > > >    * experimental-cpp: This repository is an experimental staging
> > > area
> > > > for
> > > > > > C++ code that will eventually end up in another repository.
> > > > > >
> > > > > > * Command-Line Tools
> > > > > >    * sketches-cmd
> > > > > >    * homebrew-sketches
> > > > > >    * homebrew-sketches-cmd
> > > > > >
> > > > > > These projects have always been Apache 2.0 licensed. We intend to
> > > > bundle
> > > > > > all of these repositories since they are all complementary and
> > should
> > > > be
> > > > > > maintained in one project. Prior to our submission, we will combine
> > > > all of
> > > > > > these projects into a new git repository.
> > > > > >
> > > > > > == Source and Intellectual Property Submission Plan ==
> > > > > >
> > > > > > Contributors to the DataSketches project have also signed the Yahoo
> > > > > > > Individual Contributor License Agreement (https://yahoocla.herokuapp.com/)
> > > > > > > in order to contribute to the project.
> > > > > >
> > > > > > With respect to trademark rights, Yahoo does not hold a trademark
> > on
> > > > the
> > > > > > phrase “DataSketches.” Based on feedback and guidance we receive
> > > > during the
> > > > > > incubation process, we are open to renaming the project if
> > necessary
> > > > for
> > > > > > trademark or other concerns, but we would prefer not to have to do
> > > > that.
> > > > > >
> > > > > > == External Dependencies ==
> > > > > >
> > > > > > All external dependencies are licensed under an Apache 2.0 or
> > > > > > Apache-compatible license. As we grow the DataSketches community we
> > > > will
> > > > > > > configure our build process to require and validate that all
> > contributions
> > > > and
> > > > > > dependencies are licensed under the Apache 2.0 license or are under
> > > an
> > > > > > Apache-compatible license.
> > > > > >
> > > > > > == Required Resources ==
> > > > > >
> > > > > > === Mailing Lists ===
> > > > > >
> > > > > > We currently use a mix of mailing lists. We will migrate our
> > existing
> > > > > > mailing lists to the following:
> > > > > >
> > > > > > * dev@datasketches.incubator.apache.org
> > > > > >
> > > > > > * user@datasketches.incubator.apache.org
> > > > > >
> > > > > > * private@datasketches.incubator.apache.org
> > > > > >
> > > > > > * commits@datasketches.incubator.apache.org
> > > > > >
> > > > > > === Source Control ===
> > > > > >
> > > > > > The DataSketches team currently uses Git and would like to continue
> > > to
> > > > do
> > > > > > so. We request a Git repository for DataSketches with mirroring to
> > > > GitHub
> > > > > > > enabled, similar to the following:
> > > > > >
> > > > > > * https://github.com/apache/incubator-datasketches.git
> > > > > >
> > > > > > === Issue Tracking ===
> > > > > >
> > > > > > We request the creation of an Apache-hosted JIRA. The DataSketches
> > > > project
> > > > > > is currently using the public GitHub issue tracker and the public
> > > > Google
> > > > > > Groups forum/sketches-user for issue tracking and discussions. We
> > > will
> > > > > > migrate and combine from these two sources to the Apache JIRA.
> > > > > >
> > > > > > > Proposed JIRA ID: DATASKETCHES
> > > > > >
> > > > > > == Initial Committers ==
> > > > > >
> > > > > > > The following individuals have been extremely active in our
> > > > > > community and should have write (commit) permissions to the
> > > repository.
> > > > > >
> > > > > > * Eshcar Hillel                      [eshcar at verizonmedia dot
> > com]
> > > > > >
> > > > > > * Kevin Lang                    [langk at verizonmedia dot com]
> > > > > >
> > > > > > * Roman Leventov              [roman.leventov at c.metamarkets dot
> > > com]
> > > > > >
> > > > > > * Edo Liberty                   [libertye at amazon dot com]
> > > > > >
> > > > > > * Jon Malkin                    [jmalkin at verizonmedia dot com]
> > > > > >
> > > > > > * Lee Rhodes                  [lrhodes at verizonmedia dot com] &
> > > > [leerho
> > > > > > at gmail dot com]
> > > > > >
> > > > > > * Alexander Saydakov         [saydakov at verizonmedia dot com]
> > > > > >
> > > > > > * Justin Thaler                 [justin.thaler at georgetown dot
> > edu]
> > > > > >
> > > > > > == Affiliations ==
> > > > > >
> > > > > > The initial committers are from four organizations: Yahoo, Amazon,
> > > > > > Georgetown University, and Metamarkets/Snap.
> > > > > >
> > > > > > === Champion ===
> > > > > > (Recommended to me: )
> > > > > >
> > > > > > Liang Chen, Vice President of Apache CarbonData, [chenliang613 at
> > > > apache
> > > > > > dot org]
> > > > > > Jean-Baptiste Onofré, [jb at nanthrax dot net]
> > > > > >
> > > > > > === Nominated Mentors ===
> > > > > > (Recommended to me: )
> > > > > >
> > > > > > Liang Chen, Vice President of Apache CarbonData, [chenliang613 at
> > > > apache
> > > > > > dot org]
> > > > > > Jean-Baptiste Onofré, jb at nanthrax dot net
> > > > > > Gil Yehuda, gyehuda at verizonmedia dot com
> > > > > >
> > > > > > === Sponsoring Entity ===
> > > > > >
> > > > > > * The Apache Incubator    **** This is our 1st choice ****
> > > > > >
> > > > > > * Apache Druid. The incubating Apache Druid project might also be a
> > > > logical
> > > > > > sponsor. However, DataSketches has applications in many areas of
> > > > computing
> > > > > > outside of Druid so our preference and recommendation is that
> > > > DataSketches
> > > > > > would ultimately be a top-level Apache project.
> > > > > >
> > > > > > ________________
> > > > > > [1] In 2017 Verizon acquired Yahoo and merged it with previously
> > > > acquired
> > > > > > AOL. The merged entity was originally called Oath, Inc., but has
> > > > recently
> > > > > > been renamed Verizon Media, Inc., a wholly-owned subsidiary of
> > > Verizon,
> > > > > > Inc.  Since Yahoo is the more recognized name, references in this document
> > > > > > to Yahoo are also references to Verizon Media, Inc.
> > > > > >
> > > > > > On Fri, Feb 22, 2019 at 9:35 PM Kenneth Knowles <ke...@apache.org>
> > > > wrote:
> > > > > >
> > > > > > > The subject line has me interested already. Follow examples like
> > > this
> > > > > > > maybe?
> > > > > > >
> > > > > > > 1.
> > > > > > >
> > > > > > >
> > > > > >
> > > >
> > >
> > https://lists.apache.org/thread.html/a5db74cc9e5ae89b3bfa5f4b07bfcc18dae84b7098232fb897cd47b7@%3Cgeneral.incubator.apache.org%3E
> > > > > > > 2.
> > > > > > >
> > > > > > >
> > > > > >
> > > >
> > >
> > https://lists.apache.org/thread.html/5a7f6a218b11a1cac61fbd53f4c995fd7716f8ad3751cf9f171ebd57@%3Cgeneral.incubator.apache.org%3E
> > > > > > >
> > > > > > > Kenn
> > > > > > >
> > > > > > > On Fri, Feb 22, 2019 at 8:05 PM leerho <le...@gmail.com> wrote:
> > > > > > >
> > > > > > > > I'll try again ... :)
> > > > > > > >
> > > > > > > > On Fri, Feb 22, 2019 at 8:00 PM Ted Dunning <
> > > ted.dunning@gmail.com
> > > > >
> > > > > > > wrote:
> > > > > > > >
> > > > > > > >> It didn't make it again
> > > > > > > >>
> > > > > > > >> On Fri, Feb 22, 2019, 8:35 PM leerho <le...@gmail.com>
> > wrote:
> > > > > > > >>
> > > > > > > >> > I'm not sure the attached document made it through.
> > > > > > > >> >
> > > > > > > >> > On Fri, Feb 22, 2019 at 7:28 PM leerho <le...@gmail.com>
> > > > wrote:
> > > > > > > >> >
> > > > > > > >> > >
> > > > > > > >> > >
> > > > > > > >> >
> > > > > > > >>
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > >
> > --
> > From my cell phone.
> >
> 


Re: DataSketches Proposal

Posted by Ted Dunning <te...@gmail.com>.

There is also the general question of whether it is better to be a
top-level project or to become a contribution to Apache Commons Math.



On Sun, Feb 24, 2019 at 10:56 PM leerho <le...@gmail.com> wrote:

> Yes I will try that tomorrow.
>
> On Sun, Feb 24, 2019 at 7:34 PM Kenneth Knowles <ke...@apache.org> wrote:
>
> > Can you share the Google doc with the proposal? Per Ted's advice, we can
> > iterate quickly there and move it to the wiki when it becomes a bit more
> > stable.
> >
> > Kenn
> >
> > On Fri, Feb 22, 2019 at 10:21 PM leerho@gmail.com <le...@gmail.com>
> > wrote:
> >
> > > Thanks for the offer.  i am a neophyte at this process and email app!
>  I
> > > could use a lot of help getting this off the ground!  Also, I'm not
> sure
> > > that Mr. Chen and Mr. Onofré have fully accepted taking this on :)
> > >
> > > Lee.
> > >
> > > On 2019/02/23 06:03:58, Kenneth Knowles <ke...@apache.org> wrote:
> > > > Nice.
> > > >
> > > > I would very much like to help mentor this project, though you
> already
> > > have
> > > > a couple good ones.
> > > >
> > > > I concur with incubator as sponsoring entity.
> > > >
> > > > Kenn (VP Apache Beam)
> > > >
> > > > On Fri, Feb 22, 2019 at 9:45 PM leerho <le...@gmail.com> wrote:
> > > >
> > > > > I didn't realize that this mail list does not accept PDF files,
> > > apparently
> > > > > only text.  So let me try one more time ... :)  Please let me know
> if
> > > > > this works!
> > > > >
> > > > >
> > > > > = Apache DataSketches Proposal[1] =
> > > > >
> > > > > == Abstract ==
> > > > >
> > > > > DataSketches.GitHub.io is an open source, high-performance library
> > of
> > > > > stochastic streaming algorithms commonly called "sketches" in the
> > data
> > > > > sciences. Sketches are small, stateful programs that process
> massive
> > > data
> > > > > as a stream and can provide approximate answers, with mathematical
> > > > > guarantees, to computationally difficult queries
> orders-of-magnitude
> > > faster
> > > > > than traditional, exact methods.
> > > > >
> > > > > This proposal is to move DataSketches to the Apache Software
> > > > > Foundation(ASF) transferring ownership of its copyright
> intellectual
> > > > > property to the ASF.  Thereafter, DataSketches would be officially
> > > known as
> > > > > Apache DataSketches and its evolution and governance would come
> under
> > > the
> > > > > rules and guidance of the ASF.
> > > > >
> > > > > == Introduction ==
> > > > >
> > > > > The DataSketches library contains carefully crafted implementations
> > of
> > > > > sketch algorithms that meet rigorous standards of quality and
> > > performance
> > > > > and provide capabilities required for large-scale production
> systems
> > > that
> > > > > must process and analyze massive data. The DataSketches core
> > > repository is
> > > > > written in Java with a parallel core repository written in C++ that
> > > > > includes Python wrappers. The DataSketches library also includes
> > > special
> > > > > repositories for extending the core library for Apache Hive and
> > Apache
> > > Pig.
> > > > > The sketches developed in the different languages share a common binary
> > > > > storage format so that sketches created and stored in Java, for example,
> > > > > can be fully used in C++, and vice versa.  Because the stored sketch
> > > > > "images" are just a "blob" of bytes (similar to picture images), they can
> > > > > be shared across many different systems, languages and platforms.
> > > > >
> > > > > The DataSketches documentation website,
> > https://datasketches.github.io
> > > ,
> > > > > includes general tutorials, a comprehensive research section with
> > > > > references to relevant academic papers, extensive examples for
> using
> > > the
> > > > > core library directly as well as examples for accessing the library
> > in
> > > > > Hive, Pig, and Apache Spark.
> > > > >
> > > > > The DataSketches library also includes a characterization
> repository
> > > for
> > > > > long running test programs that are used for studying accuracy and
> > > > > performance of these sketches over wide ranges of input variables.
> > The
> > > data
> > > > > produced by these programs is used for generating the many
> > performance
> > > > > plots contained in the documentation website and for academic
> > > > > publications.
> > > > >
> > > > > The code repositories used for production are versioned and
> published
> > > to
> > > > > Maven Central on periodic intervals as the library evolves.
> > > > >
> > > > > The DataSketches library also includes several experimental
> > > repositories
> > > > > for use-cases outside the large-scale systems environments, such as
> > > > > sketches for mobile, IoT devices (Android), command-line access of
> > the
> > > > > sketch library, and an experimental repository for vector-based
> > > sketches
> > > > > that performs approximate Singular Value Decomposition (SVD)
> analysis
> > > that
> > > > > could potentially be used in Machine Learning (ML) applications.
> > > > >
> > > > > == Background ==
> > > > >
> > > > > The DataSketches library was started in 2012 as an internal Yahoo project to
> > > > > dramatically reduce time and resources required for distinct
> (unique)
> > > > > counting.  An extensive search on the Internet at the time yielded
> a
> > > number
> > > > > of theoretical papers on stochastic streaming algorithms with
> > > pseudocode
> > > > > examples, but we did not find any usable open-source code of the
> > > quality we
> > > > > felt we needed for our internal production systems.  So we started
> a
> > > small
> > > > > project (one person) to develop our own sketches working directly
> > from
> > > > > published theoretical papers.
> > > > >
> > > > > The DataSketches library was designed from the start with the
> > > objective of
> > > > > making these algorithms, usually only described in theoretical
> > papers,
> > > > > easily accessible to systems developers for use in our internal
> > > production
> > > > > systems. By necessity, the code had to be of the highest quality
> and
> > > > > thoroughly tested. The wide variety of our internal production
> > systems
> > > > > drove the requirement that the sketch implementations had to have
> an
> > > > > absolute minimum of external, run-time dependencies in order to
> > > simplify
> > > > > integration and troubleshooting.
> > > > >
> > > > > Our internal experiments demonstrated dramatic positive impact on
> the
> > > > > performance of our systems.  As a result, the DataSketches library
> > > quickly
> > > > > evolved to include different types of sketches for different types of
> > > > > queries, such as frequent-items (a.k.a. heavy-hitters) algorithms,
> > > > > quantile/histogram algorithms, and weighted and unweighted sampling
> > > > > algorithms.
> > > > >
> > > > > We quickly discovered that developing these sketch algorithms to be
> > > truly
> > > > > robust in production environments is quite difficult and requires
> > deep
> > > > > understanding of the underlying mathematics and statistics as well
> as
> > > > > extensive experience in developing high quality code for 24/7
> > > production
> > > > > systems. This is a difficult combination of skills for any one
> > > organization
> > > > > to collect and maintain over time. It became clear that this
> > technology
> > > > > needed a community larger than Yahoo to evolve.  In November, 2015,
> > > this
> > > > > factor, along with Yahoo’s strong experience and support of open
> > > source,
> > > > > led to the decision to open source this technology under an Apache
> > 2.0
> > > > > license on GitHub. Since that time our community has expanded considerably
> > > > > and the key contributors to this effort include leading research
> > > > > scientists from a number of universities as well as practitioners
> and
> > > > > researchers from a number of major corporations. The core of this
> > > group is
> > > > > very active as we meet weekly to discuss research directions and
> > > > > engineering priorities.
> > > > >
> > > > > It is important to note that our internal systems at Yahoo use the
> > > current
> > > > > public GitHub open source DataSketches library and not an internal
> > > version
> > > > > of the code.
> > > > >
> > > > > The close collaboration of scientific research and engineering
> > > development
> > > > > experience with actual massive-data processing systems has also
> > > produced
> > > > > new research publications in the field of stochastic streaming
> > > algorithms,
> > > > > for example:
> > > > >
> > > > > * Daniel Anderson, Pryce Bevan, Kevin J. Lang, Edo Liberty, Lee
> > > Rhodes, and
> > > > > Justin Thaler. A high-performance algorithm for identifying
> frequent
> > > items
> > > > > in data streams. In ACM IMC 2017.
> > > > >
> > > > > * Anirban Dasgupta, Kevin J. Lang, Lee Rhodes, and Justin Thaler. A
> > > > > framework for estimating stream expression cardinalities. In
> > *EDBT/ICDT
> > > > > Proceedings ‘16 *, pages 6:1–6:17, 2016.
> > > > >
> > > > > * Mina Ghashami, Edo Liberty, Jeff M. Phillips. Efficient Frequent
> > > > > Directions Algorithm for Sparse Matrices. In ACM SIGKDD Proceedings
> > > ‘16,
> > > > > pages 845-854, 2016.
> > > > >
> > > > > * Zohar S. Karnin, Kevin J. Lang, and Edo Liberty. Optimal quantile
> > > > > approximation in streams. In IEEE FOCS Proceedings ‘16, pages
> 71–78,
> > > 2016.
> > > > >
> > > > > * Kevin J Lang. Back to the future: an even more nearly optimal
> > > cardinality
> > > > > estimation algorithm. arXiv preprint
> > https://arxiv.org/abs/1708.06839,
> > > > > 2017.
> > > > >
> > > > > * Edo Liberty. Simple and deterministic matrix sketching. In ACM
> KDD
> > > > > Proceedings ‘13, pages 581– 588, 2013.
> > > > >
> > > > > * Edo Liberty, Michael Mitzenmacher, Justin Thaler, and Jonathan
> > > Ullman.
> > > > > Space lower bounds for itemset frequency sketches. In ACM PODS
> > > Proceedings
> > > > > ‘16, pages 441–454, 2016.
> > > > >
> > > > > * Michael Mitzenmacher, Thomas Steinke, and Justin Thaler.
> > Hierarchical
> > > > > heavy hitters with the space saving algorithm. In SIAM ALENEX
> > > Proceedings
> > > > > ‘12, pages 160–174, 2012.
> > > > >
> > > > > == The Rationale for Sketches ==
> > > > >
> > > > > In the analysis of big data there are often problem queries that
> > don’t
> > > > > scale because they require huge compute resources and time to
> > generate
> > > > > exact results. Examples include count distinct, quantiles, most
> > > frequent
> > > > > items, joins, matrix computations, and graph analysis.
> > > > >
> > > > > If we can loosen the requirement of “exact” results from our
> queries
> > > and be
> > > > > satisfied with approximate results, within some well understood
> > bounds
> > > of
> > > > > error, there is an entire branch of mathematics and data science
> that
> > > has
> > > > > evolved around developing algorithms that can produce approximate
> > > results
> > > > > with mathematically well-defined error properties.
> > > > >
> > > > > Adding the requirements that these algorithms must be small
> > > > > (compared to the size of the input data), sublinear (the size of the sketch
> > > > > must grow at a slower rate than the size of the input stream), streaming
> > > > > (they can only touch each data item once), and mergeable (suitable for
> > > > > distributed processing) defines a class of algorithms that can be
> > > > > described as small, stochastic, streaming, sublinear, mergeable algorithms,
> > > > > commonly called sketches (they also have other names, but we will use the
> > > > > term sketches from here on).
> > > > >
> > > > > To be truly streaming and be able to process data in a single pass,
> > > > > sketches must make absolute minimum assumptions about the input
> > stream.
> > > > > This is critically important, as there is no “second chance” to
> > > process the
> > > > > data.
> > > > >
> > > > > For example, sketches should not make assumptions about the order
> of
> > > stream
> > > > > items, the stream length, the dynamic range of values, or the
> > > distribution
> > > > > of item occurrence frequencies. Sketches should be tolerant of
> NaNs,
> > > Nulls
> > > > > and empty objects. About the only thing that the sketch needs to
> know
> > > about
> > > > > the stream is how to extract items from it and what type the item
> is,
> > > e.g.,
> > > > > is it a numeric value or a string.
> > > > >
> > > > > As far as the sketch is concerned, the input stream is a sequence
> of
> > > items
> > > > > in some unknown random order with unknown random values.
> > > > >
> > > > > The sketch is essentially a complex state machine and combined with
> > the
> > > > > random input stream defines a stochastic process. We then apply
> > > > > probabilistic methods to interpret the states of the stochastic
> > > process in
> > > > > order to extract useful information about the input stream itself.
> > The
> > > > > resulting information will be approximate, but we also use
> additional
> > > > > probabilistic methods to extract an estimate of the likely
> > probability
> > > > > distribution of error.
> > > > >
> > > > > There is a significant scientific contribution here that is
> defining
> > > the
> > > > > state machine, understanding the resulting stochastic process,
> > > developing
> > > > > the probabilistic methods, and proving mathematically, that it all
> > > works!
> > > > > This is why the scientific contributors to this project are a
> > critical
> > > and
> > > > > strategic component to our success.  The development engineers
> > > translate
> > > > > the concepts of the proposed state machine and probabilistic
> methods
> > > into
> > > > > production-quality code. Even more important, they work closely
> with
> > > the
> > > > > scientists, feeding back system and user requirements, which leads
> > not
> > > only
> > > > > to superior product design, but to new science as well.  A number of the
> > > > > scientific papers our members have published (see above) are a direct result
> > > > > of this close collaboration.
> > > > >
> > > > > Because sketches are small they can be processed extremely fast,
> > often
> > > many
> > > > > orders-of-magnitude faster than traditional exact computations. For
> > > > > interactive queries there may not be other viable alternatives, and
> > in
> > > the
> > > > > case of real-time analysis, sketches are the only known solution.
> > > > >
> > > > > For any system that needs to extract useful information from massive data,
> > > > > sketches are essential tools that should be tightly integrated into the
> > > > > system’s analysis capabilities. This technology has helped Yahoo
> > > > > successfully reduce data processing times from days to hours or
> > > minutes on
> > > > > a number of its internal platforms and has enabled subsecond
> queries
> > on
> > > > > real-time platforms that would have been infeasible without
> sketches.
> > > > >
> > > > > == The Rationale for Apache DataSketches ==
> > > > >
> > > > > Other open source implementations of sketch algorithms can be found
> > on
> > > the
> > > > > Internet. However, we have not yet found any open source
> > > implementations
> > > > > that are as comprehensive, engineered with the quality required for
> > > > > production systems, and with usable and guaranteed error properties. Large
> > > > > Internet companies, such as Google and Facebook, have published papers on
> > > > > sketching; however, their implementations of their published algorithms are
> > > > > proprietary and not available as open source.
> > > > >
> > > > > The DataSketches library already provides integrations with a
> number
> > of
> > > > > major Apache data processing platforms such as Apache Hive, Apache
> > Pig,
> > > > > Apache Spark and Apache Druid, and is also integrated with a number
> > of
> > > > > other open source data processing platforms such as Splice Machine,
> > > GCHQ
> > > > > Gaffer and PostgreSQL.
> > > > >
> > > > > We believe that having DataSketches as an Apache project will
> provide
> > > an
> > > > > immediate, worthwhile, and substantial contribution to the open
> > source
> > > > > community, will have a better opportunity to provide a meaningful
> > > > > contribution to both the science and engineering of sketching
> > > algorithms,
> > > > > and integrate with other Apache projects.  In addition, this is a
> > > > > significant opportunity for Apache to be the "go-to" destination
> for
> > > users
> > > > > that want to leverage this exciting technology.
> > > > >
> > > > > == Initial Goals ==
> > > > >
> > > > > We are breaking our initial goals into short-term (2-6 months) and
> > > > > intermediate to long-term (6 months to 2 years):
> > > > >
> > > > > Our short-term goals include:
> > > > >
> > > > > * Understanding and adapting to the Apache development process and
> > > > > structures.
> > > > >
> > > > > * Start refactoring the codebase and move the code of the various
> > > > > DataSketches repositories to the Apache Git repository.
> > > > >
> > > > > * Continue development of new features, functions, and fixes.
> > > > >
> > > > > * Specific sub-projects (e.g., C++ and Python) will continue to be
> > > > > developed and expanded.
> > > > >
> > > > >
> > > > > The intermediate to long term goals include:
> > > > >
> > > > > * Completing the design and implementation of the C++ sketches to
> > > > > complement what is already available in Java, and the Python
> wrappers
> > > of
> > > > > those C++ sketches.
> > > > >
> > > > > * Expanding the C++ build framework to include Windows and the
> > popular
> > > > > Linux variants.
> > > > >
> > > > > * Continued engagement with the scientific research community on
> the
> > > > > development of new algorithms for computationally difficult
> problems
> > > that
> > > > > heretofore have not had a sketching solution.
> > > > >
> > > > > == Current Status ==
> > > > >
> > > > > The DataSketches GitHub project has been quite successful.  As of
> > this
> > > > > writing (Feb, 2019) the number of downloads measured by the Nexus
> > > > > Repository Manager at https://oss.sonatype.org has grown by
> nearly a
> > > > > factor
> > > > > of 10 over the past year to about 55 thousand per month. The
> > > > > DataSketches/sketches-core repository has about 560 stars and 141
> > > forks,
> > > > > which is pretty good for a highly specialized library.
> > > > >
> > > > > === Development Practices ===
> > > > >
> > > > > ==== Source Control ====
> > > > >
> > > > > All of our developers have extensive experience with Git version
> > > control
> > > > > and follow accepted practices for use of Pull Requests (PRs), code
> > > reviews
> > > > > and commits to master, for example.
> > > > >
> > > > > ==== Testing ====
> > > > >
> > > > > Sketches, by their nature, are probabilistic programs and don’t necessarily
> > > > > behave deterministically.  For some of the sketches we intentionally insert
> > > > > random noise into the code as this gives us the mathematical
> > properties
> > > > > that we need to guarantee accuracy.  This can make the behavior of
> > > these
> > > > > algorithms quite unintuitive and provides significant challenges to
> > the
> > > > > developer who wishes to test these algorithms for correctness. As a
> > > result,
> > > > > our testing strategy includes two major components: unit tests, and
> > > > > characterization tests.
> > > > >
> > > > > ===== Unit Testing =====
> > > > >
> > > > > Our unit tests are primarily quick tests to make sure that we
> > exercise
> > > all
> > > > > critical paths in the code and that key branches are executed
> > > correctly. It
> > > > > is important that they execute relatively fast as they are
> generally
> > > run on
> > > > > every code build. The sketches-core repository alone has about 22
> > > thousand
> > > > > statements, over 1300 unit tests and code coverage of about 98.2%
> as
> > > > > measured by Atlassian/Clover.  Our goal is for all of our code repositories
> > > > > that are used in production to have code coverage greater than 90%.
> > > > >
> > > > > ===== Characterization Testing =====
> > > > >
> > > > > In order to test the probabilistic methods that are used to
> interpret
> > > the
> > > > > stochastic behaviors of our sketches we have a separate
> > > characterization
> > > > > repository that is dedicated to this.  To measure accuracy, for
> > > example,
> > > > > requires running thousands of trials at each of many different
> points
> > > along
> > > > > the domain axis. Each trial compares its estimated results against
> a
> > > known
> > > > > exact result producing an error for that trial.  These error
> > > measurements
> > > > > are then fed into our Quantiles sketch to capture the actual
> > > distribution
> > > > > of error at that point along the axis. We then select quantile
> > contours
> > > > > across all the distributions at points along the axis.  These
> > contours
> > > can
> > > > > then be plotted to reveal the shape of the actual error
> distribution.
> > > These
> > > > > distributions are not at all Gaussian, in fact they can be quite
> > > complex.
> > > > > Nonetheless, these distributions are then checked against our
> > > statistical
> > > > > guarantees inherent to the specific sketch algorithm and its
> > > parameters.
> > > > > There are many examples of these characterization error
> distributions
> > > on
> > > > > our website. The runtimes of these tests can be very long and can
> > range
> > > > > from many minutes to hours, and some can run for days.  Currently,
> we
> > > have
> > > > > separate characterization repositories for Java and C++ / Python.
> > > > >
> > > > > It is our goal that we perform this characterization analysis for
> all
> > > of
> > > > > our sketches.  By definition, the code that runs these
> > characterization
> > > > > tests is open-source so others can run these tests as well.  We do
> > not
> > > have
> > > > > formal releases of this code (because it is not production code)
> and
> > > it is
> > > > > not published to Maven Central.
> > > > >
> > > > > === Meritocracy ===
> > > > >
> > > > > DataSketches was initially developed based on requirements within
> > > Yahoo. As
> > > > > a project on GitHub, DataSketches has received contributions from
> > > numerous
> > > > > individual developers from around the world, dedicated research
> work
> > > from
> > > > > senior scientists at Amazon and Visa, and academic researchers from
> > > > > Georgetown University, Princeton, and MIT.
> > > > >
> > > > > As a project under incubation, we are committed to expanding our
> > > effort to
> > > > > build an environment which supports a meritocracy. We are focused
> on
> > > > > engaging the community and other related projects for support and
> > > > > contributions. Moreover, we are committed to ensure contributors
> and
> > > > > committers to DataSketches come from a broad mix of organizations
> > > through a
> > > > > merit-based decision process during incubation. We believe strongly
> > in
> > > the
> > > > > DataSketches premise that fulfills the concept of a well engineered
> > and
> > > > > scientifically rigorous library that implements these powerful
> > > algorithms
> > > > > and are committed to growing an inclusive community of DataSketches
> > > > > contributors and users.
> > > > >
> > > > > === Community ===
> > > > >
> > > > > Yahoo has a long history and active engagement in the Open Source
> > > > > community. Major projects include: Vespa.ai, Bullet, Moloch,
> > Panoptes,
> > > > > Screwdriver.cd, Athenz, HaloDB, Maha, Mendel, TensorFlowOnSpark,
> > > gifshot,
> > > > > fluxible, as well as the creation, contribution and incubation of
> > many
> > > > > Apache projects such as Apache Hadoop, Pig, Bookkeeper, Oozie,
> > > Zookeeper,
> > > > > Omid, Pulsar, Traffic Server, Storm, Druid, and many more.
> > > > >
> > > > > Every day, DataSketches is actively used by organizations and
> > > > > institutions around the world for batch and stream processing of
> > data.
> > > We
> > > > > believe acceptance will allow us to consolidate existing
> > > > > DataSketches-related work, grow the DataSketches community, and
> > deepen
> > > > > connections between DataSketches and other open source projects.
> > > > >
> > > > > === Introduction to the Core Developers & Contributors ===
> > > > >
> > > > > The core developers and contributors for DataSketches are from
> > diverse
> > > > > backgrounds, but primarily are scientists that love engineering and
> > > > > engineers that love science. A large part of the value we bring
> comes
> > > from
> > > > > this synthesis.  These individuals have already contributed
> > > substantially
> > > > > to the code, algorithms, and/or mathematical proofs that form the
> > > basis of
> > > > > the library.
> > > > >
> > > > > This core group also forms the Initial Committers with write permissions to
> > > > > the repository. Those marked with (*) meet weekly to plan the research and
> > > > > engineering direction of the project.
> > > > >
> > > > > ==== Scientists That Love Engineering ====
> > > > >
> > > > > * Eshcar Hillel: Senior Research Scientist, Yahoo Labs, Israel. Interests:
> > > > > distributed systems, scalable systems and platforms for big data
> > > > > processing, concurrent algorithms and data structures.
> > > > >
> > > > > * Kevin Lang: (*) Distinguished Research Scientist, Yahoo Labs,
> > > Sunnyvale,
> > > > > California. Interests: algorithms, theoretical and applied
> > mathematics,
> > > > > encoding and compression theory, theoretical and applied
> performance
> > > > > optimization.
> > > > >
> > > > > * Edo Liberty: (*) Director of Research, Head of Amazon AI Labs,
> Palo
> > > Alto,
> > > > > California. Manages the algorithms group at Amazon AI. We build
> > > scalable
> > > > > machine learning systems and algorithms which are used both
> > internally
> > > and
> > > > > externally by customers of SageMaker, AWS's flagship machine
> learning
> > > > > platform.
> > > > >
> > > > > * Jon Malkin: (*) Senior Scientist, Yahoo Labs, Sunnyvale.
> Interests:
> > > > > Computational advertising, machine learning, speech recognition,
> > > > > data-driven analysis, large scale experimentation, big data,
> > > stream/complex
> > > > > event processing
> > > > >
> > > > > * Justin Thaler: (*) Assistant Professor, Department of Computer
> > > Science,
> > > > > Georgetown University, Washington D.C. Interests: algorithms and
> > > > > computational complexity, complexity theory, quantum algorithms,
> > > private
> > > > > data analysis, and learning theory, developing efficient streaming
> > and
> > > > > sketching algorithms
> > > > >
> > > > > ==== Engineers That Love Science ====
> > > > >
> > > > > * Roman Leventov: Senior Software Engineer,  Metamarkets / Snap.
> > > Interests:
> > > > > design and implementation of data storing and data processing
> > > (distributed)
> > > > > systems, performance optimization, CPU performance, mechanical
> > > sympathy,
> > > > > JVM performance, API design, databases, (concurrent) data
> structures,
> > > > > memory management, garbage collection algorithms, language design
> and
> > > > > runtimes (their tradeoffs), distributed systems (cloud) efficiency,
> > > Linux,
> > > > > code quality, code transformation, pure functional programming
> > models,
> > > > > Haskell.
> > > > >
> > > > > * Lee Rhodes: (*) Distinguished Architect, lead developer and
> founder
> > > of
> > > > > the DataSketches project, Yahoo, Sunnyvale, California.  Interests:
> > > > > streaming algorithms, mathematics, computer science, high quality
> and
> > > high
> > > > > performance code for the analysis of massive data, bridging the
> > divide
> > > > > between theory and practice.
> > > > >
> > > > > * Alexander Saydakov: (*) Senior Software Engineer, Yahoo,
> Sunnyvale,
> > > > > California. Interests: applied mathematics, computer science, big
> > data,
> > > > > distributed systems.
> > > > >
> > > > > === Introduction to Additional Interested Contributors ===
> > > > >
> > > > > > These folks have been involved and contributed intermittently, and
> > > > > > are strong supporters of this project.
> > > > >
> > > > > * Frank Grimes: GitHub ID: frankgrimes97
> > > > >
> > > > > * Mina Ghashami: [mina.ghashami at gmail dot com] Ph.D. Computer
> > > Science,
> > > > > Univ of Utah. Interests: Machine Learning, Data Mining, matrix
> > > > > approximation, streaming algorithms, randomized linear algebra.
> > > > >
> > > > > * Christopher Musco: [christopher.musco at gmail dot com] Ph.D.
> > > Computer
> > > > > Science, Research Instructor, Princeton University. Interests:
> > > algorithmic
> > > > > foundations of data science and machine learning, efficient methods
> > for
> > > > > processing and understanding large datasets, often working at the
> > > > > intersection of theoretical computer science, numerical linear
> > > algebra, and
> > > > > optimization.
> > > > >
> > > > > * Graham Cormode: [g.cormode at warwick.ac dot uk] Ph.D. Computer
> > > Science,
> > > > > Professor, Warwick University, Warwick, England. Interests: all
> > > aspects of
> > > > > the "data lifecycle", from data collection and cleaning, through
> > > mining and
> > > > > analytics. (Professor Cormode is one of the world’s leading
> > scientists
> > > in
> > > > > sketching algorithms)
> > > > >
> > > > > === Alignment ===
> > > > >
> > > > > The DataSketches library already provides integrations and example
> > > code for
> > > > > Apache Hive, Apache Pig, Apache Spark and is deeply integrated into
> > > Apache
> > > > > Druid.
> > > > >
> > > > > == Known Risks ==
> > > > >
> > > > > > The following subsections address specific risks identified by the
> > > > > > ASF.
> > > > >
> > > > > === Risk: Orphaned Products ===
> > > > >
> > > > > The DataSketches library is presently used by a number of
> > > organizations,
> > > > > from small startups to Fortune 100 companies, to construct
> production
> > > > > pipelines that must process and analyze massive data. Yahoo has a
> > > long-term
> > > > > commitment to continue to advance the DataSketches library;
> moreover,
> > > > > DataSketches is seeing increasing interest, development, and
> adoption
> > > from
> > > > > many diverse organizations from around the world. Due to its
> growing
> > > > > adoption, we feel it is quite unlikely that this project would
> become
> > > > > orphaned.
> > > > >
> > > > > === Risk: Inexperience with Open Source ===
> > > > >
> > > > > Yahoo believes strongly in open source and the exchange of
> > information
> > > to
> > > > > advance new ideas and work. Examples of this commitment are active
> > open
> > > > > source projects such as those mentioned above. With DataSketches,
> we
> > > have
> > > > > been increasingly open and forward-looking; we have published a
> > number
> > > of
> > > > > papers about breakthrough developments in the science of streaming
> > > > > algorithms (mentioned above) that also reference the DataSketches
> > > library.
> > > > > Our submission to the Apache Software Foundation is a logical
> > > extension of
> > > > > our commitment to open source software.
> > > > >
> > > > > Key committers at Yahoo with strong open source backgrounds include
> > > Aaron
> > > > > Gresch, Alan Carroll, Alessandro Bellina, Anastasia Braginsky,
> > Andrews
> > > > > Sahaya Albert, Arun S A G, Atul Mohan, Brad McMillen, Bryan Call,
> > Daryn
> > > > > Sharp, Dav Glass, David Carlin, Derek Dagit, Eric Payne, Eshcar
> > Hillel,
> > > > > Ethan Li, Fei Deng, Francis Christopher Liu, Francisco
> > Perez-Sorrosal,
> > > Gil
> > > > > > Yehuda, Govind Menon, Hang Yang, Jacob Estelle, Jai Asher, James
> > > Penick,
> > > > > Jason Kenny, Jay Pipes, Jim Rollenhagen, Joe Francis, Jon Eagles,
> > > Kihwal
> > > > > Lee, Kishorkumar Patil, Koji Noguchi, Kuhu Shukla, Michael
> Trelinski,
> > > > > Mithun Radhakrishnan, Nathan Roberts, Ohad Shacham, Olga L.
> > Natkovich,
> > > > > Parth Kamlesh Gandhi, Rajan Dhabalia, Rohini Palaniswamy, Ruby Loo,
> > > Ryan
> > > > > Bridges, Sanket Chintapalli, Satish Subhashrao Saley, Shu Kit Chan,
> > Sri
> > > > > Harsha Mekala, Susan Hinrichs, Yonatan Gottesman, and many more.
> > > > >
> > > > > > All of our core developers are committed to learning about the Apache
> > > > > > process and to giving back to the community.
> > > > >
> > > > > === Risk: Homogeneous Developers ===
> > > > >
> > > > > > The majority of committers in this proposal belong to Yahoo because
> > > > > > DataSketches emerged from an internal Yahoo project. This proposal
> > > > > > also includes developers and contributors from other companies who are
> > > > > > actively involved with other Apache projects, such as Druid.  We
> > > expect our
> > > > > entry into incubation will allow us to expand the number of
> > > individuals and
> > > > > organizations participating in DataSketches development.
> > > > >
> > > > > === Risk: Reliance on Salaried Developers ===
> > > > >
> > > > > Because the DataSketches library originated within Yahoo, it has
> been
> > > > > developed primarily by salaried Yahoo developers and we expect that
> > to
> > > > > continue to be the case near term. However, since we placed this
> > > library
> > > > > into open-source we have had a number of significant contributions
> > from
> > > > > engineers and scientists from outside of Yahoo. We expect our
> > reliance
> > > on
> > > > > Yahoo salaried developers will decrease over time. Nonetheless,
> Yahoo
> > > is
> > > > > committed to continue its strong support of this important project.
> > > > >
> > > > > === Risk: Lack of Relationship to other Apache Products ===
> > > > >
> > > > > DataSketches already directly interoperates with or utilizes
> several
> > > > > existing Apache projects.
> > > > >
> > > > > * Build
> > > > >    * Apache Maven
> > > > >
> > > > > * Integrations and adaptors for the following projects naturally
> have
> > > them
> > > > > as dependencies
> > > > >    * Apache Hive
> > > > >    * Apache Pig
> > > > >    * Apache Druid
> > > > >    * Apache Spark
> > > > >
> > > > > * Additional dependencies for the above integrations and adaptors
> > > include
> > > > >    * Apache Hadoop
> > > > >    * Apache Commons (Math)
> > > > >
> > > > > There is no other Apache project that we are aware of that
> duplicates
> > > the
> > > > > functionality of the DataSketches library.
> > > > >
> > > > > === Risk: An Excessive Fascination with the Apache Brand ===
> > > > >
> > > > > With this proposal we are not seeking attention or publicity.
> Rather,
> > > we
> > > > > firmly believe in the DataSketches library and concept and the
> > ability
> > > to
> > > > > make the DataSketches library a powerful, yet simple-to-use toolkit
> > for
> > > > > data processing. While the DataSketches library has been open
> source,
> > > we
> > > > > believe putting code on GitHub can only go so far. We see the
> Apache
> > > > > community, processes, and mission as critical for ensuring the
> > > DataSketches
> > > > > library is truly community-driven, positively impactful, and
> > innovative
> > > > > open source software. While Yahoo has taken a number of steps to
> > > advance
> > > > > its various open source projects, we believe the DataSketches
> library
> > > > > project is a great fit for the Apache Software Foundation due to
> its
> > > focus
> > > > > on data processing and its relationships to existing ASF projects.
> > > > >
> > > > > === Risk: Cryptography ===
> > > > >
> > > > > DataSketches does not contain any cryptographic code and is not a
> > > > > cryptographic product.
> > > > >
> > > > > == Documentation ==
> > > > >
> > > > > The following documentation is relevant to this proposal. Relevant
> > > portions
> > > > > of the documentation will be contributed to the Apache DataSketches
> > > > > project.
> > > > >
> > > > > * DataSketches website: https://datasketches.github.io.
> > > > >
> > > > > * DataSketches website repository:
> > > > > https://github.com/DataSketches/DataSketches.github.io
> > > > >
> > > > > > We will need an Apache website for this documentation similar to
> > > > >
> > > > > * https://datasketches.apache.org
> > > > >
> > > > > == Initial Source ==
> > > > >
> > > > > The initial source for DataSketches which we will submit to the
> > Apache
> > > > > Foundation will include a number of repositories which are
> currently
> > > hosted
> > > > > under the GitHub.com/datasketches organization:
> > > > >
> > > > > All github.com/datasketches repositories including:
> > > > >
> > > > > * Java
> > > > >    * sketches-core: This repository has the core sketching classes,
> > > which
> > > > > are leveraged by some of the other repositories. This repository
> has
> > no
> > > > > external dependencies outside of the DataSketches/memory
> repository,
> > > Java
> > > > > and TestNG for unit tests. This code is versioned and the latest
> > > release
> > > > > can be obtained from Maven Central.
> > > > >    * memory: Low level, high-performance memory data-structure
> > > management
> > > > > primarily for off-heap.
> > > > >    * sketches-android: This is a new repository dedicated to
> sketches
> > > > > designed to be run in a mobile client, such as a cell phone. It is
> > > still in
> > > > > development and should be considered experimental.
> > > > >    * sketches-hive: This repository contains Hive UDFs and UDAFs
> for
> > > use
> > > > > within Hadoop grid environments. This code has dependencies on
> > > > > sketches-core as well as Hadoop and Hive. Users of this code are
> > > advised to
> > > > > use Maven to bring in all the required dependencies. This code is
> > > versioned
> > > > > and the latest release can be obtained from Maven Central.
> > > > >    * sketches-pig: This repository contains Pig User Defined
> > Functions
> > > > > (UDF) for use within Hadoop grid environments. This code has
> > > dependencies
> > > > > on sketches-core as well as Hadoop and Pig. Users of this code are
> > > advised
> > > > > to use Maven to bring in all the required dependencies. This code
> is
> > > > > versioned and the latest release can be obtained from Maven
> Central.
> > > > >    * sketches-vector: This is a new repository dedicated to
> sketches
> > > for
> > > > > vector and matrix operations. It is still somewhat experimental.
> > > > >    * characterization: This relatively new repository is for code
> > that
> > > we
> > > > > use to characterize the accuracy and speed performance of the
> > sketches
> > > in
> > > > > the library and is constantly being updated. Examples of the job
> > > command
> > > > > files used for various tests can be found in the src/main/resources
> > > > > > directory. Some of these tests can run for hours depending on their
> > > > > configuration.
> > > > >    * experimental: This repository is an experimental staging area
> > for
> > > code
> > > > > that will eventually end up in another repository. This code is not
> > > > > versioned and not registered with Maven Central.
> > > > >    * sketches-misc: Demos and other code not related to production
> > > > > deployment
> > > > >
> > > > > * C++ and Python
> > > > >    * sketches-core-cpp: This is the C++/Python companion to the
> Java
> > > > > sketches-core. These implementations are binary compatible with
> their
> > > > > counterparts in Java. In other words, a sketch created and stored
> in
> > > C++
> > > > > > can be opened and read in Java and vice versa. This site also has
> our
> > > > > Python adaptors that basically wrap the C++ implementations, making
> > the
> > > > > high performance C++ implementations available from Python.
> > > > >    * sketches-postgres: This site provides the postgres-specific
> > > adaptors
> > > > > that wrap the C++ implementations making them available to the
> > Postgres
> > > > > database users.
> > > > >    * characterization-cpp: This is the C++/Python companion to the
> > Java
> > > > > characterization repository.
> > > > >    * experimental-cpp: This repository is an experimental staging
> > area
> > > for
> > > > > C++ code that will eventually end up in another repository.
> > > > >
> > > > > * Command-Line Tools
> > > > >    * sketches-cmd
> > > > >    * homebrew-sketches
> > > > >    * homebrew-sketches-cmd
> > > > >
> > > > > These projects have always been Apache 2.0 licensed. We intend to
> > > bundle
> > > > > all of these repositories since they are all complementary and
> should
> > > be
> > > > > maintained in one project. Prior to our submission, we will combine
> > > all of
> > > > > these projects into a new git repository.
> > > > >
> > > > > == Source and Intellectual Property Submission Plan ==
> > > > >
> > > > > > Contributors to the DataSketches project have also signed the Yahoo
> > > > > > Individual Contributor License Agreement
> > > > > > (https://yahoocla.herokuapp.com/) in order to contribute to the project.
> > > > >
> > > > > With respect to trademark rights, Yahoo does not hold a trademark
> on
> > > the
> > > > > phrase “DataSketches.” Based on feedback and guidance we receive
> > > during the
> > > > > incubation process, we are open to renaming the project if
> necessary
> > > for
> > > > > trademark or other concerns, but we would prefer not to have to do
> > > that.
> > > > >
> > > > > == External Dependencies ==
> > > > >
> > > > > All external dependencies are licensed under an Apache 2.0 or
> > > > > Apache-compatible license. As we grow the DataSketches community we
> > > will
> > > > > > configure our build process to require and validate that all
> contributions
> > > and
> > > > > dependencies are licensed under the Apache 2.0 license or are under
> > an
> > > > > Apache-compatible license.
> > > > >
> > > > > == Required Resources ==
> > > > >
> > > > > === Mailing Lists ===
> > > > >
> > > > > We currently use a mix of mailing lists. We will migrate our
> existing
> > > > > mailing lists to the following:
> > > > >
> > > > > * dev@datasketches.incubator.apache.org
> > > > >
> > > > > * user@datasketches.incubator.apache.org
> > > > >
> > > > > * private@datasketches.incubator.apache.org
> > > > >
> > > > > * commits@datasketches.incubator.apache.org
> > > > >
> > > > > === Source Control ===
> > > > >
> > > > > The DataSketches team currently uses Git and would like to continue
> > to
> > > do
> > > > > so. We request a Git repository for DataSketches with mirroring to
> > > GitHub
> > > > > > enabled, similar to the following:
> > > > >
> > > > > * https://github.com/apache/incubator-datasketches.git
> > > > >
> > > > > === Issue Tracking ===
> > > > >
> > > > > We request the creation of an Apache-hosted JIRA. The DataSketches
> > > project
> > > > > is currently using the public GitHub issue tracker and the public
> > > Google
> > > > > Groups forum/sketches-user for issue tracking and discussions. We
> > will
> > > > > > migrate and combine these two sources into the Apache JIRA.
> > > > >
> > > > > Proposed Jira ID: DATASKETCHES
> > > > >
> > > > > == Initial Committers ==
> > > > >
> > > > > > The following individuals have been extremely active in our
> > > > > community and should have write (commit) permissions to the
> > repository.
> > > > >
> > > > > * Eshcar Hillel                      [eshcar at verizonmedia dot
> com]
> > > > >
> > > > > * Kevin Lang                    [langk at verizonmedia dot com]
> > > > >
> > > > > * Roman Leventov              [roman.leventov at c.metamarkets dot
> > com]
> > > > >
> > > > > * Edo Liberty                   [libertye at amazon dot com]
> > > > >
> > > > > * Jon Malkin                    [jmalkin at verizonmedia dot com]
> > > > >
> > > > > * Lee Rhodes                  [lrhodes at verizonmedia dot com] &
> > > [leerho
> > > > > at gmail dot com]
> > > > >
> > > > > * Alexander Saydakov         [saydakov at verizonmedia dot com]
> > > > >
> > > > > * Justin Thaler                 [justin.thaler at georgetown dot
> edu]
> > > > >
> > > > > == Affiliations ==
> > > > >
> > > > > The initial committers are from four organizations: Yahoo, Amazon,
> > > > > Georgetown University, and Metamarkets/Snap.
> > > > >
> > > > > === Champion ===
> > > > > (Recommended to me: )
> > > > >
> > > > > Liang Chen, Vice President of Apache CarbonData, [chenliang613 at
> > > apache
> > > > > dot org]
> > > > > > Jean-Baptiste Onofré, [jb at nanthrax dot net]
> > > > >
> > > > > === Nominated Mentors ===
> > > > > (Recommended to me: )
> > > > >
> > > > > Liang Chen, Vice President of Apache CarbonData, [chenliang613 at
> > > apache
> > > > > dot org]
> > > > > Jean-Baptiste Onofré, jb at nanthrax dot net
> > > > > Gil Yehuda, gyehuda at verizonmedia dot com
> > > > >
> > > > > === Sponsoring Entity ===
> > > > >
> > > > > * The Apache Incubator    **** This is our 1st choice ****
> > > > >
> > > > > * Apache Druid. The incubating Apache Druid project might also be a
> > > logical
> > > > > sponsor. However, DataSketches has applications in many areas of
> > > computing
> > > > > outside of Druid so our preference and recommendation is that
> > > DataSketches
> > > > > would ultimately be a top-level Apache project.
> > > > >
> > > > > ________________
> > > > > [1] In 2017 Verizon acquired Yahoo and merged it with previously
> > > acquired
> > > > > AOL. The merged entity was originally called Oath, Inc., but has
> > > recently
> > > > > been renamed Verizon Media, Inc., a wholly-owned subsidiary of
> > Verizon,
> > > > > Inc.  Since Yahoo is the more recognized name, references in this
> > > document
> > > > > > to Yahoo are also references to Verizon Media, Inc.
> > > > >
> > > > > On Fri, Feb 22, 2019 at 9:35 PM Kenneth Knowles <ke...@apache.org>
> > > wrote:
> > > > >
> > > > > > The subject line has me interested already. Follow examples like
> > this
> > > > > > maybe?
> > > > > >
> > > > > > 1.
> > > > > >
> > > > > >
> > > > >
> > >
> >
> https://lists.apache.org/thread.html/a5db74cc9e5ae89b3bfa5f4b07bfcc18dae84b7098232fb897cd47b7@%3Cgeneral.incubator.apache.org%3E
> > > > > > 2.
> > > > > >
> > > > > >
> > > > >
> > >
> >
> https://lists.apache.org/thread.html/5a7f6a218b11a1cac61fbd53f4c995fd7716f8ad3751cf9f171ebd57@%3Cgeneral.incubator.apache.org%3E
> > > > > >
> > > > > > Kenn
> > > > > >
> > > > > > On Fri, Feb 22, 2019 at 8:05 PM leerho <le...@gmail.com> wrote:
> > > > > >
> > > > > > > I'll try again ... :)
> > > > > > >
> > > > > > > On Fri, Feb 22, 2019 at 8:00 PM Ted Dunning <
> > ted.dunning@gmail.com
> > > >
> > > > > > wrote:
> > > > > > >
> > > > > > >> It didn't make it again
> > > > > > >>
> > > > > > >> On Fri, Feb 22, 2019, 8:35 PM leerho <le...@gmail.com>
> wrote:
> > > > > > >>
> > > > > > >> > I'm not sure the attached document made it through.
> > > > > > >> >
> > > > > > >> > On Fri, Feb 22, 2019 at 7:28 PM leerho <le...@gmail.com>
> > > wrote:
> > > > > > >> >
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> >
> > > > > > >>
> > > > > > >
> > > > > > >
> > > ---------------------------------------------------------------------
> > > > > > > To unsubscribe, e-mail:
> general-unsubscribe@incubator.apache.org
> > > > > > > For additional commands, e-mail:
> > general-help@incubator.apache.org
> > > > > >
> > > > >
> > > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> > > For additional commands, e-mail: general-help@incubator.apache.org
> > >
> > >
> >
> --
> From my cell phone.
>

Re: DataSketches Proposal

Posted by leerho <le...@gmail.com>.

Yes I will try that tomorrow.

On Sun, Feb 24, 2019 at 7:34 PM Kenneth Knowles <ke...@apache.org> wrote:

> Can you share the Google doc with the proposal? Per Ted's advice, we can
> iterate quickly there and move it to the wiki when it becomes a bit more
> stable.
>
> Kenn
>
> On Fri, Feb 22, 2019 at 10:21 PM leerho@gmail.com <le...@gmail.com>
> wrote:
>
> > Thanks for the offer.  I am a neophyte at this process and email app!   I
> > could use a lot of help getting this off the ground!  Also, I'm not sure
> > that Mr. Chen and Mr. Onofré have fully accepted taking this on :)
> >
> > Lee.
> >
> > On 2019/02/23 06:03:58, Kenneth Knowles <ke...@apache.org> wrote:
> > > Nice.
> > >
> > > I would very much like to help mentor this project, though you already
> > have
> > > a couple good ones.
> > >
> > > I concur with incubator as sponsoring entity.
> > >
> > > Kenn (VP Apache Beam)
> > >
> > > On Fri, Feb 22, 2019 at 9:45 PM leerho <le...@gmail.com> wrote:
> > >
> > > > I didn't realize that this mail list does not accept PDF files,
> > apparently
> > > > only text.  So let me try one more time ... :)  Please let me know if
> > > > this works!
> > > >
> > > >
> > > > = Apache DataSketches Proposal[1] =
> > > >
> > > > == Abstract ==
> > > >
> > > > DataSketches.GitHub.io is an open source, high-performance library
> of
> > > > stochastic streaming algorithms commonly called "sketches" in the
> data
> > > > sciences. Sketches are small, stateful programs that process massive
> > data
> > > > as a stream and can provide approximate answers, with mathematical
> > > > guarantees, to computationally difficult queries orders-of-magnitude
> > faster
> > > > than traditional, exact methods.
> > > >
> > > > This proposal is to move DataSketches to the Apache Software
> > > > > Foundation (ASF), transferring ownership of its copyright intellectual
> > > > property to the ASF.  Thereafter, DataSketches would be officially
> > known as
> > > > Apache DataSketches and its evolution and governance would come under
> > the
> > > > rules and guidance of the ASF.
> > > >
> > > > == Introduction ==
> > > >
> > > > The DataSketches library contains carefully crafted implementations
> of
> > > > sketch algorithms that meet rigorous standards of quality and
> > performance
> > > > and provide capabilities required for large-scale production systems
> > that
> > > > must process and analyze massive data. The DataSketches core
> > repository is
> > > > written in Java with a parallel core repository written in C++ that
> > > > includes Python wrappers. The DataSketches library also includes
> > special
> > > > repositories for extending the core library for Apache Hive and
> Apache
> > Pig.
> > > > The sketches developed in the different languages share a common
> binary
> > > > storage format so that sketches created and stored in Java, for
> > example,
> > > > > can be fully used in C++, and vice versa.  Because the stored sketch
> > > > "images" are just a "blob" of bytes (similar to picture images), they
> > can
> > > > be shared across many different systems, languages and platforms.
> > > >
> > > > The DataSketches documentation website,
> https://datasketches.github.io
> > ,
> > > > includes general tutorials, a comprehensive research section with
> > > > references to relevant academic papers, extensive examples for using
> > the
> > > > core library directly as well as examples for accessing the library
> in
> > > > Hive, Pig, and Apache Spark.
> > > >
> > > > The DataSketches library also includes a characterization repository
> > for
> > > > long running test programs that are used for studying accuracy and
> > > > performance of these sketches over wide ranges of input variables.
> The
> > data
> > > > produced by these programs is used for generating the many
> performance
> > > > plots contained in the documentation website and for academic
> > > > publications.
> > > >
> > > > The code repositories used for production are versioned and published
> > to
> > > > Maven Central on periodic intervals as the library evolves.
> > > >
> > > > The DataSketches library also includes several experimental
> > repositories
> > > > for use-cases outside the large-scale systems environments, such as
> > > > sketches for mobile, IoT devices (Android), command-line access of
> the
> > > > sketch library, and an experimental repository for vector-based
> > sketches
> > > > that performs approximate Singular Value Decomposition (SVD) analysis
> > that
> > > > could potentially be used in Machine Learning (ML) applications.
> > > >
> > > > == Background ==
> > > >
> > > > > The DataSketches library was started in 2012 as an internal Yahoo
> project
> > to
> > > > dramatically reduce time and resources required for distinct (unique)
> > > > counting.  An extensive search on the Internet at the time yielded a
> > number
> > > > of theoretical papers on stochastic streaming algorithms with
> > pseudocode
> > > > examples, but we did not find any usable open-source code of the
> > quality we
> > > > felt we needed for our internal production systems.  So we started a
> > small
> > > > project (one person) to develop our own sketches working directly
> from
> > > > published theoretical papers.
> > > >
> > > > The DataSketches library was designed from the start with the
> > objective of
> > > > making these algorithms, usually only described in theoretical
> papers,
> > > > easily accessible to systems developers for use in our internal
> > production
> > > > systems. By necessity, the code had to be of the highest quality and
> > > > thoroughly tested. The wide variety of our internal production
> systems
> > > > drove the requirement that the sketch implementations had to have an
> > > > absolute minimum of external, run-time dependencies in order to
> > simplify
> > > > integration and troubleshooting.
> > > >
> > > > Our internal experiments demonstrated dramatic positive impact on the
> > > > performance of our systems.  As a result, the DataSketches library
> > quickly
> > > > evolved to include different types of sketches for different types of
> > > > > queries, such as frequent-items (a.k.a. heavy-hitters) algorithms,
> > > > quantile/histogram algorithms, and weighted and unweighted sampling
> > > > algorithms.
> > > >
> > > > We quickly discovered that developing these sketch algorithms to be
> > truly
> > > > robust in production environments is quite difficult and requires
> deep
> > > > understanding of the underlying mathematics and statistics as well as
> > > > extensive experience in developing high quality code for 24/7
> > production
> > > > systems. This is a difficult combination of skills for any one
> > organization
> > > > to collect and maintain over time. It became clear that this
> technology
> > > > needed a community larger than Yahoo to evolve.  In November, 2015,
> > this
> > > > factor, along with Yahoo’s strong experience and support of open
> > source,
> > > > led to the decision to open source this technology under an Apache
> 2.0
> > > > license on GitHub. Since that time our community has expanded
> > considerably
> > > > > and the key contributors to this effort include leading research
> > > > scientists from a number of universities as well as practitioners and
> > > > researchers from a number of major corporations. The core of this
> > group is
> > > > very active as we meet weekly to discuss research directions and
> > > > engineering priorities.
> > > >
> > > > It is important to note that our internal systems at Yahoo use the
> > current
> > > > public GitHub open source DataSketches library and not an internal
> > version
> > > > of the code.
> > > >
> > > > The close collaboration of scientific research and engineering
> > development
> > > > experience with actual massive-data processing systems has also
> > produced
> > > > new research publications in the field of stochastic streaming
> > algorithms,
> > > > for example:
> > > >
> > > > * Daniel Anderson, Pryce Bevan, Kevin J. Lang, Edo Liberty, Lee
> > Rhodes, and
> > > > Justin Thaler. A high-performance algorithm for identifying frequent
> > items
> > > > in data streams. In ACM IMC 2017.
> > > >
> > > > * Anirban Dasgupta, Kevin J. Lang, Lee Rhodes, and Justin Thaler. A
> > > > framework for estimating stream expression cardinalities. In
> *EDBT/ICDT
> > > > Proceedings ‘16 *, pages 6:1–6:17, 2016.
> > > >
> > > > * Mina Ghashami, Edo Liberty, Jeff M. Phillips. Efficient Frequent
> > > > Directions Algorithm for Sparse Matrices. In ACM SIGKDD Proceedings
> > ‘16,
> > > > pages 845-854, 2016.
> > > >
> > > > * Zohar S. Karnin, Kevin J. Lang, and Edo Liberty. Optimal quantile
> > > > approximation in streams. In IEEE FOCS Proceedings ‘16, pages 71–78,
> > 2016.
> > > >
> > > > * Kevin J Lang. Back to the future: an even more nearly optimal
> > cardinality
> > > > estimation algorithm. arXiv preprint
> https://arxiv.org/abs/1708.06839,
> > > > 2017.
> > > >
> > > > * Edo Liberty. Simple and deterministic matrix sketching. In ACM KDD
> > > > Proceedings ‘13, pages 581– 588, 2013.
> > > >
> > > > * Edo Liberty, Michael Mitzenmacher, Justin Thaler, and Jonathan
> > Ullman.
> > > > Space lower bounds for itemset frequency sketches. In ACM PODS
> > Proceedings
> > > > ‘16, pages 441–454, 2016.
> > > >
> > > > * Michael Mitzenmacher, Thomas Steinke, and Justin Thaler.
> Hierarchical
> > > > heavy hitters with the space saving algorithm. In SIAM ALENEX
> > Proceedings
> > > > ‘12, pages 160–174, 2012.
> > > >
> > > > == The Rationale for Sketches ==
> > > >
> > > > In the analysis of big data there are often problem queries that
> don’t
> > > > scale because they require huge compute resources and time to
> generate
> > > > exact results. Examples include count distinct, quantiles, most
> > frequent
> > > > items, joins, matrix computations, and graph analysis.
> > > >
> > > > If we can loosen the requirement of “exact” results from our queries
> > and be
> > > > satisfied with approximate results, within some well understood
> bounds
> > of
> > > > error, there is an entire branch of mathematics and data science that
> > has
> > > > evolved around developing algorithms that can produce approximate
> > results
> > > > with mathematically well-defined error properties.
> > > >
> > > > > Adding the requirements that these algorithms must be small
> > > > > (compared to the size of the input data), sublinear (the size of the
> > > > > sketch must grow at a slower rate than the size of the input stream),
> > > > > streaming (they can only touch each data item once), and mergeable
> > > > > (suitable for distributed processing) defines a class of algorithms
> > > > > that can be
> > > > described as small, stochastic, streaming, sublinear mergeable
> > algorithms,
> > > > commonly called sketches (they also have other names, but we will use
> > the
> > > > term sketches from here on).
> > > >
> > > > To be truly streaming and be able to process data in a single pass,
> > > > > sketches must make an absolute minimum of assumptions about the input
> stream.
> > > > This is critically important, as there is no “second chance” to
> > process the
> > > > data.
> > > >
> > > > For example, sketches should not make assumptions about the order of
> > stream
> > > > items, the stream length, the dynamic range of values, or the
> > distribution
> > > > of item occurrence frequencies. Sketches should be tolerant of NaNs,
> > Nulls
> > > > and empty objects. About the only thing that the sketch needs to know
> > about
> > > > the stream is how to extract items from it and what type the item is,
> > e.g.,
> > > > is it a numeric value or a string.
> > > >
> > > > As far as the sketch is concerned, the input stream is a sequence of
> > items
> > > > in some unknown random order with unknown random values.
> > > >
> > > > The sketch is essentially a complex state machine and, combined with
> > > > the random input stream, defines a stochastic process. We then apply
> > > > probabilistic methods to interpret the states of the stochastic
> > > > process in order to extract useful information about the input stream
> > > > itself. The resulting information will be approximate, but we also use
> > > > additional probabilistic methods to extract an estimate of the likely
> > > > probability distribution of error.
> > > >
> > > > There is a significant scientific contribution here in defining the
> > > > state machine, understanding the resulting stochastic process,
> > > > developing the probabilistic methods, and proving mathematically that
> > > > it all works! This is why the scientific contributors to this project
> > > > are a critical and strategic component of our success. The
> > > > development engineers translate the concepts of the proposed state
> > > > machine and probabilistic methods into production-quality code. Even
> > > > more important, they work closely with the scientists, feeding back
> > > > system and user requirements, which leads not only to superior product
> > > > design, but to new science as well. A number of the scientific papers
> > > > our members have published (see above) are a direct result of this
> > > > close collaboration.
> > > >
> > > > Because sketches are small they can be processed extremely fast,
> often
> > many
> > > > orders-of-magnitude faster than traditional exact computations. For
> > > > interactive queries there may not be other viable alternatives, and
> in
> > the
> > > > case of real-time analysis, sketches are the only known solution.
> > > >
> > > > For any system that needs to extract useful information from massive
> > > > data, sketches are essential tools that should be tightly integrated
> > > > into the system’s analysis capabilities. This technology has helped
> > > > Yahoo successfully reduce data processing times from days to hours or
> > > > minutes on a number of its internal platforms and has enabled
> > > > subsecond queries on real-time platforms that would have been
> > > > infeasible without sketches.
> > > >
> > > > == The Rationale for Apache DataSketches ==
> > > >
> > > > Other open source implementations of sketch algorithms can be found
> > > > on the Internet. However, we have not yet found any open source
> > > > implementations that are as comprehensive, engineered with the quality
> > > > required for production systems, and with usable and guaranteed error
> > > > properties. Large Internet companies, such as Google and Facebook,
> > > > have published papers on sketching; however, their implementations of
> > > > their published algorithms are proprietary and not available as open
> > > > source.
> > > >
> > > > The DataSketches library already provides integrations with a number
> of
> > > > major Apache data processing platforms such as Apache Hive, Apache
> Pig,
> > > > Apache Spark and Apache Druid, and is also integrated with a number
> of
> > > > other open source data processing platforms such as Splice Machine,
> > GCHQ
> > > > Gaffer and PostgreSQL.
> > > >
> > > > We believe that having DataSketches as an Apache project will provide
> > an
> > > > immediate, worthwhile, and substantial contribution to the open
> source
> > > > community, will have a better opportunity to provide a meaningful
> > > > contribution to both the science and engineering of sketching
> > algorithms,
> > > > and integrate with other Apache projects.  In addition, this is a
> > > > significant opportunity for Apache to be the "go-to" destination for
> > users
> > > > that want to leverage this exciting technology.
> > > >
> > > > == Initial Goals ==
> > > >
> > > > We are breaking our initial goals into short-term (2-6 months) and
> > > > intermediate to long-term (6 months to 2 years):
> > > >
> > > > Our short-term goals include:
> > > >
> > > > * Understanding and adapting to the Apache development process and
> > > > structures.
> > > >
> > > > * Start refactoring the codebase and move the code from the various
> > > > DataSketches repositories to an Apache Git repository.
> > > >
> > > > * Continue development of new features, functions, and fixes.
> > > >
> > > > * Specific sub-projects (e.g., C++ and Python) will continue to be
> > > > developed and expanded.
> > > >
> > > >
> > > > The intermediate to long-term goals include:
> > > >
> > > > * Completing the design and implementation of the C++ sketches to
> > > > complement what is already available in Java, and the Python wrappers
> > of
> > > > those C++ sketches.
> > > >
> > > > * Expanding the C++ build framework to include Windows and the
> popular
> > > > Linux variants.
> > > >
> > > > * Continued engagement with the scientific research community on the
> > > > development of new algorithms for computationally difficult problems
> > that
> > > > heretofore have not had a sketching solution.
> > > >
> > > > == Current Status ==
> > > >
> > > > The DataSketches GitHub project has been quite successful. As of
> > > > this writing (February 2019), the number of downloads measured by the
> > > > Nexus Repository Manager at https://oss.sonatype.org has grown by
> > > > nearly a factor of 10 over the past year to about 55 thousand per
> > > > month. The DataSketches/sketches-core repository has about 560 stars
> > > > and 141 forks, which is pretty good for a highly specialized library.
> > > >
> > > > === Development Practices ===
> > > >
> > > > ==== Source Control ====
> > > >
> > > > All of our developers have extensive experience with Git version
> > control
> > > > and follow accepted practices for use of Pull Requests (PRs), code
> > reviews
> > > > and commits to master, for example.
> > > >
> > > > ==== Testing ====
> > > >
> > > > Sketches, by their nature, are probabilistic programs and don’t
> > > > necessarily behave deterministically. For some of the sketches we
> > > > intentionally insert random noise into the code, as this gives us the
> > > > mathematical properties that we need to guarantee accuracy. This can
> > > > make the behavior of these algorithms quite unintuitive and presents
> > > > significant challenges to the developer who wishes to test these
> > > > algorithms for correctness. As a result, our testing strategy includes
> > > > two major components: unit tests and characterization tests.
> > > >
> > > > ===== Unit Testing =====
> > > >
> > > > Our unit tests are primarily quick tests to make sure that we
> exercise
> > all
> > > > critical paths in the code and that key branches are executed
> > correctly. It
> > > > is important that they execute relatively fast as they are generally
> > run on
> > > > every code build. The sketches-core repository alone has about 22
> > thousand
> > > > statements, over 1300 unit tests and code coverage of about 98.2% as
> > > > measured by Atlassian/Clover.  It is our goal for all of our code
> > > > repositories that are used in production that they have code coverage
> > > > greater than 90%.
> > > >
> > > > ===== Characterization Testing =====
> > > >
> > > > In order to test the probabilistic methods that are used to interpret
> > > > the stochastic behaviors of our sketches, we have a separate
> > > > characterization repository that is dedicated to this. Measuring
> > > > accuracy, for example, requires running thousands of trials at each of
> > > > many different points along the domain axis. Each trial compares its
> > > > estimated results against a known exact result, producing an error for
> > > > that trial. These error measurements are then fed into our Quantiles
> > > > sketch to capture the actual distribution of error at that point along
> > > > the axis. We then select quantile contours across all the
> > > > distributions at points along the axis. These contours can then be
> > > > plotted to reveal the shape of the actual error distribution. These
> > > > distributions are not at all Gaussian; in fact, they can be quite
> > > > complex. Nonetheless, these distributions are then checked against the
> > > > statistical guarantees inherent to the specific sketch algorithm and
> > > > its parameters. There are many examples of these characterization
> > > > error distributions on our website. The runtimes of these tests can be
> > > > very long and can range from many minutes to hours, and some can run
> > > > for days. Currently, we have separate characterization repositories
> > > > for Java and C++ / Python.
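
The characterization procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions and is not the project's characterization code: the estimator under test is a simulated K-Minimum-Values distinct counter, uniform random numbers stand in for ideal hash values, and exact sorting stands in for the Quantiles sketch. All function names here (kmv_estimate, relative_error_quantiles, characterize, contour) are hypothetical.

```python
import random


def kmv_estimate(n_distinct, k, rng):
    """One trial: estimate n_distinct from the k smallest of n uniform values.

    The uniform values simulate ideal hashes of n_distinct unique items.
    """
    if n_distinct <= k:
        return float(n_distinct)  # the toy estimator is exact under capacity
    hashes = sorted(rng.random() for _ in range(n_distinct))
    return (k - 1) / hashes[k - 1]


def relative_error_quantiles(n_distinct, k, trials, ranks, rng):
    """Run many trials at one point on the axis; return error quantiles."""
    errors = sorted(
        kmv_estimate(n_distinct, k, rng) / n_distinct - 1.0
        for _ in range(trials)
    )
    # Exact quantiles by sorting; a quantiles sketch would approximate these.
    return {r: errors[min(int(r * trials), trials - 1)] for r in ranks}


def characterize(points, k=64, trials=200, ranks=(0.05, 0.5, 0.95), seed=1):
    """Map each point on the domain axis to its error quantiles."""
    rng = random.Random(seed)
    return {
        n: relative_error_quantiles(n, k, trials, ranks, rng) for n in points
    }


def contour(result, rank):
    """One quantile contour: the chosen rank followed across all points."""
    return [(n, result[n][rank]) for n in sorted(result)]


result = characterize([32, 256, 1000])
for rank in (0.05, 0.5, 0.95):
    print(rank, contour(result, rank))
```

Plotting each contour against the number of distinct items gives the kind of error-distribution picture the paragraph describes; a real run would use far more trials and many more points along the axis, which is why such tests can take hours.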
> > > >
> > > > It is our goal that we perform this characterization analysis for all
> > of
> > > > our sketches.  By definition, the code that runs these
> characterization
> > > > tests is open-source so others can run these tests as well.  We do
> not
> > have
> > > > formal releases of this code (because it is not production code) and
> > it is
> > > > not published to Maven Central.
> > > >
> > > > === Meritocracy ===
> > > >
> > > > DataSketches was initially developed based on requirements within
> > Yahoo. As
> > > > a project on GitHub, DataSketches has received contributions from
> > numerous
> > > > individual developers from around the world, dedicated research work
> > from
> > > > senior scientists at Amazon and Visa, and academic researchers from
> > > > Georgetown University, Princeton, and MIT.
> > > >
> > > > As a project under incubation, we are committed to expanding our
> > > > effort to build an environment which supports a meritocracy. We are
> > > > focused on engaging the community and other related projects for
> > > > support and contributions. Moreover, we are committed to ensuring that
> > > > contributors and committers to DataSketches come from a broad mix of
> > > > organizations through a merit-based decision process during
> > > > incubation. We believe strongly in the DataSketches premise of a
> > > > well-engineered and scientifically rigorous library that implements
> > > > these powerful algorithms, and we are committed to growing an
> > > > inclusive community of DataSketches contributors and users.
> > > >
> > > > === Community ===
> > > >
> > > > Yahoo has a long history and active engagement in the Open Source
> > > > community. Major projects include: Vespa.ai, Bullet, Moloch,
> Panoptes,
> > > > Screwdriver.cd, Athenz, HaloDB, Maha, Mendel, TensorFlowOnSpark,
> > gifshot,
> > > > fluxible, as well as the creation, contribution and incubation of
> many
> > > > Apache projects such as Apache Hadoop, Pig, Bookkeeper, Oozie,
> > Zookeeper,
> > > > Omid, Pulsar, Traffic Server, Storm, Druid, and many more.
> > > >
> > > > Every day, DataSketches is actively used by organizations and
> > > > institutions around the world for batch and stream processing of
> > > > data. We believe acceptance will allow us to consolidate existing
> > > > DataSketches-related work, grow the DataSketches community, and
> > > > deepen connections between DataSketches and other open source
> > > > projects.
> > > >
> > > > === Introduction to the Core Developers & Contributors ===
> > > >
> > > > The core developers and contributors for DataSketches are from
> diverse
> > > > backgrounds, but primarily are scientists that love engineering and
> > > > engineers that love science. A large part of the value we bring comes
> > from
> > > > this synthesis.  These individuals have already contributed
> > substantially
> > > > to the code, algorithms, and/or mathematical proofs that form the
> > basis of
> > > > the library.
> > > >
> > > > This core group also forms the Initial Committers with write
> > > > permissions to the repository. Those marked with (*) meet weekly to
> > > > plan the research and engineering direction of the project.
> > > >
> > > > ==== Scientists That Love Engineering ====
> > > >
> > > > * Eshcar Hillel: Senior Research Scientist, Yahoo Labs, Israel.
> > > > Interests: distributed systems, scalable systems and platforms for big
> > > > data processing, concurrent algorithms and data structures.
> > > >
> > > > * Kevin Lang: (*) Distinguished Research Scientist, Yahoo Labs,
> > Sunnyvale,
> > > > California. Interests: algorithms, theoretical and applied
> mathematics,
> > > > encoding and compression theory, theoretical and applied performance
> > > > optimization.
> > > >
> > > > * Edo Liberty: (*) Director of Research, Head of Amazon AI Labs, Palo
> > Alto,
> > > > California. Manages the algorithms group at Amazon AI. We build
> > scalable
> > > > machine learning systems and algorithms which are used both
> internally
> > and
> > > > externally by customers of SageMaker, AWS's flagship machine learning
> > > > platform.
> > > >
> > > > * Jon Malkin: (*) Senior Scientist, Yahoo Labs, Sunnyvale. Interests:
> > > > computational advertising, machine learning, speech recognition,
> > > > data-driven analysis, large scale experimentation, big data,
> > > > stream/complex event processing.
> > > >
> > > > * Justin Thaler: (*) Assistant Professor, Department of Computer
> > > > Science, Georgetown University, Washington D.C. Interests: algorithms
> > > > and computational complexity, complexity theory, quantum algorithms,
> > > > private data analysis, learning theory, and developing efficient
> > > > streaming and sketching algorithms.
> > > >
> > > > ==== Engineers That Love Science ====
> > > >
> > > > * Roman Leventov: Senior Software Engineer,  Metamarkets / Snap.
> > Interests:
> > > > design and implementation of data storing and data processing
> > (distributed)
> > > > systems, performance optimization, CPU performance, mechanical
> > sympathy,
> > > > JVM performance, API design, databases, (concurrent) data structures,
> > > > memory management, garbage collection algorithms, language design and
> > > > runtimes (their tradeoffs), distributed systems (cloud) efficiency,
> > Linux,
> > > > code quality, code transformation, pure functional programming
> models,
> > > > Haskell.
> > > >
> > > > * Lee Rhodes: (*) Distinguished Architect, lead developer and founder
> > of
> > > > the DataSketches project, Yahoo, Sunnyvale, California.  Interests:
> > > > streaming algorithms, mathematics, computer science, high quality and
> > high
> > > > performance code for the analysis of massive data, bridging the
> divide
> > > > between theory and practice.
> > > >
> > > > * Alexander Saydakov: (*) Senior Software Engineer, Yahoo, Sunnyvale,
> > > > California. Interests: applied mathematics, computer science, big
> data,
> > > > distributed systems.
> > > >
> > > > === Introduction to Additional Interested Contributors ===
> > > >
> > > > These folks have been only intermittently involved as contributors,
> > > > but are strong supporters of this project.
> > > >
> > > > * Frank Grimes: GitHub ID: frankgrimes97
> > > >
> > > > * Mina Ghashami: [mina.ghashami at gmail dot com] Ph.D. Computer
> > Science,
> > > > Univ of Utah. Interests: Machine Learning, Data Mining, matrix
> > > > approximation, streaming algorithms, randomized linear algebra.
> > > >
> > > > * Christopher Musco: [christopher.musco at gmail dot com] Ph.D.
> > Computer
> > > > Science, Research Instructor, Princeton University. Interests:
> > algorithmic
> > > > foundations of data science and machine learning, efficient methods
> for
> > > > processing and understanding large datasets, often working at the
> > > > intersection of theoretical computer science, numerical linear
> > algebra, and
> > > > optimization.
> > > >
> > > > * Graham Cormode: [g.cormode at warwick.ac dot uk] Ph.D. Computer
> > Science,
> > > > Professor, Warwick University, Warwick, England. Interests: all
> > aspects of
> > > > the "data lifecycle", from data collection and cleaning, through
> > mining and
> > > > analytics. (Professor Cormode is one of the world’s leading
> scientists
> > in
> > > > sketching algorithms)
> > > >
> > > > === Alignment ===
> > > >
> > > > The DataSketches library already provides integrations and example
> > code for
> > > > Apache Hive, Apache Pig, Apache Spark and is deeply integrated into
> > Apache
> > > > Druid.
> > > >
> > > > == Known Risks ==
> > > >
> > > > The following subsections are specific risks that have been
> identified
> > by
> > > > the ASF that need to be addressed.
> > > >
> > > > === Risk: Orphaned Products ===
> > > >
> > > > The DataSketches library is presently used by a number of
> > organizations,
> > > > from small startups to Fortune 100 companies, to construct production
> > > > pipelines that must process and analyze massive data. Yahoo has a
> > long-term
> > > > commitment to continue to advance the DataSketches library; moreover,
> > > > DataSketches is seeing increasing interest, development, and adoption
> > from
> > > > many diverse organizations from around the world. Due to its growing
> > > > adoption, we feel it is quite unlikely that this project would become
> > > > orphaned.
> > > >
> > > > === Risk: Inexperience with Open Source ===
> > > >
> > > > Yahoo believes strongly in open source and the exchange of
> information
> > to
> > > > advance new ideas and work. Examples of this commitment are active
> open
> > > > source projects such as those mentioned above. With DataSketches, we
> > have
> > > > been increasingly open and forward-looking; we have published a
> number
> > of
> > > > papers about breakthrough developments in the science of streaming
> > > > algorithms (mentioned above) that also reference the DataSketches
> > library.
> > > > Our submission to the Apache Software Foundation is a logical
> > extension of
> > > > our commitment to open source software.
> > > >
> > > > Key committers at Yahoo with strong open source backgrounds include
> > Aaron
> > > > Gresch, Alan Carroll, Alessandro Bellina, Anastasia Braginsky,
> Andrews
> > > > Sahaya Albert, Arun S A G, Atul Mohan, Brad McMillen, Bryan Call,
> Daryn
> > > > Sharp, Dav Glass, David Carlin, Derek Dagit, Eric Payne, Eshcar
> Hillel,
> > > > Ethan Li, Fei Deng, Francis Christopher Liu, Francisco Perez-Sorrosal,
> > > > Gil Yehuda, Govind Menon, Hang Yang, Jacob Estelle, Jai Asher, James
> > > > Penick,
> > > > Jason Kenny, Jay Pipes, Jim Rollenhagen, Joe Francis, Jon Eagles,
> > Kihwal
> > > > Lee, Kishorkumar Patil, Koji Noguchi, Kuhu Shukla, Michael Trelinski,
> > > > Mithun Radhakrishnan, Nathan Roberts, Ohad Shacham, Olga L.
> Natkovich,
> > > > Parth Kamlesh Gandhi, Rajan Dhabalia, Rohini Palaniswamy, Ruby Loo,
> > Ryan
> > > > Bridges, Sanket Chintapalli, Satish Subhashrao Saley, Shu Kit Chan,
> Sri
> > > > Harsha Mekala, Susan Hinrichs, Yonatan Gottesman, and many more.
> > > >
> > > > All of our core developers are committed to learning about the Apache
> > > > process and to giving back to the community.
> > > >
> > > > === Risk: Homogeneous Developers ===
> > > >
> > > > The majority of committers in this proposal belong to Yahoo due to
> the
> > fact
> > > > that DataSketches has emerged from an internal Yahoo project. This
> > proposal
> > > > also includes developers and contributors from other companies, and
> > who are
> > > > actively involved with other Apache projects, such as Druid.  We
> > expect our
> > > > entry into incubation will allow us to expand the number of
> > individuals and
> > > > organizations participating in DataSketches development.
> > > >
> > > > === Risk: Reliance on Salaried Developers ===
> > > >
> > > > Because the DataSketches library originated within Yahoo, it has been
> > > > developed primarily by salaried Yahoo developers and we expect that
> to
> > > > continue to be the case near term. However, since we placed this
> > library
> > > > into open-source we have had a number of significant contributions
> from
> > > > engineers and scientists from outside of Yahoo. We expect our
> reliance
> > on
> > > > Yahoo salaried developers will decrease over time. Nonetheless, Yahoo
> > is
> > > > committed to continue its strong support of this important project.
> > > >
> > > > === Risk: Lack of Relationship to other Apache Products ===
> > > >
> > > > DataSketches already directly interoperates with or utilizes several
> > > > existing Apache projects.
> > > >
> > > > * Build
> > > >    * Apache Maven
> > > >
> > > > * Integrations and adaptors for the following projects naturally have
> > them
> > > > as dependencies
> > > >    * Apache Hive
> > > >    * Apache Pig
> > > >    * Apache Druid
> > > >    * Apache Spark
> > > >
> > > > * Additional dependencies for the above integrations and adaptors
> > include
> > > >    * Apache Hadoop
> > > >    * Apache Commons (Math)
> > > >
> > > > There is no other Apache project that we are aware of that duplicates
> > the
> > > > functionality of the DataSketches library.
> > > >
> > > > === Risk: An Excessive Fascination with the Apache Brand ===
> > > >
> > > > With this proposal we are not seeking attention or publicity. Rather,
> > we
> > > > firmly believe in the DataSketches library and concept and the
> ability
> > to
> > > > make the DataSketches library a powerful, yet simple-to-use toolkit
> for
> > > > data processing. While the DataSketches library has been open source,
> > we
> > > > believe putting code on GitHub can only go so far. We see the Apache
> > > > community, processes, and mission as critical for ensuring the
> > DataSketches
> > > > library is truly community-driven, positively impactful, and
> innovative
> > > > open source software. While Yahoo has taken a number of steps to
> > advance
> > > > its various open source projects, we believe the DataSketches library
> > > > project is a great fit for the Apache Software Foundation due to its
> > focus
> > > > on data processing and its relationships to existing ASF projects.
> > > >
> > > > === Risk: Cryptography ===
> > > >
> > > > DataSketches does not contain any cryptographic code and is not a
> > > > cryptographic product.
> > > >
> > > > == Documentation ==
> > > >
> > > > The following documentation is relevant to this proposal. Relevant
> > portions
> > > > of the documentation will be contributed to the Apache DataSketches
> > > > project.
> > > >
> > > > * DataSketches website: https://datasketches.github.io.
> > > >
> > > > * DataSketches website repository:
> > > > https://github.com/DataSketches/DataSketches.github.io
> > > >
> > > > We will need an Apache website for this documentation similar to
> > > >
> > > > * https://datasketches.apache.org
> > > >
> > > > == Initial Source ==
> > > >
> > > > The initial source for DataSketches which we will submit to the
> Apache
> > > > Foundation will include a number of repositories which are currently
> > hosted
> > > > under the GitHub.com/datasketches organization:
> > > >
> > > > All github.com/datasketches repositories including:
> > > >
> > > > * Java
> > > >    * sketches-core: This repository has the core sketching classes,
> > which
> > > > are leveraged by some of the other repositories. This repository has
> no
> > > > external dependencies outside of the DataSketches/memory repository,
> > Java
> > > > and TestNG for unit tests. This code is versioned and the latest
> > release
> > > > can be obtained from Maven Central.
> > > >    * memory: Low level, high-performance memory data-structure
> > management
> > > > primarily for off-heap.
> > > >    * sketches-android: This is a new repository dedicated to sketches
> > > > designed to be run in a mobile client, such as a cell phone. It is
> > still in
> > > > development and should be considered experimental.
> > > >    * sketches-hive: This repository contains Hive UDFs and UDAFs for
> > use
> > > > within Hadoop grid environments. This code has dependencies on
> > > > sketches-core as well as Hadoop and Hive. Users of this code are
> > advised to
> > > > use Maven to bring in all the required dependencies. This code is
> > versioned
> > > > and the latest release can be obtained from Maven Central.
> > > >    * sketches-pig: This repository contains Pig User Defined
> Functions
> > > > (UDF) for use within Hadoop grid environments. This code has
> > dependencies
> > > > on sketches-core as well as Hadoop and Pig. Users of this code are
> > advised
> > > > to use Maven to bring in all the required dependencies. This code is
> > > > versioned and the latest release can be obtained from Maven Central.
> > > >    * sketches-vector: This is a new repository dedicated to sketches
> > for
> > > > vector and matrix operations. It is still somewhat experimental.
> > > >    * characterization: This relatively new repository is for code
> > > > that we use to characterize the accuracy and speed performance of the
> > > > sketches in the library and is constantly being updated. Examples of
> > > > the job command files used for various tests can be found in the
> > > > src/main/resources directory. Some of these tests can run for hours
> > > > depending on their configuration.
> > > >    * experimental: This repository is an experimental staging area
> for
> > code
> > > > that will eventually end up in another repository. This code is not
> > > > versioned and not registered with Maven Central.
> > > >    * sketches-misc: Demos and other code not related to production
> > > > deployment
> > > >
> > > > * C++ and Python
> > > >    * sketches-core-cpp: This is the C++/Python companion to the Java
> > > > sketches-core. These implementations are binary compatible with their
> > > > counterparts in Java. In other words, a sketch created and stored in
> > > > C++ can be opened and read in Java and vice versa. This site also has
> > > > our Python adaptors that basically wrap the C++ implementations,
> > > > making the high performance C++ implementations available from Python.
> > > >    * sketches-postgres: This site provides the postgres-specific
> > adaptors
> > > > that wrap the C++ implementations making them available to the
> Postgres
> > > > database users.
> > > >    * characterization-cpp: This is the C++/Python companion to the
> Java
> > > > characterization repository.
> > > >    * experimental-cpp: This repository is an experimental staging
> area
> > for
> > > > C++ code that will eventually end up in another repository.
> > > >
> > > > * Command-Line Tools
> > > >    * sketches-cmd
> > > >    * homebrew-sketches
> > > >    * homebrew-sketches-cmd
> > > >
> > > > These projects have always been Apache 2.0 licensed. We intend to
> > bundle
> > > > all of these repositories since they are all complementary and should
> > be
> > > > maintained in one project. Prior to our submission, we will combine
> > all of
> > > > these projects into a new git repository.
> > > >
> > > > == Source and Intellectual Property Submission Plan ==
> > > >
> > > > Contributors to the DataSketches project have also signed the Yahoo
> > > > Individual Contributor License Agreement
> > > > (https://yahoocla.herokuapp.com/) in order to contribute to the
> > > > project.
> > > >
> > > > With respect to trademark rights, Yahoo does not hold a trademark on
> > the
> > > > phrase “DataSketches.” Based on feedback and guidance we receive
> > during the
> > > > incubation process, we are open to renaming the project if necessary
> > for
> > > > trademark or other concerns, but we would prefer not to have to do
> > that.
> > > >
> > > > == External Dependencies ==
> > > >
> > > > All external dependencies are licensed under an Apache 2.0 or
> > > > Apache-compatible license. As we grow the DataSketches community we
> > > > will configure our build process to require and validate that all
> > > > contributions and dependencies are licensed under the Apache 2.0
> > > > license or under an Apache-compatible license.
> > > >
> > > > == Required Resources ==
> > > >
> > > > === Mailing Lists ===
> > > >
> > > > We currently use a mix of mailing lists. We will migrate our existing
> > > > mailing lists to the following:
> > > >
> > > > * dev@datasketches.incubator.apache.org
> > > >
> > > > * user@datasketches.incubator.apache.org
> > > >
> > > > * private@datasketches.incubator.apache.org
> > > >
> > > > * commits@datasketches.incubator.apache.org
> > > >
> > > > === Source Control ===
> > > >
> > > > The DataSketches team currently uses Git and would like to continue
> > > > to do so. We request a Git repository for DataSketches with mirroring
> > > > to GitHub enabled, similar to the following:
> > > >
> > > > * https://github.com/apache/incubator-datasketches.git
> > > >
> > > > === Issue Tracking ===
> > > >
> > > > We request the creation of an Apache-hosted JIRA. The DataSketches
> > > > project is currently using the public GitHub issue tracker and the
> > > > public Google Groups forum/sketches-user for issue tracking and
> > > > discussions. We will migrate and combine the content of these two
> > > > sources into the Apache JIRA.
> > > >
> > > > Proposed Jira ID: DATASKETCHES
> > > >
> > > > == Initial Committers ==
> > > >
> > > > The following individuals have been extremely active in our
> > > > community and should have write (commit) permissions to the
> > > > repository.
> > > >
> > > > * Eshcar Hillel [eshcar at verizonmedia dot com]
> > > >
> > > > * Kevin Lang [langk at verizonmedia dot com]
> > > >
> > > > * Roman Leventov [roman.leventov at c.metamarkets dot com]
> > > >
> > > > * Edo Liberty [libertye at amazon dot com]
> > > >
> > > > * Jon Malkin [jmalkin at verizonmedia dot com]
> > > >
> > > > * Lee Rhodes [lrhodes at verizonmedia dot com] & [leerho at gmail dot com]
> > > >
> > > > * Alexander Saydakov [saydakov at verizonmedia dot com]
> > > >
> > > > * Justin Thaler [justin.thaler at georgetown dot edu]
> > > >
> > > > == Affiliations ==
> > > >
> > > > The initial committers are from four organizations: Yahoo, Amazon,
> > > > Georgetown University, and Metamarkets/Snap.
> > > >
> > > > === Champion ===
> > > > (Recommended to me: )
> > > >
> > > > Liang Chen, Vice President of Apache CarbonData, [chenliang613 at
> > > > apache dot org]
> > > > Jean-Baptiste Onofré, [jb at nanthrax dot net]
> > > >
> > > > === Nominated Mentors ===
> > > > (Recommended to me: )
> > > >
> > > > Liang Chen, Vice President of Apache CarbonData, [chenliang613 at
> > apache
> > > > dot org]
> > > > Jean-Baptiste Onofré, jb at nanthrax dot net
> > > > Gil Yehuda, gyehuda at verizonmedia dot com
> > > >
> > > > === Sponsoring Entity ===
> > > >
> > > > * The Apache Incubator    **** This is our 1st choice ****
> > > >
> > > > * Apache Druid. The incubating Apache Druid project might also be a
> > logical
> > > > sponsor. However, DataSketches has applications in many areas of
> > computing
> > > > outside of Druid so our preference and recommendation is that
> > DataSketches
> > > > would ultimately be a top-level Apache project.
> > > >
> > > > ________________
> > > > [1] In 2017 Verizon acquired Yahoo and merged it with previously
> > > > acquired AOL. The merged entity was originally called Oath, Inc., but
> > > > has recently been renamed Verizon Media, Inc., a wholly-owned
> > > > subsidiary of Verizon, Inc. Since Yahoo is the more recognized name,
> > > > references in this document to Yahoo are also references to Verizon
> > > > Media, Inc.
> > > >
> > > > On Fri, Feb 22, 2019 at 9:35 PM Kenneth Knowles <ke...@apache.org>
> > wrote:
> > > >
> > > > > The subject line has me interested already. Follow examples like
> > > > > this maybe?
> > > > >
> > > > > 1. https://lists.apache.org/thread.html/a5db74cc9e5ae89b3bfa5f4b07bfcc18dae84b7098232fb897cd47b7@%3Cgeneral.incubator.apache.org%3E
> > > > > 2. https://lists.apache.org/thread.html/5a7f6a218b11a1cac61fbd53f4c995fd7716f8ad3751cf9f171ebd57@%3Cgeneral.incubator.apache.org%3E
> > > > >
> > > > > Kenn
> > > > >
> > > > > On Fri, Feb 22, 2019 at 8:05 PM leerho <le...@gmail.com> wrote:
> > > > >
> > > > > > I'll try again ... :)
> > > > > >
> > > > > > On Fri, Feb 22, 2019 at 8:00 PM Ted Dunning <ted.dunning@gmail.com> wrote:
> > > > > >
> > > > > >> It didn't make it again
> > > > > >>
> > > > > >> On Fri, Feb 22, 2019, 8:35 PM leerho <le...@gmail.com> wrote:
> > > > > >>
> > > > > >> > I'm not sure the attached document made it through.
> > > > > >> >
> > > > > >> > On Fri, Feb 22, 2019 at 7:28 PM leerho <le...@gmail.com>
> > wrote:
> > > > > >> >
> > > > > >> > >
> > > > > >> > >
> > > > > >> >
> > > > > >>
> > > > > >
> > > > > >
> > ---------------------------------------------------------------------
> > > > > > To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> > > > > > For additional commands, e-mail:
> general-help@incubator.apache.org
-- 
From my cell phone.

Re: DataSketches Proposal

Posted by Kenneth Knowles <ke...@apache.org>.

Can you share the Google doc with the proposal? Per Ted's advice, we can
iterate quickly there and move it to the wiki when it becomes a bit more
stable.

Kenn

On Fri, Feb 22, 2019 at 10:21 PM leerho@gmail.com <le...@gmail.com> wrote:

> Thanks for the offer.  I am a neophyte at this process and this email app!
> I could use a lot of help getting this off the ground!  Also, I'm not sure
> that Mr. Chen and Mr. Onofré have fully accepted taking this on :)
>
> Lee.
>
> On 2019/02/23 06:03:58, Kenneth Knowles <ke...@apache.org> wrote:
> > Nice.
> >
> > I would very much like to help mentor this project, though you already
> > have a couple good ones.
> >
> > I concur with incubator as sponsoring entity.
> >
> > Kenn (VP Apache Beam)
> >
> > On Fri, Feb 22, 2019 at 9:45 PM leerho <le...@gmail.com> wrote:
> >
> > > I didn't realize that this mail list does not accept PDF files,
> > > apparently only text.  So let me try one more time ... :)  Please let
> > > me know if this works!
> > >
> > >
> > > = Apache DataSketches Proposal[1] =
> > >
> > > == Abstract ==
> > >
> > > DataSketches.GitHub.io is an open source, high-performance library of
> > > stochastic streaming algorithms commonly called "sketches" in the data
> > > sciences. Sketches are small, stateful programs that process massive
> data
> > > as a stream and can provide approximate answers, with mathematical
> > > guarantees, to computationally difficult queries orders-of-magnitude
> faster
> > > than traditional, exact methods.
> > >
> > > This proposal is to move DataSketches to the Apache Software
> > > Foundation (ASF), transferring ownership of its copyright intellectual
> > > property to the ASF.  Thereafter, DataSketches would be officially known
> > > as Apache DataSketches and its evolution and governance would come under
> > > the rules and guidance of the ASF.
> > >
> > > == Introduction ==
> > >
> > > The DataSketches library contains carefully crafted implementations of
> > > sketch algorithms that meet rigorous standards of quality and
> performance
> > > and provide capabilities required for large-scale production systems
> that
> > > must process and analyze massive data. The DataSketches core
> repository is
> > > written in Java with a parallel core repository written in C++ that
> > > includes Python wrappers. The DataSketches library also includes
> special
> > > repositories for extending the core library for Apache Hive and Apache
> Pig.
> > > The sketches developed in the different languages share a common binary
> > > storage format so that sketches created and stored in Java, for example,
> > > can be fully used in C++, and vice versa.  Because the stored sketch
> > > "images" are just a "blob" of bytes (similar to picture images), they
> > > can be shared across many different systems, languages and platforms.
> > >
> > > The DataSketches documentation website, https://datasketches.github.io,
> > > includes general tutorials, a comprehensive research section with
> > > references to relevant academic papers, extensive examples for using the
> > > core library directly as well as examples for accessing the library in
> > > Hive, Pig, and Apache Spark.
> > >
> > > The DataSketches library also includes a characterization repository
> for
> > > long running test programs that are used for studying accuracy and
> > > performance of these sketches over wide ranges of input variables. The
> data
> > > produced by these programs is used for generating the many performance
> > > plots contained in the documentation website and for academic
> > > publications.
> > >
> > > The code repositories used for production are versioned and published to
> > > Maven Central at periodic intervals as the library evolves.
> > >
> > > The DataSketches library also includes several experimental
> repositories
> > > for use-cases outside the large-scale systems environments, such as
> > > sketches for mobile, IoT devices (Android), command-line access of the
> > > sketch library, and an experimental repository for vector-based
> sketches
> > > that performs approximate Singular Value Decomposition (SVD) analysis
> that
> > > could potentially be used in Machine Learning (ML) applications.
> > >
> > > == Background ==
> > >
> > > The DataSketches library was started in 2012 as an internal Yahoo
> > > project to dramatically reduce the time and resources required for
> > > distinct (unique) counting.  An extensive search on the Internet at the
> > > time yielded a number of theoretical papers on stochastic streaming
> > > algorithms with pseudocode examples, but we did not find any usable
> > > open-source code of the quality we felt we needed for our internal
> > > production systems.  So we started a small project (one person) to
> > > develop our own sketches working directly from published theoretical
> > > papers.
> > >
> > > The DataSketches library was designed from the start with the
> objective of
> > > making these algorithms, usually only described in theoretical papers,
> > > easily accessible to systems developers for use in our internal
> production
> > > systems. By necessity, the code had to be of the highest quality and
> > > thoroughly tested. The wide variety of our internal production systems
> > > drove the requirement that the sketch implementations had to have an
> > > absolute minimum of external, run-time dependencies in order to
> simplify
> > > integration and troubleshooting.
> > >
> > > Our internal experiments demonstrated dramatic positive impact on the
> > > performance of our systems.  As a result, the DataSketches library
> > > quickly evolved to include different types of sketches for different
> > > types of queries, such as frequent-items (a.k.a. heavy-hitters)
> > > algorithms, quantile/histogram algorithms, and weighted and unweighted
> > > sampling algorithms.
> > >
> > > We quickly discovered that developing these sketch algorithms to be
> > > truly robust in production environments is quite difficult and requires
> > > deep understanding of the underlying mathematics and statistics as well
> > > as extensive experience in developing high quality code for 24/7
> > > production systems. This is a difficult combination of skills for any
> > > one organization to collect and maintain over time. It became clear that
> > > this technology needed a community larger than Yahoo to evolve.  In
> > > November 2015, this factor, along with Yahoo’s strong experience and
> > > support of open source, led to the decision to open source this
> > > technology under an Apache 2.0 license on GitHub. Since that time our
> > > community has expanded considerably, and the key contributors to this
> > > effort include leading research scientists from a number of universities
> > > as well as practitioners and researchers from a number of major
> > > corporations. The core of this group is very active: we meet weekly to
> > > discuss research directions and engineering priorities.
> > >
> > > It is important to note that our internal systems at Yahoo use the
> current
> > > public GitHub open source DataSketches library and not an internal
> version
> > > of the code.
> > >
> > > The close collaboration of scientific research and engineering
> development
> > > experience with actual massive-data processing systems has also
> produced
> > > new research publications in the field of stochastic streaming
> algorithms,
> > > for example:
> > >
> > > * Daniel Anderson, Pryce Bevan, Kevin J. Lang, Edo Liberty, Lee Rhodes,
> > > and Justin Thaler. A high-performance algorithm for identifying frequent
> > > items in data streams. In ACM IMC 2017.
> > >
> > > * Anirban Dasgupta, Kevin J. Lang, Lee Rhodes, and Justin Thaler. A
> > > framework for estimating stream expression cardinalities. In EDBT/ICDT
> > > Proceedings '16, pages 6:1–6:17, 2016.
> > >
> > > * Mina Ghashami, Edo Liberty, and Jeff M. Phillips. Efficient Frequent
> > > Directions Algorithm for Sparse Matrices. In ACM SIGKDD Proceedings '16,
> > > pages 845–854, 2016.
> > >
> > > * Zohar S. Karnin, Kevin J. Lang, and Edo Liberty. Optimal quantile
> > > approximation in streams. In IEEE FOCS Proceedings '16, pages 71–78,
> > > 2016.
> > >
> > > * Kevin J. Lang. Back to the future: an even more nearly optimal
> > > cardinality estimation algorithm. arXiv preprint
> > > https://arxiv.org/abs/1708.06839, 2017.
> > >
> > > * Edo Liberty. Simple and deterministic matrix sketching. In ACM KDD
> > > Proceedings '13, pages 581–588, 2013.
> > >
> > > * Edo Liberty, Michael Mitzenmacher, Justin Thaler, and Jonathan Ullman.
> > > Space lower bounds for itemset frequency sketches. In ACM PODS
> > > Proceedings '16, pages 441–454, 2016.
> > >
> > > * Michael Mitzenmacher, Thomas Steinke, and Justin Thaler. Hierarchical
> > > heavy hitters with the space saving algorithm. In SIAM ALENEX
> > > Proceedings '12, pages 160–174, 2012.
> > >
> > > == The Rationale for Sketches ==
> > >
> > > In the analysis of big data there are often problem queries that don’t
> > > scale because they require huge compute resources and time to generate
> > > exact results. Examples include count distinct, quantiles, most
> frequent
> > > items, joins, matrix computations, and graph analysis.
> > >
> > > If we can loosen the requirement of “exact” results from our queries
> and be
> > > satisfied with approximate results, within some well understood bounds
> of
> > > error, there is an entire branch of mathematics and data science that
> has
> > > evolved around developing algorithms that can produce approximate
> results
> > > with mathematically well-defined error properties.
> > >
> > > Adding the requirements that these algorithms must be small (compared
> > > to the size of the input data), sublinear (the size of the sketch must
> > > grow at a slower rate than the size of the input stream), streaming
> > > (they can only touch each data item once), and mergeable (suitable for
> > > distributed processing) defines a class of algorithms that can be
> > > described as small, stochastic, streaming, sublinear, mergeable
> > > algorithms, commonly called sketches (they also have other names, but we
> > > will use the term sketches from here on).
> > >
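The four properties listed above can be made concrete with a toy example. What follows is a minimal Python sketch of a K-Minimum-Values (KMV) distinct counter, a well-known technique chosen here purely for illustration; it is not the DataSketches implementation and does not use its API. The names `hash01` and `ToyKmvSketch`, the SHA-256 hash, and k = 256 are all assumptions of this example.

```python
import hashlib
import heapq


def hash01(item):
    """Map an item to a roughly uniform float in [0, 1] via a stable hash."""
    digest = hashlib.sha256(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2.0**64


class ToyKmvSketch:
    """Toy K-Minimum-Values distinct-count sketch (illustration only)."""

    def __init__(self, k=256):
        self.k = k
        self._heap = []        # max-heap (negated values) of the k smallest hashes
        self._members = set()  # same values, for duplicate detection

    def update(self, item):
        # Streaming: each item is hashed once and then discarded.
        self._offer(hash01(item))

    def _offer(self, h):
        if h in self._members:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # Small: never retain more than k values, whatever the stream length.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def merge(self, other):
        # Mergeable: keep the k smallest hashes across both sketches.
        for negated in other._heap:
            self._offer(-negated)

    def estimate(self):
        if len(self._heap) < self.k:
            return float(len(self._heap))  # exact while under capacity
        kth_smallest = -self._heap[0]
        return (self.k - 1) / kth_smallest


# Two sketches built on overlapping partitions, plus one built in a single pass.
a, b, single = ToyKmvSketch(), ToyKmvSketch(), ToyKmvSketch()
for i in range(12000):
    a.update(i)
for i in range(8000, 20000):
    b.update(i)
for i in range(20000):
    single.update(i)
a.merge(b)
print(round(a.estimate()), round(single.estimate()))
```

Because the merged sketch retains exactly the k smallest hash values of the combined stream, it ends up in the same state as a sketch that saw all of the items in one pass, which is what makes this kind of sketch suitable for distributed processing.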
> > > To be truly streaming and be able to process data in a single pass,
> > > sketches must make absolute minimum assumptions about the input stream.
> > > This is critically important, as there is no “second chance” to
> process the
> > > data.
> > >
> > > For example, sketches should not make assumptions about the order of
> stream
> > > items, the stream length, the dynamic range of values, or the
> distribution
> > > of item occurrence frequencies. Sketches should be tolerant of NaNs,
> Nulls
> > > and empty objects. About the only thing that the sketch needs to know
> about
> > > the stream is how to extract items from it and what type the item is,
> e.g.,
> > > is it a numeric value or a string.
> > >
> > > As far as the sketch is concerned, the input stream is a sequence of
> items
> > > in some unknown random order with unknown random values.
> > >
> > > The sketch is essentially a complex state machine that, combined with
> > > the random input stream, defines a stochastic process. We then apply
> > > probabilistic methods to interpret the states of the stochastic process
> > > in order to extract useful information about the input stream itself.
> > > The resulting information will be approximate, but we also use
> > > additional probabilistic methods to extract an estimate of the likely
> > > probability distribution of error.
> > >
> > > There is a significant scientific contribution here in defining the
> > > state machine, understanding the resulting stochastic process,
> > > developing the probabilistic methods, and proving mathematically that it
> > > all works! This is why the scientific contributors to this project are a
> > > critical and strategic component of our success.  The development
> > > engineers translate the concepts of the proposed state machine and
> > > probabilistic methods into production-quality code. Even more important,
> > > they work closely with the scientists, feeding back system and user
> > > requirements, which leads not only to superior product design, but to
> > > new science as well.  A number of the scientific papers our members have
> > > published (see above) are a direct result of this close collaboration.
> > >
> > > Because sketches are small they can be processed extremely fast, often
> many
> > > orders-of-magnitude faster than traditional exact computations. For
> > > interactive queries there may not be other viable alternatives, and in
> the
> > > case of real-time analysis, sketches are the only known solution.
> > >
> > > For any system that needs to extract useful information from massive
> > > data, sketches are essential tools that should be tightly integrated
> > > into the system’s analysis capabilities. This technology has helped
> > > Yahoo successfully reduce data processing times from days to hours or
> > > minutes on a number of its internal platforms and has enabled subsecond
> > > queries on real-time platforms that would have been infeasible without
> > > sketches.
> > >
> > > == The Rationale for Apache DataSketches ==
> > >
> > > Other open source implementations of sketch algorithms can be found on
> > > the Internet. However, we have not yet found any open source
> > > implementations that are as comprehensive, engineered with the quality
> > > required for production systems, and with usable and guaranteed error
> > > properties. Large Internet companies, such as Google and Facebook, have
> > > published papers on sketching; however, their implementations of their
> > > published algorithms are proprietary and not available as open source.
> > >
> > > The DataSketches library already provides integrations with a number of
> > > major Apache data processing platforms such as Apache Hive, Apache Pig,
> > > Apache Spark and Apache Druid, and is also integrated with a number of
> > > other open source data processing platforms such as Splice Machine,
> GCHQ
> > > Gaffer and PostgreSQL.
> > >
> > > We believe that having DataSketches as an Apache project will provide an
> > > immediate, worthwhile, and substantial contribution to the open source
> > > community, will give the project a better opportunity to contribute
> > > meaningfully to both the science and engineering of sketching
> > > algorithms, and will help it integrate with other Apache projects.  In
> > > addition, this is a significant opportunity for Apache to be the "go-to"
> > > destination for users that want to leverage this exciting technology.
> > >
> > > == Initial Goals ==
> > >
> > > We are breaking our initial goals into short-term (2-6 months) and
> > > intermediate to long-term (6 months to 2 years):
> > >
> > > Our short-term goals include:
> > >
> > > * Understanding and adapting to the Apache development process and
> > > structures.
> > >
> > > * Start refactoring the codebase and move the code of the various
> > > DataSketches repositories to an Apache Git repository.
> > >
> > > * Continue development of new features, functions, and fixes.
> > >
> > > * Specific sub-projects (e.g., C++ and Python) will continue to be
> > > developed and expanded.
> > >
> > >
> > > The intermediate to long term goals include:
> > >
> > > * Completing the design and implementation of the C++ sketches to
> > > complement what is already available in Java, and the Python wrappers
> of
> > > those C++ sketches.
> > >
> > > * Expanding the C++ build framework to include Windows and the popular
> > > Linux variants.
> > >
> > > * Continued engagement with the scientific research community on the
> > > development of new algorithms for computationally difficult problems
> that
> > > heretofore have not had a sketching solution.
> > >
> > > == Current Status ==
> > >
> > > The DataSketches GitHub project has been quite successful.  As of this
> > > writing (February 2019) the number of downloads measured by the Nexus
> > > Repository Manager at https://oss.sonatype.org has grown by nearly a
> > > factor of 10 over the past year to about 55 thousand per month. The
> > > DataSketches/sketches-core repository has about 560 stars and 141 forks,
> > > which is pretty good for a highly specialized library.
> > >
> > > === Development Practices ===
> > >
> > > ==== Source Control ====
> > >
> > > All of our developers have extensive experience with Git version
> control
> > > and follow accepted practices for use of Pull Requests (PRs), code
> reviews
> > > and commits to master, for example.
> > >
> > > ==== Testing ====
> > >
> > > Sketches, by their nature, are probabilistic programs and don’t
> > > necessarily behave deterministically.  For some of the sketches we
> > > intentionally insert random noise into the code as this gives us the
> > > mathematical properties that we need to guarantee accuracy.  This can
> > > make the behavior of these algorithms quite unintuitive and provides
> > > significant challenges to the developer who wishes to test these
> > > algorithms for correctness. As a result, our testing strategy includes
> > > two major components: unit tests and characterization tests.
> > >
> > > ===== Unit Testing =====
> > >
> > > Our unit tests are primarily quick tests to make sure that we exercise
> all
> > > critical paths in the code and that key branches are executed
> correctly. It
> > > is important that they execute relatively fast as they are generally
> run on
> > > every code build. The sketches-core repository alone has about 22
> thousand
> > > statements, over 1300 unit tests and code coverage of about 98.2% as
> > > measured by Atlassian/Clover.  It is our goal for all of our code
> > > repositories that are used in production that they have code coverage
> > > greater than 90%.
> > >
> > > ===== Characterization Testing =====
> > >
> > > In order to test the probabilistic methods that are used to interpret
> the
> > > stochastic behaviors of our sketches we have a separate
> characterization
> > > repository that is dedicated to this.  To measure accuracy, for
> example,
> > > requires running thousands of trials at each of many different points
> along
> > > the domain axis. Each trial compares its estimated results against a
> known
> > > exact result producing an error for that trial.  These error
> measurements
> > > are then fed into our Quantiles sketch to capture the actual
> distribution
> > > of error at that point along the axis. We then select quantile contours
> > > across all the distributions at points along the axis.  These contours
> can
> > > then be plotted to reveal the shape of the actual error distribution.
> These
> > > distributions are not at all Gaussian, in fact they can be quite
> complex.
> > > Nonetheless, these distributions are then checked against our
> statistical
> > > guarantees inherent to the specific sketch algorithm and its
> parameters.
> > > There are many examples of these characterization error distributions
> on
> > > our website. The runtimes of these tests can be very long and can range
> > > from many minutes to hours, and some can run for days.  Currently, we
> have
> > > separate characterization repositories for Java and C++ / Python.
> > >
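The trial procedure described above can be outlined in a few lines of Python. This is a simplified sketch under stated assumptions, not the project's characterization code: `toy_kmv_estimate` is a hypothetical stand-in estimator that simulates hash values with a seeded random generator, and an exact sort takes the place of the Quantiles sketch that the proposal says is used to capture the error distribution. The stream lengths, trial count, and quantile fractions are arbitrary choices for the example.

```python
import heapq
import random


def toy_kmv_estimate(n, k, rng):
    """Estimate n from the k-th smallest of n simulated uniform hash values."""
    if n < k:
        return float(n)
    kth_smallest = heapq.nsmallest(k, (rng.random() for _ in range(n)))[-1]
    return (k - 1) / kth_smallest


def quantile(sorted_values, fraction):
    """Nearest-rank quantile of an already sorted list."""
    index = min(len(sorted_values) - 1, int(fraction * len(sorted_values)))
    return sorted_values[index]


def characterize(stream_lengths, trials, k, seed=1):
    """Return {n: (q05, q50, q95)} of the relative error at each stream length."""
    rng = random.Random(seed)
    contours = {}
    for n in stream_lengths:
        # Each trial compares an estimate against the known exact answer n.
        errors = sorted(
            toy_kmv_estimate(n, k, rng) / n - 1.0 for _ in range(trials)
        )
        contours[n] = tuple(quantile(errors, f) for f in (0.05, 0.5, 0.95))
    return contours


contours = characterize(stream_lengths=[500, 2000, 8000], trials=100, k=64)
for n, (q05, q50, q95) in contours.items():
    print(f"n={n:5d}  q05={q05:+.3f}  q50={q50:+.3f}  q95={q95:+.3f}")
```

Plotting the three values per stream length against n gives the kind of quantile contours the proposal describes; a real characterization run would use far more trials and many more points along the axis.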
> > > It is our goal that we perform this characterization analysis for all
> of
> > > our sketches.  By definition, the code that runs these characterization
> > > tests is open-source so others can run these tests as well.  We do not
> have
> > > formal releases of this code (because it is not production code) and
> it is
> > > not published to Maven Central.
> > >
> > > === Meritocracy ===
> > >
> > > DataSketches was initially developed based on requirements within
> Yahoo. As
> > > a project on GitHub, DataSketches has received contributions from
> numerous
> > > individual developers from around the world, dedicated research work
> from
> > > senior scientists at Amazon and Visa, and academic researchers from
> > > Georgetown University, Princeton, and MIT.
> > >
> > > As a project under incubation, we are committed to expanding our
> effort to
> > > build an environment which supports a meritocracy. We are focused on
> > > engaging the community and other related projects for support and
> > > contributions. Moreover, we are committed to ensure contributors and
> > > committers to DataSketches come from a broad mix of organizations
> through a
> > > merit-based decision process during incubation. We believe strongly in
> the
> > > DataSketches premise that fulfills the concept of a well engineered and
> > > scientifically rigorous library that implements these powerful
> algorithms
> > > and are committed to growing an inclusive community of DataSketches
> > > contributors and users.
> > >
> > > === Community ===
> > >
> > > Yahoo has a long history and active engagement in the Open Source
> > > community. Major projects include: Vespa.ai, Bullet, Moloch, Panoptes,
> > > Screwdriver.cd, Athenz, HaloDB, Maha, Mendel, TensorFlowOnSpark,
> gifshot,
> > > fluxible, as well as the creation, contribution and incubation of many
> > > Apache projects such as Apache Hadoop, Pig, Bookkeeper, Oozie,
> Zookeeper,
> > > Omid, Pulsar, Traffic Server, Storm, Druid, and many more.
> > >
> > > Every day, DataSketches is actively used by organizations and
> > > institutions around the world for batch and stream processing of data.
> > > We believe acceptance will allow us to consolidate existing
> > > DataSketches-related work, grow the DataSketches community, and deepen
> > > connections between DataSketches and other open source projects.
> > >
> > > === Introduction to the Core Developers & Contributors ===
> > >
> > > The core developers and contributors for DataSketches are from diverse
> > > backgrounds, but primarily are scientists that love engineering and
> > > engineers that love science. A large part of the value we bring comes
> from
> > > this synthesis.  These individuals have already contributed
> substantially
> > > to the code, algorithms, and/or mathematical proofs that form the
> basis of
> > > the library.
> > >
> > > This core group also forms the Initial Committers with write permissions
> > > to the repository. Those marked with (*) meet weekly to plan the
> > > research and engineering direction of the project.
> > >
> > > ==== Scientists That Love Engineering ====
> > >
> > > * Eshcar Hillel: Senior Research Scientist, Yahoo Labs, Israel.
> > > Interests: distributed systems, scalable systems and platforms for big
> > > data processing, concurrent algorithms and data structures.
> > >
> > > * Kevin Lang: (*) Distinguished Research Scientist, Yahoo Labs,
> Sunnyvale,
> > > California. Interests: algorithms, theoretical and applied mathematics,
> > > encoding and compression theory, theoretical and applied performance
> > > optimization.
> > >
> > > * Edo Liberty: (*) Director of Research, Head of Amazon AI Labs, Palo
> Alto,
> > > California. Manages the algorithms group at Amazon AI. We build
> scalable
> > > machine learning systems and algorithms which are used both internally
> and
> > > externally by customers of SageMaker, AWS's flagship machine learning
> > > platform.
> > >
> > > * Jon Malkin: (*) Senior Scientist, Yahoo Labs, Sunnyvale. Interests:
> > > computational advertising, machine learning, speech recognition,
> > > data-driven analysis, large scale experimentation, big data,
> > > stream/complex event processing.
> > >
> > > * Justin Thaler: (*) Assistant Professor, Department of Computer Science,
> > > Georgetown University, Washington D.C. Interests: algorithms and
> > > computational complexity, complexity theory, quantum algorithms, private
> > > data analysis, learning theory, and developing efficient streaming and
> > > sketching algorithms.
> > >
> > > ==== Engineers That Love Science ====
> > >
> > > * Roman Leventov: Senior Software Engineer,  Metamarkets / Snap.
> Interests:
> > > design and implementation of data storing and data processing
> (distributed)
> > > systems, performance optimization, CPU performance, mechanical
> sympathy,
> > > JVM performance, API design, databases, (concurrent) data structures,
> > > memory management, garbage collection algorithms, language design and
> > > runtimes (their tradeoffs), distributed systems (cloud) efficiency,
> Linux,
> > > code quality, code transformation, pure functional programming models,
> > > Haskell.
> > >
> > > * Lee Rhodes: (*) Distinguished Architect, lead developer and founder
> of
> > > the DataSketches project, Yahoo, Sunnyvale, California.  Interests:
> > > streaming algorithms, mathematics, computer science, high quality and
> high
> > > performance code for the analysis of massive data, bridging the divide
> > > between theory and practice.
> > >
> > > * Alexander Saydakov: (*) Senior Software Engineer, Yahoo, Sunnyvale,
> > > California. Interests: applied mathematics, computer science, big data,
> > > distributed systems.
> > >
> > > === Introduction to Additional Interested Contributors ===
> > >
> > > These folks have been intermittently involved and contributed, but are
> > > strong supporters of this project.
> > >
> > > * Frank Grimes: GitHub ID: frankgrimes97
> > >
> > > * Mina Ghashami: [mina.ghashami at gmail dot com] Ph.D. Computer
> Science,
> > > Univ of Utah. Interests: Machine Learning, Data Mining, matrix
> > > approximation, streaming algorithms, randomized linear algebra.
> > >
> > > * Christopher Musco: [christopher.musco at gmail dot com] Ph.D.
> Computer
> > > Science, Research Instructor, Princeton University. Interests:
> algorithmic
> > > foundations of data science and machine learning, efficient methods for
> > > processing and understanding large datasets, often working at the
> > > intersection of theoretical computer science, numerical linear
> algebra, and
> > > optimization.
> > >
> > > * Graham Cormode: [g.cormode at warwick.ac dot uk] Ph.D. Computer
> Science,
> > > Professor, Warwick University, Warwick, England. Interests: all
> aspects of
> > > the "data lifecycle", from data collection and cleaning, through
> mining and
> > > analytics. (Professor Cormode is one of the world’s leading scientists
> in
> > > sketching algorithms)
> > >
> > > === Alignment ===
> > >
> > > The DataSketches library already provides integrations and example
> code for
> > > Apache Hive, Apache Pig, Apache Spark and is deeply integrated into
> Apache
> > > Druid.
> > >
> > > == Known Risks ==
> > >
> > > The following subsections are specific risks that have been identified
> by
> > > the ASF that need to be addressed.
> > >
> > > === Risk: Orphaned Products ===
> > >
> > > The DataSketches library is presently used by a number of
> organizations,
> > > from small startups to Fortune 100 companies, to construct production
> > > pipelines that must process and analyze massive data. Yahoo has a
> long-term
> > > commitment to continue to advance the DataSketches library; moreover,
> > > DataSketches is seeing increasing interest, development, and adoption
> from
> > > many diverse organizations from around the world. Due to its growing
> > > adoption, we feel it is quite unlikely that this project would become
> > > orphaned.
> > >
> > > === Risk: Inexperience with Open Source ===
> > >
> > > Yahoo believes strongly in open source and the exchange of information
> to
> > > advance new ideas and work. Examples of this commitment are active open
> > > source projects such as those mentioned above. With DataSketches, we
> have
> > > been increasingly open and forward-looking; we have published a number
> of
> > > papers about breakthrough developments in the science of streaming
> > > algorithms (mentioned above) that also reference the DataSketches
> library.
> > > Our submission to the Apache Software Foundation is a logical
> extension of
> > > our commitment to open source software.
> > >
> > > Key committers at Yahoo with strong open source backgrounds include
> Aaron
> > > Gresch, Alan Carroll, Alessandro Bellina, Anastasia Braginsky, Andrews
> > > Sahaya Albert, Arun S A G, Atul Mohan, Brad McMillen, Bryan Call, Daryn
> > > Sharp, Dav Glass, David Carlin, Derek Dagit, Eric Payne, Eshcar Hillel,
> > > Ethan Li, Fei Deng, Francis Christopher Liu, Francisco Perez-Sorrosal,
> > > Gil Yehuda, Govind Menon, Hang Yang, Jacob Estelle, Jai Asher, James
> > > Penick,
> > > Jason Kenny, Jay Pipes, Jim Rollenhagen, Joe Francis, Jon Eagles,
> Kihwal
> > > Lee, Kishorkumar Patil, Koji Noguchi, Kuhu Shukla, Michael Trelinski,
> > > Mithun Radhakrishnan, Nathan Roberts, Ohad Shacham, Olga L. Natkovich,
> > > Parth Kamlesh Gandhi, Rajan Dhabalia, Rohini Palaniswamy, Ruby Loo,
> Ryan
> > > Bridges, Sanket Chintapalli, Satish Subhashrao Saley, Shu Kit Chan, Sri
> > > Harsha Mekala, Susan Hinrichs, Yonatan Gottesman, and many more.
> > >
> > > All of our core developers are committed to learning about the Apache
> > > process and to giving back to the community.
> > >
> > > === Risk: Homogeneous Developers ===
> > >
> > > The majority of committers in this proposal belong to Yahoo because
> > > DataSketches emerged from an internal Yahoo project. This proposal also
> > > includes developers and contributors from other companies who are also
> > > actively involved with other Apache projects, such as Druid.  We expect
> > > our entry into incubation will allow us to expand the number of
> > > individuals and organizations participating in DataSketches development.
> > >
> > > === Risk: Reliance on Salaried Developers ===
> > >
> > > Because the DataSketches library originated within Yahoo, it has been
> > > developed primarily by salaried Yahoo developers and we expect that to
> > > continue to be the case near term. However, since we placed this
> library
> > > into open-source we have had a number of significant contributions from
> > > engineers and scientists from outside of Yahoo. We expect our reliance
> on
> > > Yahoo salaried developers will decrease over time. Nonetheless, Yahoo
> is
> > > committed to continue its strong support of this important project.
> > >
> > > === Risk: Lack of Relationship to other Apache Products ===
> > >
> > > DataSketches already directly interoperates with or utilizes several
> > > existing Apache projects.
> > >
> > > * Build
> > >    * Apache Maven
> > >
> > > * Integrations and adaptors for the following projects naturally have
> them
> > > as dependencies
> > >    * Apache Hive
> > >    * Apache Pig
> > >    * Apache Druid
> > >    * Apache Spark
> > >
> > > * Additional dependencies for the above integrations and adaptors
> include
> > >    * Apache Hadoop
> > >    * Apache Commons (Math)
> > >
> > > There is no other Apache project that we are aware of that duplicates
> the
> > > functionality of the DataSketches library.
> > >
> > > === Risk: An Excessive Fascination with the Apache Brand ===
> > >
> > > With this proposal we are not seeking attention or publicity. Rather,
> we
> > > firmly believe in the DataSketches library and concept and the ability
> to
> > > make the DataSketches library a powerful, yet simple-to-use toolkit for
> > > data processing. While the DataSketches library has been open source,
> we
> > > believe putting code on GitHub can only go so far. We see the Apache
> > > community, processes, and mission as critical for ensuring the
> DataSketches
> > > library is truly community-driven, positively impactful, and innovative
> > > open source software. While Yahoo has taken a number of steps to
> advance
> > > its various open source projects, we believe the DataSketches library
> > > project is a great fit for the Apache Software Foundation due to its
> focus
> > > on data processing and its relationships to existing ASF projects.
> > >
> > > === Risk: Cryptography ===
> > >
> > > DataSketches does not contain any cryptographic code and is not a
> > > cryptographic product.
> > >
> > > == Documentation ==
> > >
> > > The following documentation is relevant to this proposal. Relevant
> portions
> > > of the documentation will be contributed to the Apache DataSketches
> > > project.
> > >
> > > * DataSketches website: https://datasketches.github.io.
> > >
> > > * DataSketches website repository:
> > > https://github.com/DataSketches/DataSketches.github.io
> > >
> > > We will need an Apache website for this documentation, similar to:
> > >
> > > * https://datasketches.apache.org
> > >
> > > == Initial Source ==
> > >
> > > The initial source for DataSketches which we will submit to the Apache
> > > Foundation will include a number of repositories which are currently
> hosted
> > > under the GitHub.com/datasketches organization:
> > >
> > > All github.com/datasketches repositories including:
> > >
> > > * Java
> > >    * sketches-core: This repository has the core sketching classes,
> which
> > > are leveraged by some of the other repositories. This repository has no
> > > external dependencies outside of the DataSketches/memory repository,
> Java
> > > and TestNG for unit tests. This code is versioned and the latest
> release
> > > can be obtained from Maven Central.
> > >    * memory: Low level, high-performance memory data-structure
> management
> > > primarily for off-heap.
> > >    * sketches-android: This is a new repository dedicated to sketches
> > > designed to be run in a mobile client, such as a cell phone. It is
> still in
> > > development and should be considered experimental.
> > >    * sketches-hive: This repository contains Hive UDFs and UDAFs for
> use
> > > within Hadoop grid environments. This code has dependencies on
> > > sketches-core as well as Hadoop and Hive. Users of this code are
> advised to
> > > use Maven to bring in all the required dependencies. This code is
> versioned
> > > and the latest release can be obtained from Maven Central.
> > >    * sketches-pig: This repository contains Pig User Defined Functions
> > > (UDF) for use within Hadoop grid environments. This code has
> dependencies
> > > on sketches-core as well as Hadoop and Pig. Users of this code are
> advised
> > > to use Maven to bring in all the required dependencies. This code is
> > > versioned and the latest release can be obtained from Maven Central.
> > >    * sketches-vector: This is a new repository dedicated to sketches
> for
> > > vector and matrix operations. It is still somewhat experimental.
> > >    * characterization: This relatively new repository is for code that
> we
> > > use to characterize the accuracy and speed performance of the sketches
> in
> > > the library and is constantly being updated. Examples of the job
> command
> > > files used for various tests can be found in the src/main/resources
> > > directory. Some of these tests can run for hours depending on their
> > > configuration.
> > >    * experimental: This repository is an experimental staging area for
> code
> > > that will eventually end up in another repository. This code is not
> > > versioned and not registered with Maven Central.
> > >    * sketches-misc: Demos and other code not related to production
> > > deployment
> > >
> > > * C++ and Python
> > >    * sketches-core-cpp: This is the C++/Python companion to the Java
> > > sketches-core. These implementations are binary compatible with their
> > > counterparts in Java. In other words, a sketch created and stored in C++
> > > can be opened and read in Java and vice versa. This repository also has
> > > our Python adaptors that wrap the C++ implementations, making the
> > > high-performance C++ implementations available from Python.
> > >    * sketches-postgres: This repository provides the PostgreSQL-specific
> > > adaptors that wrap the C++ implementations, making them available to
> > > PostgreSQL database users.
> > >    * characterization-cpp: This is the C++/Python companion to the Java
> > > characterization repository.
> > >    * experimental-cpp: This repository is an experimental staging area
> for
> > > C++ code that will eventually end up in another repository.
> > >
> > > * Command-Line Tools
> > >    * sketches-cmd
> > >    * homebrew-sketches
> > >    * homebrew-sketches-cmd
> > >
> > > These projects have always been Apache 2.0 licensed. We intend to
> bundle
> > > all of these repositories since they are all complementary and should
> be
> > > maintained in one project. Prior to our submission, we will combine
> all of
> > > these projects into a new git repository.
> > >
> > > == Source and Intellectual Property Submission Plan ==
> > >
> > > Contributors to the DataSketches project have also signed the Yahoo
> > > Individual Contributor License Agreement
> > > (https://yahoocla.herokuapp.com/) in order to contribute to the project.
> > >
> > > With respect to trademark rights, Yahoo does not hold a trademark on
> the
> > > phrase “DataSketches.” Based on feedback and guidance we receive
> during the
> > > incubation process, we are open to renaming the project if necessary
> for
> > > trademark or other concerns, but we would prefer not to have to do
> that.
> > >
> > > == External Dependencies ==
> > >
> > > All external dependencies are licensed under an Apache 2.0 or
> > > Apache-compatible license. As we grow the DataSketches community we will
> > > configure our build process to require and validate that all
> > > contributions and dependencies are licensed under the Apache 2.0 license
> > > or an Apache-compatible license.
> > >
> > > == Required Resources ==
> > >
> > > === Mailing Lists ===
> > >
> > > We currently use a mix of mailing lists. We will migrate our existing
> > > mailing lists to the following:
> > >
> > > * dev@datasketches.incubator.apache.org
> > >
> > > * user@datasketches.incubator.apache.org
> > >
> > > * private@datasketches.incubator.apache.org
> > >
> > > * commits@datasketches.incubator.apache.org
> > >
> > > === Source Control ===
> > >
> > > The DataSketches team currently uses Git and would like to continue to do
> > > so. We request a Git repository for DataSketches with mirroring to GitHub
> > > enabled, similar to the following:
> > >
> > > * https://github.com/apache/incubator-datasketches.git
> > >
> > > === Issue Tracking ===
> > >
> > > We request the creation of an Apache-hosted JIRA. The DataSketches project
> > > is currently using the public GitHub issue tracker and the public Google
> > > Groups forum/sketches-user for issue tracking and discussions. We will
> > > migrate and combine the content of these two sources into the Apache JIRA.
> > >
> > > Proposed Jira ID: DATASKETCHES
> > >
> > > == Initial Committers ==
> > >
> > > The following individuals have been extremely active in our
> > > community and should have write (commit) permissions to the repository.
> > >
> > > * Eshcar Hillel                      [eshcar at verizonmedia dot com]
> > >
> > > * Kevin Lang                    [langk at verizonmedia dot com]
> > >
> > > * Roman Leventov              [roman.leventov at c.metamarkets dot com]
> > >
> > > * Edo Liberty                   [libertye at amazon dot com]
> > >
> > > * Jon Malkin                    [jmalkin at verizonmedia dot com]
> > >
> > > * Lee Rhodes                  [lrhodes at verizonmedia dot com] & [leerho at gmail dot com]
> > >
> > > * Alexander Saydakov         [saydakov at verizonmedia dot com]
> > >
> > > * Justin Thaler                 [justin.thaler at georgetown dot edu]
> > >
> > > == Affiliations ==
> > >
> > > The initial committers are from four organizations: Yahoo, Amazon,
> > > Georgetown University, and Metamarkets/Snap.
> > >
> > > === Champion ===
> > > (Recommended to me: )
> > >
> > > Liang Chen, Vice President of Apache CarbonData, [chenliang613 at
> apache
> > > dot org]
> > > Jean-Baptiste Onofré, [jb at nanthrax dot net]
> > >
> > > === Nominated Mentors ===
> > > (Recommended to me: )
> > >
> > > Liang Chen, Vice President of Apache CarbonData, [chenliang613 at
> apache
> > > dot org]
> > > Jean-Baptiste Onofré, [jb at nanthrax dot net]
> > > Gil Yehuda, [gyehuda at verizonmedia dot com]
> > >
> > > === Sponsoring Entity ===
> > >
> > > * The Apache Incubator    **** This is our 1st choice ****
> > >
> > > * Apache Druid. The incubating Apache Druid project might also be a
> logical
> > > sponsor. However, DataSketches has applications in many areas of
> computing
> > > outside of Druid so our preference and recommendation is that
> DataSketches
> > > would ultimately be a top-level Apache project.
> > >
> > > ________________
> > > [1] In 2017 Verizon acquired Yahoo and merged it with previously
> acquired
> > > AOL. The merged entity was originally called Oath, Inc., but has
> recently
> > > been renamed Verizon Media, Inc., a wholly-owned subsidiary of Verizon,
> > > Inc.  Since Yahoo is the more recognized name, references in this document
> > > to Yahoo are also references to Verizon Media, Inc.
> > >
> > > On Fri, Feb 22, 2019 at 9:35 PM Kenneth Knowles <ke...@apache.org>
> wrote:
> > >
> > > > The subject line has me interested already. Follow examples like this
> > > > maybe?
> > > >
> > > > 1.
> > > >
> > > >
> > >
> https://lists.apache.org/thread.html/a5db74cc9e5ae89b3bfa5f4b07bfcc18dae84b7098232fb897cd47b7@%3Cgeneral.incubator.apache.org%3E
> > > > 2.
> > > >
> > > >
> > >
> https://lists.apache.org/thread.html/5a7f6a218b11a1cac61fbd53f4c995fd7716f8ad3751cf9f171ebd57@%3Cgeneral.incubator.apache.org%3E
> > > >
> > > > Kenn
> > > >
> > > > On Fri, Feb 22, 2019 at 8:05 PM leerho <le...@gmail.com> wrote:
> > > >
> > > > > I'll try again ... :)
> > > > >
> > > > > > On Fri, Feb 22, 2019 at 8:00 PM Ted Dunning <ted.dunning@gmail.com> wrote:
> > > > >
> > > > >> It didn't make it again
> > > > >>
> > > > >> On Fri, Feb 22, 2019, 8:35 PM leerho <le...@gmail.com> wrote:
> > > > >>
> > > > >> > I'm not sure the attached document made it through.
> > > > >> >
> > > > >> > On Fri, Feb 22, 2019 at 7:28 PM leerho <le...@gmail.com>
> wrote:
> > > > >> >
> > > > >> > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > > >
> > > > >
> ---------------------------------------------------------------------
> > > > > To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> > > > > For additional commands, e-mail: general-help@incubator.apache.org
> > > >
> > >
> >
>
>
>