Posted to general@incubator.apache.org by Beng Chin OOI <oo...@comp.nus.edu.sg> on 2022/04/13 02:22:47 UTC
Proposal: to incubate a 100% Java-based Cohort OLAP system
Dear Apache Incubator Community,
This was earlier sent by QingPeng.
I am Beng Chin -- the initiator and a PMC member of Apache SINGA
(https://singa.apache.org/) -- an Apache top level project.
We would like to contribute COOL as an Apache Incubator project.
COOL is a cohort OLAP system specialized for cohort analysis with
extremely low latency. The vision of COOL is to address the inefficiency
of underlying database systems for cohort analysis (cohort queries),
which is increasingly used in areas such as customer acquisition and
CRM, medical cohort analysis (response of patients to
treatment/medicine, etc.), and fraud detection.
COOL is very fast -- one to two orders of magnitude faster than
implementations on known database systems using SQL.
It is flexible and powerful -- it can process complicated cohort
queries with flexible definitions of cohorts
and events with near real-time response.
COOL works with a few open source storage engines/backends, such as
Apache Avro, Apache Arrow, and Apache Parquet.
We need Champions and Mentors to help guide us on further development
of this open source project.
Thanks.
Regards,
Beng Chin Ooi
www.comp.nus.edu.sg/~ooibc
on behalf of COOL team
-----------------------------------------------------------
# COOL proposal
## Abstract
COOL is an online cohort analytical processing system specialized for
cohort analysis with extremely low latency.
## Proposal
The vision of COOL is to address the inefficiency of underlying database
systems in processing cohort analysis (cohort queries), an emerging and
widely used analysis pattern in various areas. In COOL, cohort query
processing is facilitated by specialized operators that involve only two
fast scans on sophisticated storage to achieve real-time responses.
COOL has been designed to provide user-friendly querying primitives to
address the pain point of writing complex and lengthy queries for cohort
analysis using SQL-like languages. Specifically, at least five SQL
queries are needed for a conventional OLAP database system to perform
cohort analysis in a non-intrusive manner.
We submit this proposal to donate the COOL system, its related code, and
artifacts (documentation, website application, wiki, etc.) to the Apache
Software Foundation Incubator. We are confident that COOL will further
promote the diversity of the Apache community and that Apache can
provide COOL with a better environment to build its community, making it
a useful and efficient tool for large-scale cohort analysis.
## Background
Cohort analysis (https://en.wikipedia.org/wiki/Cohort_analysis for quick
reference) is a method of analyzing metrics across different groups
(i.e., cohorts), which share common characteristics in the accumulated
data. These characteristics play a critical role in user profiling and
the decision-making process in data-driven organizations.
For example, cohort analysis is useful for analyzing customer retention
and the effectiveness of a promotional event. By observing the growth of
users during a user acquisition campaign, or by observing player
progression in online gaming, we can evaluate how different groups of
players evolve as time progresses. The efficiency of cohort query
processing is vital in such a scenario, as analysts may have to work out
strategies promptly for the online service.
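To make the retention example concrete, here is a minimal Java sketch of
what a cohort computation produces. It is illustrative only and does not
reflect COOL's operators, storage layout, or query format. The class
name CohortSketch, the Event type, and the choice of "week of first
activity" as the cohort-defining (birth) event are all assumptions made
for this example.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class CohortSketch {
    // Hypothetical activity record: a user was active in a given week.
    public static final class Event {
        final String user;
        final int week;

        public Event(String user, int week) {
            this.user = user;
            this.week = week;
        }
    }

    // Returns cohort (birth week) -> age (weeks since birth) -> distinct active users.
    public static Map<Integer, Map<Integer, Integer>> retention(List<Event> events) {
        // First pass: each user's birth week is the week of their earliest event.
        Map<String, Integer> birthWeek = new HashMap<>();
        for (Event e : events) {
            birthWeek.merge(e.user, e.week, Integer::min);
        }

        // Second pass: collect the distinct users seen for each (cohort, age) pair.
        Map<Integer, Map<Integer, Set<String>>> active = new TreeMap<>();
        for (Event e : events) {
            int cohort = birthWeek.get(e.user);
            int age = e.week - cohort;
            active.computeIfAbsent(cohort, k -> new TreeMap<>())
                  .computeIfAbsent(age, k -> new HashSet<>())
                  .add(e.user);
        }

        // Reduce each user set to a count.
        Map<Integer, Map<Integer, Integer>> result = new TreeMap<>();
        active.forEach((cohort, byAge) -> {
            Map<Integer, Integer> counts = new TreeMap<>();
            byAge.forEach((age, users) -> counts.put(age, users.size()));
            result.put(cohort, counts);
        });
        return result;
    }

    public static void main(String[] args) {
        List<Event> events = List.of(
            new Event("alice", 1), new Event("bob", 1),
            new Event("alice", 2), new Event("carol", 2),
            new Event("alice", 3), new Event("carol", 3));
        // Prints {1={0=2, 1=1, 2=1}, 2={0=1, 1=1}}
        System.out.println(retention(events));
    }
}
```

Reading the output: of the two users who first appeared in week 1, one
was still active one and two weeks later, while the single user who
first appeared in week 2 was still active one week later.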
Another example of cohort analysis could be a side-effect evaluation of
a clinical trial, in which the clinicians want to monitor and determine
the effectiveness of new medicines among different patient groups.
Unlike the analysts for online services, the clinicians can wait for a
much longer duration (over months or even years) to study the
effectiveness of treatments, etc. However, it is difficult for any
clinician to construct complex cohort queries (using SQL) to conduct
cohort analysis.
With the target of providing near real-time cohort analysis responses,
COOL was initiated as a research project around 2016. It has been used
for various real-world applications, such as sales of online game
gadgets/equipment, and sales of virtual assets and gear in online
games. The COOL system has been designed as a very efficient cohort
analytical processing system with a fast response time and flexible
definition of cohorts and events. It is at least one order of magnitude
faster than cohort processing using a conventional database engine.
For ease of use, COOL accepts a single self-defined query in JSON
format, rather than multiple complex SQL statements.
## Rationale
There is a strong and growing need to support cohort analysis
efficiently and effectively, and COOL meets this need well. Cohort
queries in COOL return in real time, at least one order of magnitude
faster than on traditional OLAP systems. Meanwhile, COOL accepts a
single self-defined query in JSON format, rather than multiple complex
SQL statements. COOL can also integrate data from different data
sources.
## Initial Goals
The initial goal is to move the existing codebase to the Apache Software
Foundation and improve it with the standard Apache development process.
We plan for incremental development in the following directions: more
storage connectors, more file format parsers, a feasible caching
mechanism, and utilizing COOL's cohort results to facilitate building
machine learning models. All these will be released in stages with the
community following the Apache process.
## Current Status
COOL was started as a research project in the database system lab of NUS
around 2016. All the code is available under the Apache License V2,
and the related artifacts can be found on GitHub.
The introduction website of the COOL system: http://13.212.103.48:3001/
The GitHub for the source code of the COOL system:
https://github.com/COOL-cohort/COOL
The GitHub for the source code of the COOL website:
https://github.com/COOL-cohort/COOL-site
The GitHub for the source code of the COOL webapp:
https://github.com/COOL-cohort/COOL-webapp
### Meritocracy
The project was originally created by David Jiang, Qingchao Cai, and
Zhongle Xie. The project now has committers and users from different
organizations in both Singapore and China.
All committers joined the project by submitting code that fixes bugs or
provides new features. If the proposal is accepted, we will work to
select PPMC members for the project and continue to operate in the
Apache Way.
### Community
Although we are in the early stage of building a well-organized
community, the need for cohort analysis is growing, especially as part
of deep customer relationship management (CRM) and medical cohort
analysis. Therefore, COOL should be able to attract more contributors
to join our community and improve its codebase. We also have many
experienced developers who have participated in building Apache SINGA
and other open-source projects, and we are capable of organizing a
well-developed community for COOL.
### Core Developers
Thus far, the initial core developers of COOL are experienced
researchers and engineers primarily from the National University of
Singapore and Zhejiang University. We also have new developers from
industry in the US.
A few early core developers have been involved in Apache SINGA and hence
are familiar with the Apache process.
### Alignment
Apache Incubator would be a perfect fit for the project for the
following reasons:
1. COOL enriches the ecosystem of OLAP systems among Apache projects,
since there is no specialized cohort analytical system in the current
project list.
2. The developer team of COOL is familiar with the Apache process and
the Apache Way. The lab has already contributed Apache SINGA, a
Top-Level Project, to the foundation, and a few members from Apache
SINGA have joined the COOL team.
3. Joining Apache can help attract and coordinate development efforts
from companies.
4. COOL can naturally connect with Apache projects like HDFS and
ZooKeeper.
## Known Risks
Currently, the development team members are mostly from universities and
research institutions. To fully become an "Apache-style" project, COOL
needs to embrace more developers from industry or the community.
### Project Name
The name (i.e., COOL) is short and easy to remember, and to the best of
our knowledge there are no similar names or projects that may cause a
conflict. Hence, we believe the name COOL is suitable for this project.
### Orphaned products
We believe that the COOL system will draw more attention from users in
industry and attract more developers to contribute to both the
codebase and community, because COOL not only conducts cohort analysis
with extremely low latency but also simplifies cohort queries by
removing the need to define complex join expressions.
We have already developed a website application to help prospective
users conduct cohort analysis with COOL.
In practice, we have also deployed the COOL system at the National
University Health System to assist clinicians in finding insightful
patterns among COVID-19 patients from cohort results. Meanwhile, the
team has cooperated with a few companies in building their user cohort
analysis applications.
We plan to improve the COOL system from different aspects, such as more
storage connectors, more file format parsers, a feasible caching
mechanism, and utilizing COOL's cohort results to facilitate building
machine learning models.
### Inexperience with Open Source
Our initial committers include several experienced developers who have
participated in the Apache SINGA project. In fact, some of them are
core contributors to the project and members of its PPMC. Hence, we have
the experience to grow the community and maintain participation.
### Length of Incubation
We have made preliminary plans for improving COOL in different
aspects and are devoted to realizing them. Besides, our committers are
experienced in developing open-source projects and have participated in
growing a well-organized community. Hence, we believe all these steps
are realizable.
### Homogenous Developers
The current core developers are mainly researchers from the National
University of Singapore and Zhejiang University. We also have a small
number of developers from ByteDance and other enterprises. We do want to
build a well-organized community and encourage developers to join and
promote the development of our COOL system.
### Reliance on Salaried Developers
Most of the developers work for research labs and universities, or
are studying for their doctorates. They build the COOL system while
conducting their research on cohort analysis and cohort-based neural
network models. The COOL system will be a powerful tool to facilitate
advanced cohort analytics in the commercial world and in scientific
research that makes use of cohort analysis (e.g., reactions to
drugs and treatments).
### Relationships with Other Apache Products
COOL connects naturally with Apache projects like HDFS and
ZooKeeper. In addition, COOL supports Parquet files as a way to load
data from other systems into COOL and to export data for downstream
analysis tasks. Support for Apache Avro and Apache Arrow is also on
our schedule.
### An Excessive Fascination with the Apache Brand
Without a doubt, we appreciate the reputation of the Apache brand, which
will help to attract contributors and users. We also appreciate the
Apache development process. We believe that COOL, as a specialized OLAP
system for cohort analysis, can promote the diversification of the
Apache community.
## Documentation
An introduction to the COOL system can be found at:
http://13.212.103.48:3001/
## Initial Source
The COOL codebase is written in Java and relies on Maven to compile and
build the COOL engine. We have also prepared website applications and
use cases that demonstrate how to leverage COOL. More details can be
found on the introduction webpage or in the Git repositories.
### Source and Intellectual Property Submission Plan
Once COOL is accepted and sponsored by Apache, we can transfer all
source code and copyrights to the Apache Software Foundation.
### External Dependencies
All dependencies of the COOL system comply with the Apache License V2.
### Cryptography
Not applicable to COOL.
## Required Resources
### Mailing lists
We plan to use the following mailing lists:
• users@cool.incubator.apache.org
• dev@cool.incubator.apache.org
• private@cool.incubator.apache.org
• commits@cool.incubator.apache.org
### Subversion Directory
We prefer to continue using Git to control our COOL system development.
### Git Repositories
• COOL system: https://github.com/COOL-cohort/COOL
• COOL website: https://github.com/COOL-cohort/COOL-site
• COOL webapp: https://github.com/COOL-cohort/COOL-webapp
### Issue Tracking
We would like to use JIRA to track issues.
## Initial Committers
• Beng Chin Ooi (ooibc@comp.nus.edu.sg)
• Zhongle Xie (xiezl@zju.edu.cn)
• Meihui Zhang (meihui_zhang@bit.edu.cn)
• Qingpeng Cai (qingpeng@comp.nus.edu.sg)
• Naili Xing (dcsxing@nus.edu.sg)
• Guoyu Hu (guoyu.hu@u.nus.edu)
• Hongbin Ying (yinghongbin@mzhtechnologies.com)
• Changshuo Liu (changshuo@u.nus.edu)
• Fei Xiao (fxiao004@comp.nus.edu.sg)
• Yuncheng Wu (dcswuyu@nus.edu.sg)
• Gang Chen (cg@zju.edu.cn)
• Pengyuan Shen (shenpy@mzhtechnologies.com)
• Chenghao Cai (chenghao.cai@nusri.cn)
• Ishant virendra Wankhede (ishant.virendra.wankhede@walmart.com)
• Linsey Pang (xpang@salesforce.com; panglinsey@gmail.com)
• Raghav Chalapathy (raghav.chalapathy@gmail.com;
raghav.chalapathy@walmart.com)
## Affiliations
• Beng Chin Ooi, National University of Singapore
• Zhongle Xie, Zhejiang University
• Meihui Zhang, Beijing Institute of Technology
• Qingpeng Cai, National University of Singapore
• Naili Xing, National University of Singapore
• Hongbin Ying, MZH Technologies
• Guoyu Hu, National University of Singapore
• Changshuo Liu, National University of Singapore
• Fei Xiao, National University of Singapore
• Yuncheng Wu, National University of Singapore
• Gang Chen, Zhejiang University
• Pengyuan Shen, MZH Technologies
• Chenghao Cai, NUS AI Innovation and Commercialisation Centre
• Ishant virendra Wankhede, Walmart
• Linsey Pang, Salesforce
• Raghav Chalapathy, Walmart
## Sponsors
### Champion
TODO
### Nominated Mentors
TODO
### Sponsoring Entity
The Apache Incubator
---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
For additional commands, e-mail: general-help@incubator.apache.org