Posted to general@incubator.apache.org by Henry Saputra <he...@gmail.com> on 2016/11/23 23:30:21 UTC

[DISCUSS] Proposing Griffin for Apache incubator

Hi All,

As the champion for Griffin, I would like to open a discussion about
bringing the project into the Apache Incubator as a podling.

Here is the direct quote from the abstract:

"
Griffin is a Data Quality Service platform built on Apache Hadoop and
Apache Spark. It provides a framework for defining data quality
models, executing data quality measurements, and automating data
profiling and validation, as well as unified data quality
visualization across multiple data systems. It aims to address data
quality challenges in big data and streaming contexts.
"

Here is the link to the proposal:
https://wiki.apache.org/incubator/GriffinProposal

I have copied the proposal below for easy access


Thanks,

- Henry


Griffin Proposal

Abstract

Griffin is a Data Quality Service platform built on Apache Hadoop and
Apache Spark. It provides a framework for defining data quality
models, executing data quality measurements, and automating data
profiling and validation, as well as unified data quality
visualization across multiple data systems. It aims to address data
quality challenges in big data and streaming contexts.

Proposal

Griffin is an open source Data Quality solution for distributed data
systems at any scale, in both streaming and batch data contexts. When
people use open source products (e.g. Apache Hadoop, Apache Spark,
Apache Kafka, Apache Storm), they need a data quality service to build
confidence in the data processed by those platforms. Griffin creates a
unified process to define and construct data quality measurement
pipelines across multiple data systems to provide:

Automatic quality validation of the data
Data profiling and anomaly detection
Data quality lineage from upstream to downstream data systems.
Data quality health monitoring visualization
Shared infrastructure resource management

Overview of Griffin

Griffin has been deployed in production at eBay, serving major data
systems. It takes a platform approach, providing generic features to
solve common data quality validation pain points. First, a user
registers the data asset on which they want to run data quality
checks. The data asset can be batch data in an RDBMS (e.g. Teradata)
or an Apache Hadoop system, or near real-time streaming data from
Apache Kafka, Apache Storm and other real-time data platforms. Second,
the user creates a data quality model to define the data quality rules
and metadata. Third, the model or rule is executed automatically (by
the model engine); for streaming data, sample data quality validation
results are available within a few seconds. Finally, the user can
analyze the data quality results through the built-in visualization
tool and take action.

Griffin includes:

Data Quality Model Engine

Griffin is a model-driven solution: users choose from various data
quality dimensions to run their data quality validation against a
selected target data set or source data set (the golden reference
data). A corresponding back-end library supports the following
measurements (an illustrative sketch follows the list):

Accuracy - Does the data reflect real-world objects or a verifiable source?
Completeness - Is all necessary data present?
Validity - Are all data values within the data domains specified by the business?
Timeliness - Is the data available at the time needed?
Anomaly detection - Pre-built algorithm functions for identifying
items, events or observations which do not conform to an expected
pattern or to other items in a dataset
Data Profiling - Statistical analysis and assessment of data values
within a dataset for consistency, uniqueness and logic
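
As an illustration only, here is a minimal sketch of how an accuracy
measurement could be computed on Apache Spark. The table names, the
join key "id", and the SparkSession-based Spark SQL API are
assumptions made for the example; this is not Griffin's actual
model-engine code.

    import org.apache.spark.sql.SparkSession

    object AccuracySketch {
      def main(args: Array[String]): Unit = {
        // Hypothetical table names; in Griffin these would come from the
        // registered data assets and the user-defined model.
        val spark = SparkSession.builder()
          .appName("accuracy-sketch")
          .enableHiveSupport()
          .getOrCreate()

        val source = spark.table("source_tbl")  // golden reference data set
        val target = spark.table("target_tbl")  // data set under validation

        // Source records that have a matching record in the target, joined on "id".
        val matched = source.join(target, Seq("id"), "left_semi").count()
        val total   = source.count()

        val accuracy = if (total == 0L) 1.0 else matched.toDouble / total
        println(s"accuracy = $accuracy ($matched of $total source records matched)")

        spark.stop()
      }
    }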

Data Collection Layer

We support two kinds of data sources: batch data and real-time data.

In batch mode, we collect data from Apache Hadoop based platforms
through various data connectors.

In real-time mode, we connect to messaging systems such as Kafka for
near real-time analysis.
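
For illustration, the following is a minimal sketch of consuming a
Kafka topic from Spark Streaming, assuming the spark-streaming-kafka
integration is on the classpath. The broker address, topic name, group
id and batch interval are placeholders, not Griffin's actual connector
configuration.

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010._

    object KafkaCollectSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("kafka-collect-sketch")
        val ssc  = new StreamingContext(conf, Seconds(10))

        val kafkaParams = Map[String, Object](
          "bootstrap.servers"  -> "localhost:9092",
          "key.deserializer"   -> classOf[StringDeserializer],
          "value.deserializer" -> classOf[StringDeserializer],
          "group.id"           -> "griffin-sketch"
        )

        val stream = KafkaUtils.createDirectStream[String, String](
          ssc,
          LocationStrategies.PreferConsistent,
          ConsumerStrategies.Subscribe[String, String](Seq("events"), kafkaParams)
        )

        // Count records per micro-batch as a stand-in for a real quality measure.
        stream.map(_.value()).count().print()

        ssc.start()
        ssc.awaitTermination()
      }
    }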

Data Process and Storage Layer

For batch analysis, our data quality model computes data quality
metrics in our Spark cluster, based on data sources in Apache Hadoop.

For near real-time analysis, we consume data from the messaging system
and our data quality model then computes real-time data quality
metrics in our Spark cluster. For data storage, we use a time series
database in the back end to fulfill front-end requests.
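
Since the dependency list below names InfluxData, here is a hedged
example of one way a computed metric value could be pushed into an
InfluxDB instance over its HTTP line-protocol write endpoint. The
host, database name, measurement name and tag are placeholders chosen
for the sketch, not Griffin's actual storage code.

    import java.net.{HttpURLConnection, URL}
    import java.nio.charset.StandardCharsets

    object MetricStoreSketch {
      // Writes one data quality metric point to a local InfluxDB instance.
      def writeMetric(model: String, value: Double, epochSeconds: Long): Int = {
        val url  = new URL("http://localhost:8086/write?db=quality_metrics&precision=s")
        val conn = url.openConnection().asInstanceOf[HttpURLConnection]
        conn.setRequestMethod("POST")
        conn.setDoOutput(true)

        // Line protocol: <measurement>,<tags> <fields> <timestamp>
        val line = s"dq_metric,model=$model value=$value $epochSeconds"
        val out  = conn.getOutputStream
        out.write(line.getBytes(StandardCharsets.UTF_8))
        out.close()

        val status = conn.getResponseCode  // 204 indicates a successful write
        conn.disconnect()
        status
      }

      def main(args: Array[String]): Unit = {
        val status = writeMetric("accuracy", 0.997, System.currentTimeMillis() / 1000)
        println(s"InfluxDB responded with HTTP $status")
      }
    }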

Griffin Service

RESTful web services expose all of Griffin's functionality, such as
registering data assets, creating data quality models, publishing
metrics, retrieving metrics and adding subscriptions. Developers can
therefore build their own user interfaces on top of these web
services.
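
As a sketch of what such a client could look like, the snippet below
issues a plain HTTP GET from Scala. The host, port and endpoint path
are hypothetical and used purely for illustration; the real routes are
defined by the Griffin service itself.

    import java.net.{HttpURLConnection, URL}
    import scala.io.Source

    object GriffinClientSketch {
      def main(args: Array[String]): Unit = {
        // Hypothetical metrics endpoint for a registered data asset.
        val url  = new URL("http://localhost:8080/api/v1/metrics?asset=viewitem_hourly")
        val conn = url.openConnection().asInstanceOf[HttpURLConnection]
        conn.setRequestMethod("GET")

        val body = Source.fromInputStream(conn.getInputStream, "UTF-8").mkString
        conn.disconnect()

        println(body)  // e.g. a JSON list of metric values for the asset
      }
    }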

Background

At eBay, when people work with big data in Apache Hadoop (or other
streaming data), data quality often becomes a major challenge.
Different teams have built customized data quality tools to detect and
analyze data quality issues within their own domains. We want to take
a platform approach that provides shared infrastructure and generic
features to solve common data quality pain points. This would enable
us to build trusted data assets.

Currently it is very difficult and costly to validate data quality
when big data flows across multiple platforms at eBay (e.g. Oracle,
Apache Hadoop, Couchbase, Apache Cassandra, Apache Kafka, MongoDB).
Take eBay's real-time personalization platform as an example: every
day we have to validate the data quality status of ~600M records (we
have roughly 150M active users on our website). Data quality is a
major challenge in both its streaming and batch pipelines.

We have identified three data quality problems at eBay:

Lack of an end-to-end unified view of data quality measurement from
multiple data sources to target applications; it usually takes a long
time to identify and fix poor data quality.
No way to measure data quality in streaming mode; we need a process
and tool to visualize data quality insights by registering the dataset
to be checked, creating a data quality measurement model, executing
the data quality validation job and getting metric insights to act on.
No shared platform and API service; each team has to acquire and
manage its own hardware and software infrastructure.

Rationale

The challenge we face at eBay is that our data volume keeps growing
and our system processes keep getting more complex, while we do not
have a unified data quality solution to ensure trusted data sets and
give our data consumers confidence in data quality. The key challenges
on data quality include:

Existing commercial data quality solutions cannot address data quality
lineage among systems and cannot scale out to support fast-growing
data at eBay.
eBay's existing domain-specific tools take a long time to identify and
fix poor data quality when data flows through multiple systems.
Business logic is becoming more complex, requiring a much more
flexible data quality system.

Some data quality issues have a direct business impact on user
experience, revenue, efficiency and compliance.

Communicating data quality metrics carries overhead, especially in a
big organization involving many different teams.

The idea of Griffin is to provide Data Quality validation as a
service, allowing data engineers and data consumers to have:

Near real-time understanding of the data quality health of their data
pipelines, with end-to-end monitoring, all in one place.
Profiling, detection and correlation of issues, with recommendations
that drive rapid and focused troubleshooting.
A centralized data quality model management system, including rules,
metadata, a scheduler, etc.
Native code generation to run everywhere, including Hadoop, Kafka, Spark, etc.
One set of tools to build data quality pipelines across all eBay data platforms.

Current Status

Meritocracy

Griffin has been deployed in production at eBay and provides a
centralized data quality service for several eBay systems (for
example, the real-time personalization platform, the eBay real-time ID
linking platform, Hadoop datasets and the site speed analytics
platform). Our aim is to build a diverse developer and user community
following the Apache meritocracy model. We will encourage
contributions and participation of all kinds, and ensure that
contributors are appropriately recognized.

Community

Currently the project is being developed at eBay and is used only by
the eBay internal community. Griffin seeks to grow its developer and
user communities during incubation. We believe the project will grow
substantially by becoming an Apache project.

Core Developers

Griffin is currently being designed and developed by engineers from
eBay Inc. – William Guo, Alex Lv, Shawn Sha, Vincent Zhao, John Liu.
All of these core developers have deep expertise in Apache Hadoop and
the Hadoop Ecosystem in general.

Alignment

The ASF is a natural host for Griffin given that it is already the
home of Hadoop, Beam, HBase, Hive, Storm, Kafka, Spark and other big
data products, all of which inherently need a data quality solution to
ensure the quality of the data they process. When people adopt open
source data technology, a big question for them is how to ensure data
quality on top of it. Griffin leverages many Apache open-source
products, and was designed to enable real-time insight into data
quality validation through shared infrastructure and generic features
that solve common data quality pain points.

Known Risks

Orphaned Products

The core developers of the Griffin team work full time on this
project. There is no risk of Griffin becoming orphaned, since at least
one large company (eBay) is using it extensively in its production
Hadoop and Spark clusters across multiple data systems. Currently,
four data systems at eBay (the real-time personalization platform, the
eBay real-time ID linking platform, Hadoop and the site speed
analytics platform) leverage Griffin, with roughly 600M records
validated for data quality status every day, 35 data sets monitored
and 50+ data quality models created.

As Griffin is designed to connect to many types of data sources, we
are confident that users will adopt it as a service for ensuring data
quality across open source data ecosystems. We plan to extend and
diversify this community further through Apache.

Inexperience with Open Source

Griffin's core engineers are all active users and followers of open
source projects. They are already committers and contributors to the
Griffin GitHub project. All have been involved with source code
released under an open source license, and several of them also have
experience developing code in an open source environment. Though the
core set of developers does not have Apache open source experience,
there are plans to bring individuals with Apache open source
experience onto the project.

Homogenous Developers

The core developers are all from eBay. The Apache incubation process
encourages an open and diverse meritocratic community. Griffin intends
to make every possible effort to build a diverse, vibrant and involved
community. We are committed to recruiting additional committers from
other companies based on their contributions to the project.

Reliance on Salaried Developers

eBay has invested in Griffin as a company-wide data quality service
platform, and some of its key engineers are working full time on the
project; they are all paid by eBay. We look forward to contributions
from other Apache developers and researchers.

Relationships with Other Apache Products

Griffin has strong relationships with, and dependencies on, Apache
Hadoop, Apache HBase, Apache Spark, Apache Kafka, Apache Storm and
Apache Hive. In addition, since there is a growing need for a data
quality solution for open source platforms (e.g. Hadoop, Kafka,
Spark), being part of the Apache Incubator community could foster
closer collaboration among these projects as well as others.

Documentation

Information about Griffin can be found at https://github.com/eBay/griffin

Initial Source

Griffin has been under development since early 2016 by a team of
engineers at eBay Inc. It is currently hosted on GitHub under the
Apache License 2.0 at https://github.com/eBay/griffin . Once in
incubation, we will move the code base to the Apache git repositories.

External Dependencies

Griffin has the following external dependencies.

Basic

JDK 1.7+
Scala
Apache Maven
JUnit
Log4j
Slf4j
Apache Commons

Hadoop

Apache Hadoop
Apache HBase
Apache Hive

DB

InfluxData

Apache Spark

Spark Core Library

REST Service

Jersey
Spring MVC

Web frontend

AngularJS
jQuery
Bootstrap
RequireJS
eCharts
Font Awesome

Cryptography

Currently there's no cryptography in Griffin.

Required Resources

Mailing List

We currently use eBay mailboxes to communicate, but we'd like to move
to ASF-maintained mailing lists.

Current mailing list: ebay-griffin-devs@googlegroups.com

Proposed ASF maintained lists:

private@griffin.incubator.apache.org

dev@griffin.incubator.apache.org

commits@griffin.incubator.apache.org

Subversion Directory

Git is the preferred source control system.

Issue Tracking

JIRA

Other Resources

The existing code already has unit tests, so we will make use of the
existing Apache continuous testing infrastructure. The resulting load
should not be very large.

Initial Committers

William Go
Alex Lv
Vincent Zhao
Shawn Sha
John Liu
Liang Shao

Affiliations

The initial committers are employees of eBay Inc.

Sponsors

Champion

Henry Saputra (hsaputra@apache.org)

Nominated Mentors

Kasper Sørensen (kaspersor@apache.org)

Uma Maheswara Rao Gangumalla (umamahesh@apache.org)

Luciano Resende (luckbr1975@gmail.com)

Sponsoring Entity

We are requesting the Incubator to sponsor this project.

---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
For additional commands, e-mail: general-help@incubator.apache.org


Re: [DISCUSS] Proposing Griffin for Apache incubator

Posted by Paul King <pa...@asert.com.au>.
Just the name worries me a little bit, in case the Griffon project
ever wanted to move to Apache, though they don't have any plans to do
so at the moment.

http://griffon-framework.org

Cheers Paul

On 29 Nov 2016 9:35 PM, "Henry Saputra" <he...@gmail.com> wrote:

> It seems like the discussion is calming down.
> I will send the VOTE thread by end of day tomorrow.
>
> Thanks,
>
> Henry
>

Re: [DISCUSS] Proposing Griffin for Apache incubator

Posted by Henry Saputra <he...@gmail.com>.
It seems like the discussion is calming down.
I will send the VOTE thread by end of day tomorrow.

Thanks,

Henry


Re: [DISCUSS] Proposing Griffin for Apache incubator

Posted by 吕 志兴 <Lu...@hotmail.com>.
No, none of the 3 mentors is from eBay.


Kasper Sørensen (kaspersor@apache.org)

Uma Maheswara Rao Gangumalla (umamahesh@apache.org)

Luciano Resende (luckbr1975@gmail.com)


Thanks.

Alex

________________________________
From: John D. Ament <jo...@apache.org>
Sent: 24 November 2016 9:06:48
To: general@incubator.apache.org
Subject: Re: [DISCUSS] Proposing Griffin for Apache incubator

Ah shoot ok :-)
I'm used to seeing it next to the committer's names.  I guess that works
just as well.

Are the mentors all eBay as well?

John

On Wed, Nov 23, 2016 at 8:04 PM Henry Saputra <he...@gmail.com>
wrote:

> Hi John,
>
> We have added this comment in the proposal:
>
> "
> The initial committers are employees of eBay Inc.
> "
>
> - Henry
>
> On Wed, Nov 23, 2016 at 4:50 PM, John D. Ament <jo...@apache.org>
> wrote:
>
> > Henry,
> >
> > Can you add initial committer affiliations to the proposal?
> >
> > John
> >
> > On Wed, Nov 23, 2016 at 6:30 PM Henry Saputra <he...@gmail.com>
> > wrote:
> >
> > > Hi All,
> > >
> > > As the champion for Griffin, I would like to bring up discussion to
> > > bring the project as Apache incubator podling.
> > >
> > > Here is the direct quote from the abstract:
> > >
> > > "
> > > Griffin is a Data Quality Service platform built on Apache Hadoop and
> > > Apache Spark. It provides a framework process for defining data
> > > quality model, executing data quality measurement, automating data
> > > profiling and validation, as well as a unified data quality
> > > visualization across multiple data systems. It tries to address the
> > > data quality challenges in big data and streaming context.
> > > "
> > >
> > > Here is the link to the proposal:
> > > https://wiki.apache.org/incubator/GriffinProposal
> > >
> > > I have copied the proposal below for easy access
> > >
> > >
> > > Thanks,
> > >
> > > - Henry
> > >
> > >
> > > Griffin Proposal
> > >
> > > Abstract
> > >
> > > Griffin is a Data Quality Service platform built on Apache Hadoop and
> > > Apache Spark. It provides a framework process for defining data
> > > quality model, executing data quality measurement, automating data
> > > profiling and validation, as well as a unified data quality
> > > visualization across multiple data systems. It tries to address the
> > > data quality challenges in big data and streaming context.
> > >
> > > Proposal
> > >
> > > Griffin is a open source Data Quality solution for distributed data
> > > systems at any scale in both streaming or batch data context. When
> > > people use open source products (e.g. Apache Hadoop, Apache Spark,
> > > Apache Kafka, Apache Storm), they always need a data quality service
> > > to build his/her confidence on data quality processed by those
> > > platforms. Griffin creates a unified process to define and construct
> > > data quality measurement pipeline across multiple data systems to
> > > provide:
> > >
> > > Automatic quality validation of the data
> > > Data profiling and anomaly detection
> > > Data quality lineage from upstream to downstream data systems.
> > > Data quality health monitoring visualization
> > > Shared infrastructure resource management
> > >
> > > Overview of Griffin
> > >
> > > Griffin has been deployed in production at eBay serving major data
> > > systems, it takes a platform approach to provide generic features to
> > > solve common data quality validation pain points. Firstly, user can
> > > register the data asset which user wants to do data quality check. The
> > > data asset can be batch data in RDBMS (e.g.Teradata), Apache Hadoop
> > > system or near real-time streaming data from Apache Kafka, Apache
> > > Storm and other real time data platforms. Secondly, user can create
> > > data quality model to define the data quality rule and metadata.
> > > Thirdly, the model or rule will be executed automatically (by the
> > > model engine) to get the sample data quality validation results in a
> > > few seconds for streaming data. Finally, user can analyze the data
> > > quality results through built-in visualization tool to take actions.
> > >
> > > Griffin includes:
> > >
> > > Data Quality Model Engine
> > >
> > > Griffin is model driven solution, user can choose various data quality
> > > dimension to execute his/her data quality validation based on selected
> > > target data-set or source data-set ( as the golden reference data). It
> > > has a corresponding library supporting it in back-end for the
> > > following measurement:
> > >
> > > Accuracy - Does data reflect the real-world objects or a verifiable
> > source
> > > Completeness - Is all necessary data present
> > > Validity - Are all data values within the data domains specified by the
> > > business
> > > Timeliness - Is the data available at the time needed
> > > Anomaly detection - Pre-built algorithm functions for the
> > > identification of items, events or observations which do not conform
> > > to an expected pattern or other items in a dataset
> > > Data Profiling - Apply statistical analysis and assessment of data
> > > values within a dataset for consistency, uniqueness and logic.
> > >
> > > Data Collection Layer
> > >
> > > We support two kinds of data sources, batch data and real time data.
> > >
> > > For batch mode, we can collect data source from Apache Hadoop based
> > > platform by various data connectors.
> > >
> > > For real time mode, we can connect with messaging system like Kafka to
> > > near real time analysis.
> > >
> > > Data Process and Storage Layer
> > >
> > > For batch analysis, our data quality model will compute data quality
> > > metrics in our spark cluster based on data source in Apache Hadoop.
> > >
> > > For near real time analysis, we consume data from messaging system,
> > > then our data quality model will compute our real time data quality
> > > metrics in our spark cluster. for data storage, we use time series
> > > database in our back end to fulfill front end request.
> > >
> > > Griffin Service
> > >
> > > We have RESTful web services to accomplish all the functionalities of
> > > Griffin, such as register data asset, create data quality model,
> > > publish metrics, retrieve metrics, add subscription, etc. So, the
> > > developers can develop their own user interface based on these web
> > > services.
> > >
> > > Background
> > >
> > > At eBay, when people play with big data in Apache Hadoop (or other
> > > streaming data), data quality often becomes one big challenge.
> > > Different teams have built customized data quality tools to detect and
> > > analyze data quality issues within their own domain. We are thinking
> > > to take a platform approach to provide shared Infrastructure and
> > > generic features to solve common data quality pain points. This would
> > > enable us to build trusted data assets.
> > >
> > > Currently it’s very difficult and costly to do data quality validation
> > > when we have big data flow across multi-platforms at eBay (e.g.
> > > Oracle, Apache Hadoop, Couchbase, Apache Cassandra, Apache Kafka,
> > > MongoDB). Take eBay real time personalization platform as an example.
> > > Every day we have to validate data quality status for ~600M records (
> > > imagine we have 150M active users for our website). Data quality often
> > > becomes one big challenge both in its streaming and batch pipelines.
> > >
> > > So we conclude 3 data quality problems at eBay:
> > >
> > > Lack of end2end unified view of data quality measurement from multiple
> > > data sources to target applications, it usually takes a long time to
> > > identify and fix poor data quality.
> > > How to get data quality measured in streaming mode, we need to have a
> > > process and tool to visualize data quality insights through
> > > registering dataset which you want to check data quality, creating
> > > data quality measurement model, executing the data quality validation
> > > job and getting metrics insights for action taking.
> > > No Shared platform and API Service, have to apply and manage own
> > > hardware and software infrastructure.
> > >
> > > Rationale
> > >
> > > The challenge we face at eBay is that our data volume is becoming
> > > bigger and bigger, system processes become more complex, while we do
> > > not have a unified data quality solution to ensure the trusted data
> > > sets which provide confidences on data quality to our data consumers.
> > > The key challenges on data quality includes:
> > >
> > > Existing commercial data quality solution cannot address data quality
> > > lineage among systems, cannot scale out to support fast growing data
> > > at eBay
> > > Existing eBay's domain specific tools take a long time to identify and
> > > fix poor data quality when data flowed through multiple systems
> > > Business logic becomes complex, requires data quality system much
> > flexible.
> > >
> > > Some data quality issues do have business impact on user experiences,
> > > revenue, efficiency & compliance.
> > >
> > > Communication overhead of data quality metrics, typically in a big
> > > organization, which involve different teams.
> > >
> > > The idea of Griffin is to provide Data Quality validation as a
> > > Service, to allow data engineers and data consumers to have:
> > >
> > > Near real-time understanding of the data quality health of your data
> > > pipelines with end-to-end monitoring, all in one place.
> > > Profiling, detecting and correlating issues and providing
> > > recommendations that drive rapid and focused troubleshooting
> > > A centralized data quality model management system including rule,
> > > metadata, scheduler etc.
> > > Native code generation to run everywhere, including Hadoop, Kafka,
> Spark,
> > > etc.
> > > One set of tools to build data quality pipelines across all eBay data
> > > platforms.
> > >
> > > Current Status
> > >
> > > Meritocracy
> > >
> > > Griffin has been deployed in production at eBay and provided the
> > > centralized data quality service for several eBay systems ( for
> > > example, real time personalization platform, eBay real time ID linking
> > > platform, Hadoop datasets, Site speed analytics platform). Our aim is
> > > to build a diverse developer and user community following the Apache
> > > meritocracy model. We will encourage contributions and participation
> > > of all types of work, and ensure that contributors are appropriately
> > > recognized.
> > >
> > > Community
> > >
> > > Currently the project is being developed at eBay. It's only for eBay
> > > internal community. Griffin seeks to develop the developer and user
> > > communities during incubation. We believe it will grow substantially
> > > by becoming an Apache project.
> > >
> > > Core Developers
> > >
> > > Griffin is currently being designed and developed by engineers from
> > > eBay Inc. – William Guo, Alex Lv, Shawn Sha, Vincent Zhao, John Liu.
> > > All of these core developers have deep expertise in Apache Hadoop and
> > > the Hadoop Ecosystem in general.
> > >
> > > Alignment
> > >
> > > The ASF is a natural host for Griffin given that it is already the
> > > home of Hadoop, Beam, HBase, Hive, Storm, Kafka, Spark and other
> > > emerging big data products. Those are requiring data quality solution
> > > by nature to ensure the data quality which they processed. When people
> > > use open source data technology, the big question to them is that how
> > > we can ensure the data quality in it. Griffin leverages lot of Apache
> > > open-source products. Griffin was designed to enable real time
> > > insights into data quality validation by shared Infrastructure and
> > > generic features to solve common data quality pain points.
> > >
> > > Known Risks
> > >
> > > Orphaned Products
> > >
> > > The core developers of the Griffin team work full time on this
> > > project. There is no risk of Griffin becoming orphaned, since at
> > > least one large company (eBay) uses it extensively in its production
> > > Hadoop and Spark clusters for multiple data systems. For example,
> > > four data systems at eBay (the real-time personalization platform,
> > > the eBay real-time ID linking platform, Hadoop, and the site speed
> > > analytics platform) currently leverage Griffin, with more than 600M
> > > records validated for data quality status every day, 35 data sets
> > > monitored, and 50+ data quality models created.
> > >
> > > As Griffin is designed to connect many types of data sources, we are
> > > very confident that users will adopt Griffin as a service for
> > > ensuring data quality in open source data ecosystems. We plan to
> > > extend and diversify this community further through Apache.
> > >
> > > Inexperience with Open Source
> > >
> > > Griffin's core engineers are all active users and followers of open
> > > source projects. They are already committers and contributors to the
> > > Griffin GitHub project. All have been involved with source code that
> > > has been released under an open source license, and several of them
> > > also have experience developing code in an open source environment.
> > > Though the core set of developers does not have Apache open source
> > > experience, there are plans to onboard individuals with Apache open
> > > source experience onto the project.
> > >
> > > Homogenous Developers
> > >
> > > The core developers are from eBay. The Apache incubation process
> > > encourages an open and diverse meritocratic community. Griffin
> > > intends to make every possible effort to build a diverse, vibrant and
> > > involved community. We are committed to recruiting additional
> > > committers from other companies based on their contributions to the
> > > project.
> > >
> > > Reliance on Salaried Developers
> > >
> > > eBay has invested in Griffin as a company-wide data quality service
> > > platform, and some of its key engineers work full time on the
> > > project; they are all paid by eBay. We look forward to other Apache
> > > developers and researchers contributing to the project.
> > >
> > > Relationships with Other Apache Products
> > >
> > > Griffin has a strong relationship with, and dependency on, Apache
> > > Hadoop, Apache HBase, Apache Spark, Apache Kafka, Apache Storm and
> > > Apache Hive. In addition, since there is a growing need for a data
> > > quality solution for open source platforms (e.g. Hadoop, Kafka,
> > > Spark), being part of Apache's incubation community could help foster
> > > closer collaboration among these projects as well as others.
> > >
> > > Documentation
> > >
> > > Information about Griffin can be found at
> > > https://github.com/eBay/griffin
> > >
> > > Initial Source
> > >
> > > Griffin has been under development since early 2016 by a team of
> > > engineers at eBay Inc. It is currently hosted on GitHub under the
> > > Apache License 2.0 at https://github.com/eBay/griffin . Once in
> > > incubation, we will move the code base to the Apache Git repositories.
> > >
> > > External Dependencies
> > >
> > > Griffin has the following external dependencies.
> > >
> > > Basic
> > >
> > > JDK 1.7+
> > > Scala
> > > Apache Maven
> > > JUnit
> > > Log4j
> > > Slf4j
> > > Apache Commons
> > >
> > > Hadoop
> > >
> > > Apache Hadoop
> > > Apache HBase
> > > Apache Hive
> > >
> > > DB
> > >
> > > InfluxData
> > >
> > > Apache Spark
> > >
> > > Spark Core Library
> > >
> > > REST Service
> > >
> > > Jersey
> > > Spring MVC
> > >
> > > Web frontend
> > >
> > > AngularJS
> > > jQuery
> > > Bootstrap
> > > RequireJS
> > > eCharts
> > > Font Awesome
> > >
> > > Cryptography
> > >
> > > Currently there's no cryptography in Griffin.
> > >
> > > Required Resources
> > >
> > > Mailing List
> > >
> > > We currently use eBay mailboxes to communicate, but we'd like to move
> > > to ASF maintained mailing lists.
> > >
> > > Current mailing list: ebay-griffin-devs@googlegroups.com
> > >
> > > Proposed ASF maintained lists:
> > >
> > > private@griffin.incubator.apache.org
> > >
> > > dev@griffin.incubator.apache.org
> > >
> > > commits@griffin.incubator.apache.org
> > >
> > > Subversion Directory
> > >
> > > Git is the preferred source control system.
> > >
> > > Issue Tracking
> > >
> > > JIRA
> > >
> > > Other Resources
> > >
> > > The existing code already has unit tests so we will make use of
> > > existing Apache continuous testing infrastructure. The resulting load
> > > should not be very large.
> > >
> > > Initial Committers
> > >
> > > William Guo
> > > Alex Lv
> > > Vincent Zhao
> > > Shawn Sha
> > > John Liu
> > > Liang Shao
> > >
> > > Affiliations
> > >
> > > The initial committers are employees of eBay Inc.
> > >
> > > Sponsors
> > >
> > > Champion
> > >
> > > Henry Saputra (hsaputra@apache.org)
> > >
> > > Nominated Mentors
> > >
> > > Kasper Sørensen (kaspersor@apache.org)
> > >
> > > Uma Maheswara Rao Gangumalla (umamahesh@apache.org)
> > >
> > > Luciano Resende (luckbr1975@gmail.com)
> > >
> > > Sponsoring Entity
> > >
> > > We are requesting the Incubator to sponsor this project.
> > >
> > >
> > >
> >
>

Re: [DISCUSS] Proposing Griffin for Apache incubator

Posted by "John D. Ament" <jo...@apache.org>.
Ah shoot, ok :-)
I'm used to seeing it next to the committers' names.  I guess that works
just as well.

Are the mentors all eBay as well?

John

On Wed, Nov 23, 2016 at 8:04 PM Henry Saputra <he...@gmail.com>
wrote:

> Hi John,
>
> We have added this comment in the proposal:
>
> "
> The initial committers are employees of eBay Inc.
> "
>
> - Henry
>
> On Wed, Nov 23, 2016 at 4:50 PM, John D. Ament <jo...@apache.org>
> wrote:
>
> > Henry,
> >
> > Can you add initial committer affiliations to the proposal?
> >
> > John
> >

Re: [DISCUSS] Proposing Griffin for Apache incubator

Posted by Henry Saputra <he...@gmail.com>.
Hi John,

We have added this comment in the proposal:

"
The initial committers are employees of eBay Inc.
"

- Henry

On Wed, Nov 23, 2016 at 4:50 PM, John D. Ament <jo...@apache.org>
wrote:

> Henry,
>
> Can you add initial committer affiliations to the proposal?
>
> John
>

Re: [DISCUSS] Proposing Griffin for Apache incubator

Posted by "John D. Ament" <jo...@apache.org>.
Henry,

Can you add initial committer affiliations to the proposal?

John


Re: [DISCUSS] Proposing Griffin for Apache incubator

Posted by 吕 志兴 <lu...@hotmail.com>.
eBay legal already reviewed this name before it was open sourced on GitHub; there's no trademark problem for Griffin.

Thx

Alex

> On 24 Nov 2016, at 4:57 PM, Jochen Theodorou <bl...@gmx.org> wrote:
> 
> Just a remark on the name: Griffin is used in a lot of company names and is a family name. That may create trademark problems, but I have not done any research on that.
> 
> I am also not happy about Griffin and Griffon being so close together, even though the latter project most likely has no trademark as such.
> 
>> On 24.11.2016 00:30, Henry Saputra wrote:
>> Hi All,
>> 
>> As the champion for Griffin, I would like to bring up discussion to
>> bring the project as Apache incubator podling.
>> 
>> Here is the direct quote from the abstract:
>> 
>> "
>> Griffin is a Data Quality Service platform built on Apache Hadoop and
>> Apache Spark. It provides a framework process for defining data
>> quality model, executing data quality measurement, automating data
>> profiling and validation, as well as a unified data quality
>> visualization across multiple data systems. It tries to address the
>> data quality challenges in big data and streaming context.
>> "
>> 
>> Here is the link to the proposal:
>> https://wiki.apache.org/incubator/GriffinProposal
>> 
>> I have copied the proposal below for easy access
>> 
>> 
>> Thanks,
>> 
>> - Henry
>> 
>> 
>> Griffin Proposal
>> 
>> Abstract
>> 
>> Griffin is a Data Quality Service platform built on Apache Hadoop and
>> Apache Spark. It provides a framework process for defining data
>> quality model, executing data quality measurement, automating data
>> profiling and validation, as well as a unified data quality
>> visualization across multiple data systems. It tries to address the
>> data quality challenges in big data and streaming context.
>> 
>> Proposal
>> 
>> Griffin is an open source Data Quality solution for distributed data
>> systems at any scale, in both streaming and batch data contexts. When
>> people use open source products (e.g. Apache Hadoop, Apache Spark,
>> Apache Kafka, Apache Storm), they always need a data quality service
>> to build confidence in the quality of the data those platforms
>> process. Griffin creates a unified process to define and construct
>> data quality measurement pipelines across multiple data systems to
>> provide:
>> 
>> Automatic quality validation of the data
>> Data profiling and anomaly detection
>> Data quality lineage from upstream to downstream data systems.
>> Data quality health monitoring visualization
>> Shared infrastructure resource management
>> 
>> Overview of Griffin
>> 
>> Griffin has been deployed in production at eBay serving major data
>> systems. It takes a platform approach, providing generic features to
>> solve common data quality validation pain points. First, a user
>> registers the data asset on which they want to run data quality
>> checks. The data asset can be batch data in an RDBMS (e.g. Teradata),
>> an Apache Hadoop system, or near real-time streaming data from Apache
>> Kafka, Apache Storm and other real-time data platforms. Second, the
>> user creates a data quality model to define the data quality rules and
>> metadata. Third, the model or rule is executed automatically (by the
>> model engine) to produce sample data quality validation results within
>> a few seconds for streaming data. Finally, the user can analyze the
>> data quality results through the built-in visualization tool and take
>> action.
>> 
>> Griffin includes:
>> 
>> Data Quality Model Engine
>> 
>> Griffin is a model-driven solution: the user chooses among various
>> data quality dimensions to execute data quality validation based on a
>> selected target data set or source data set (as the golden reference
>> data). A corresponding back-end library supports each of the following
>> measurements (a sketch of one such measure follows the list below):
>> 
>> Accuracy - Does data reflect the real-world objects or a verifiable source
>> Completeness - Is all necessary data present
>> Validity - Are all data values within the data domains specified by the business
>> Timeliness - Is the data available at the time needed
>> Anomaly detection - Pre-built algorithm functions for the
>> identification of items, events or observations which do not conform
>> to an expected pattern or other items in a dataset
>> Data Profiling - Apply statistical analysis and assessment of data
>> values within a dataset for consistency, uniqueness and logic.
>> 
>> Data Collection Layer
>> 
>> We support two kinds of data sources, batch data and real time data.
>> 
>> For batch mode, we can collect data source from Apache Hadoop based
>> platform by various data connectors.
>> 
>> For real time mode, we can connect with messaging system like Kafka to
>> near real time analysis.
>> 
>> Data Process and Storage Layer
>> 
>> For batch analysis, our data quality model will compute data quality
>> metrics in our spark cluster based on data source in Apache Hadoop.
>> 
>> For near real time analysis, we consume data from messaging system,
>> then our data quality model will compute our real time data quality
>> metrics in our spark cluster. for data storage, we use time series
>> database in our back end to fulfill front end request.
>> 
>> Griffin Service
>> 
>> We have RESTful web services to accomplish all the functionalities of
>> Griffin, such as register data asset, create data quality model,
>> publish metrics, retrieve metrics, add subscription, etc. So, the
>> developers can develop their own user interface based on these web
>> services.
>> 
>> Background
>> 
>> At eBay, when people play with big data in Apache Hadoop (or other
>> streaming data), data quality often becomes one big challenge.
>> Different teams have built customized data quality tools to detect and
>> analyze data quality issues within their own domain. We are thinking
>> to take a platform approach to provide shared Infrastructure and
>> generic features to solve common data quality pain points. This would
>> enable us to build trusted data assets.
>> 
>> Currently it’s very difficult and costly to do data quality validation
>> when we have big data flow across multi-platforms at eBay (e.g.
>> Oracle, Apache Hadoop, Couchbase, Apache Cassandra, Apache Kafka,
>> MongoDB). Take eBay real time personalization platform as an example.
>> Every day we have to validate data quality status for ~600M records (
>> imagine we have 150M active users for our website). Data quality often
>> becomes one big challenge both in its streaming and batch pipelines.
>> 
>> So we conclude 3 data quality problems at eBay:
>> 
>> Lack of end2end unified view of data quality measurement from multiple
>> data sources to target applications, it usually takes a long time to
>> identify and fix poor data quality.
>> How to get data quality measured in streaming mode, we need to have a
>> process and tool to visualize data quality insights through
>> registering dataset which you want to check data quality, creating
>> data quality measurement model, executing the data quality validation
>> job and getting metrics insights for action taking.
>> No Shared platform and API Service, have to apply and manage own
>> hardware and software infrastructure.
>> 
>> Rationale
>> 
>> The challenge we face at eBay is that our data volume is becoming
>> bigger and bigger, system processes become more complex, while we do
>> not have a unified data quality solution to ensure the trusted data
>> sets which provide confidences on data quality to our data consumers.
>> The key challenges on data quality includes:
>> 
>> Existing commercial data quality solution cannot address data quality
>> lineage among systems, cannot scale out to support fast growing data
>> at eBay
>> Existing eBay's domain specific tools take a long time to identify and
>> fix poor data quality when data flowed through multiple systems
>> Business logic becomes complex, requires data quality system much flexible.
>> 
>> Some data quality issues do have business impact on user experiences,
>> revenue, efficiency & compliance.
>> 
>> Communication overhead of data quality metrics, typically in a big
>> organization, which involve different teams.
>> 
>> The idea of Griffin is to provide Data Quality validation as a
>> Service, to allow data engineers and data consumers to have:
>> 
>> Near real-time understanding of the data quality health of your data
>> pipelines with end-to-end monitoring, all in one place.
>> Profiling, detecting and correlating issues and providing
>> recommendations that drive rapid and focused troubleshooting
>> A centralized data quality model management system including rule,
>> metadata, scheduler etc.
>> Native code generation to run everywhere, including Hadoop, Kafka, Spark, etc.
>> One set of tools to build data quality pipelines across all eBay data platforms.
>> 
>> Current Status
>> 
>> Meritocracy
>> 
>> Griffin has been deployed in production at eBay and provided the
>> centralized data quality service for several eBay systems ( for
>> example, real time personalization platform, eBay real time ID linking
>> platform, Hadoop datasets, Site speed analytics platform). Our aim is
>> to build a diverse developer and user community following the Apache
>> meritocracy model. We will encourage contributions and participation
>> of all types of work, and ensure that contributors are appropriately
>> recognized.
>> 
>> Community
>> 
>> Currently the project is being developed at eBay. It's only for eBay
>> internal community. Griffin seeks to develop the developer and user
>> communities during incubation. We believe it will grow substantially
>> by becoming an Apache project.
>> 
>> Core Developers
>> 
>> Griffin is currently being designed and developed by engineers from
>> eBay Inc. – William Guo, Alex Lv, Shawn Sha, Vincent Zhao, John Liu.
>> All of these core developers have deep expertise in Apache Hadoop and
>> the Hadoop Ecosystem in general.
>> 
>> Alignment
>> 
>> The ASF is a natural host for Griffin given that it is already the
>> home of Hadoop, Beam, HBase, Hive, Storm, Kafka, Spark and other
>> emerging big data products. Those are requiring data quality solution
>> by nature to ensure the data quality which they processed. When people
>> use open source data technology, the big question to them is that how
>> we can ensure the data quality in it. Griffin leverages lot of Apache
>> open-source products. Griffin was designed to enable real time
>> insights into data quality validation by shared Infrastructure and
>> generic features to solve common data quality pain points.
>> 
>> Known Risks
>> 
>> Orphaned Products
>> 
>> The core developers of Griffin team work full time on this project.
>> There is no risk of Griffin getting orphaned since at least one large
>> company (eBay) is extensively using it in their production Hadoop and
>> Spark clusters for multiple data systems. For example, currently there
>> are 4 data systems at eBay (real time personalization platform, eBay
>> real time ID linking platform, Hadoop, Site speed analytics platform)
>> are leveraging Griffin, with more than ~600M records for data quality
>> status validation every day, 35 data sets being monitored, 50+ data
>> quality models have been created.
>> 
>> As Griffin is designed to connect many types of data sources, we are
>> very confident that they will use Griffin as a service for ensuring
>> the data quality in open source data ecosystems. We plan to extend and
>> diversify this community further through Apache.
>> 
>> Inexperience with Open Source
>> 
>> Griffin's core engineers are all active users and followers of open
>> source projects. They are already committers and contributors to the
>> Griffin Github project. All have been involved with the source code
>> that has been released under an open source license, and several of
>> them also have experience developing code in an open source
>> environment. Though the core set of Developers do not have Apache Open
>> Source experience, there are plans to onboard individuals with Apache
>> open source experience on to the project.
>> 
>> Homogenous Developers
>> 
>> The core developers are from eBay. Apache Incubation process
>> encourages an open and diverse meritocratic community. Griffin intends
>> to make every possible effort to build a diverse, vibrant and involved
>> community. We are committed to recruiting additional committers from
>> other companies based on their contribution to the project.
>> 
>> Reliance on Salaried Developers
>> 
>> eBay invested in Griffin as a company-wide data quality service
>> platform and some of its key engineers are working full time on the
>> project. they are all paid by eBay. We look forward to other Apache
>> developers and researchers to contribute to the project.
>> 
>> Relationships with Other Apache Products
>> 
>> Griffin has a strong relationship and dependency with Apache Hadoop,
>> Apache HBase, Apache Spark, Apache Kafka and Apache Storm, Apache
>> Hive. In addition, since there is a growing need for data quality
>> solution for open source platform (e.g. Hadoop, Kafka, Spark etc),
>> being part of Apache’s Incubation community, could help with a closer
>> collaboration among these four projects and as well as others.
>> 
>> Documentation
>> 
>> Information about Griffin can be found at https://github.com/eBay/griffin
>> 
>> Initial Source
>> 
>> Griffin has been under development since early 2016 by a team of
>> engineers at eBay Inc. It is currently hosted on Github.com under an
>> Apache license 2.0 at https://github.com/eBay/griffin . Once in
>> incubation we will be moving the code base to apache git library.
>> 
>> External Dependencies
>> 
>> Griffin has the following external dependencies.
>> 
>> Basic
>> 
>> JDK 1.7+
>> Scala
>> Apache Maven
>> JUnit
>> Log4j
>> Slf4j
>> Apache Commons
>> 
>> Hadoop
>> 
>> Apache Hadoop
>> Apache HBase
>> Apache Hive
>> 
>> DB
>> 
>> InfluxData
>> 
>> Apache Spark
>> 
>> Spark Core Library
>> 
>> REST Service
>> 
>> Jersey
>> Spring MVC
>> 
>> Web frontend
>> 
>> AngularJS
>> jQuery
>> Bootstrap
>> RequireJS
>> eCharts
>> Font Awesome
>> 
>> Cryptography
>> 
>> Currently there's no cryptography in Griffin.
>> 
>> Required Resources
>> 
>> Mailing List
>> 
>> We currently use eBay mail box to communicate, but we'd like to move
>> that to ASF maintained mailing lists.
>> 
>> Current mailing list: ebay-griffin-devs@googlegroups.com
>> 
>> Proposed ASF maintained lists:
>> 
>> private@griffin.incubator.apache.org
>> 
>> dev@griffin.incubator.apache.org
>> 
>> commits@griffin.incubator.apache.org
>> 
>> Subversion Directory
>> 
>> Git is the preferred source control system.
>> 
>> Issue Tracking
>> 
>> JIRA
>> 
>> Other Resources
>> 
>> The existing code already has unit tests so we will make use of
>> existing Apache continuous testing infrastructure. The resulting load
>> should not be very large.
>> 
>> Initial Committers
>> 
>> William Go
>> Alex Lv
>> Vincent Zhao
>> Shawn Sha
>> John Liu
>> Liang Shao
>> 
>> Affiliations
>> 
>> The initial committers are employees of eBay Inc.
>> 
>> Sponsors
>> 
>> Champion
>> 
>> Henry Saputra (hsaputra@apache.org)
>> 
>> Nominated Mentors
>> 
>> Kasper Sørensen (kaspersor@apache.org)
>> 
>> Uma Maheswara Rao Gangumalla (umamahesh@apache.org)
>> 
>> Luciano Resende (luckbr1975@gmail.com)
>> 
>> Sponsoring Entity
>> 
>> We are requesting the Incubator to sponsor this project.
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
>> For additional commands, e-mail: general-help@incubator.apache.org
>> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> For additional commands, e-mail: general-help@incubator.apache.org
> 

Re: [DISCUSS] Proposing Griffin for Apache incubator

Posted by Jochen Theodorou <bl...@gmx.org>.
Just a remark on the name: Griffin is used in a lot of company names
and is also a family name. That may cause trademark problems, but I
have not done any research on that.

I am also not happy about Griffin and Griffon being so close to each
other, even though the latter project most likely has no trademark as
such.
