Posted to general@incubator.apache.org by Jean-Baptiste Onofré <jb...@nanthrax.net> on 2016/05/19 03:52:52 UTC
[DISCUSS] CarbonData incubation proposal
Hi all,
We would like to discuss a new proposal for the incubator: CarbonData.
CarbonData is a new Apache Hadoop native file format for faster
interactive queries. It uses advanced columnar storage, indexing,
compression, and encoding techniques to improve computing efficiency,
which in turn helps speed up queries by an order of magnitude over
petabytes of data.
The proposal is included below and also available on the wiki:
https://wiki.apache.org/incubator/CarbonDataProposal
Please provide any feedback or comments.
Thanks!
Regards
JB
= Apache CarbonData =
== Abstract ==
Apache CarbonData is a new Apache Hadoop native file format for faster
interactive queries. It uses advanced columnar storage, indexing,
compression, and encoding techniques to improve computing efficiency,
which in turn helps speed up queries by an order of magnitude over
petabytes of data.
CarbonData github address: https://github.com/HuaweiBigData/carbondata
== Background ==
Huawei is an ICT solution provider committed to enhancing customer
experiences for telecom carriers, enterprises, and consumers on big
data. To satisfy the following customer requirements, we created a new
Hadoop native file format:
* Support interactive OLAP-style queries over big data in seconds.
* Support fast queries on individual records, which require touching
all fields.
* Support fast data loading and incremental loads at intervals of
minutes.
* Support HDFS so that customers can leverage existing Hadoop clusters.
* Support time-based data retention.
Based on these requirements, we investigated existing file formats in
the Hadoop eco-system, but we could not find a suitable solution that
satisfies all of the requirements at the same time, so we started
designing CarbonData.
== Rationale ==
CarbonData contains multiple modules, which are classified into two
categories:
1. CarbonData File Format: contains the core implementation of the file
format, such as columnar storage, index, dictionary, encoding and
compression, and the API for reading/writing.
2. CarbonData integration with big data processing frameworks such as
Apache Spark and Apache Hive. Apache Beam integration is also planned,
to abstract the execution runtime.
=== CarbonData File Format ===
The CarbonData file format is a columnar store in HDFS. It has many of
the features of a modern columnar format, such as splittability,
compression schemes, and complex data types. CarbonData also has the
following unique features:
==== Indexing ====
To support fast interactive queries, CarbonData leverages indexing
technology to reduce I/O scans. A CarbonData file stores data along with
its index: the index is not stored separately, but is contained in the
CarbonData file itself. In the current implementation, CarbonData
supports 3 types of indexing:
1. Multi-dimensional Key (B+ Tree index)
The data blocks are written to disk in sequence, and within each data
block each column block is written in sequence. Finally, the metadata
block for the file is written with the byte positions of each block in
the file, the Min-Max statistics index, and the start and end MDK
(multi-dimensional key) of each data block. Since all data in the file
is in sorted order, the start and end MDK of each data block can be used
to construct a B+Tree, and the file can be logically represented as a
B+Tree with the data blocks as leaf nodes (on disk) and the remaining
non-leaf nodes in memory.
2. Inverted index
Inverted indexes are widely used in search engines. This index helps
the processing/query engine filter inside a single HDFS block.
Furthermore, combining bitmaps with the inverted index at query time
makes it possible to accelerate count-distinct-like operations.
3. MinMax index
A MinMax index is created for all columns so that the processing/query
engine can skip scans that are not required.
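To illustrate how the MDK and MinMax indexes reduce I/O, here is a
minimal sketch in Python. It is not CarbonData code: the Block class,
the function names, and the sample keys and values are assumptions made
up for this example, and a binary search over the block end keys stands
in for walking the in-memory non-leaf nodes of the B+Tree.

```python
from bisect import bisect_left
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical per-block metadata kept in the file footer."""
    start_key: tuple  # first multi-dimensional key (MDK) in the block
    end_key: tuple    # last MDK in the block
    col_min: dict     # per-column minimum value
    col_max: dict     # per-column maximum value


def blocks_for_key(blocks, key):
    """Return indexes of blocks whose [start_key, end_key] may contain key.

    Blocks are sorted by key, so only the end keys need to be searched.
    """
    end_keys = [b.end_key for b in blocks]
    i = bisect_left(end_keys, key)
    hits = []
    while i < len(blocks) and blocks[i].start_key <= key:
        hits.append(i)
        i += 1
    return hits


def blocks_for_range(blocks, column, low, high):
    """Min-max pruning: keep blocks whose [min, max] overlaps [low, high]."""
    return [
        i for i, b in enumerate(blocks)
        if b.col_max[column] >= low and b.col_min[column] <= high
    ]


# Sample metadata for three blocks sorted on the key (country, year).
blocks = [
    Block(("CN", 2014), ("CN", 2016), {"sales": 10}, {"sales": 90}),
    Block(("DE", 2013), ("FR", 2015), {"sales": 5}, {"sales": 40}),
    Block(("IN", 2012), ("US", 2016), {"sales": 50}, {"sales": 300}),
]

print(blocks_for_key(blocks, ("DE", 2014)))          # [1]
print(blocks_for_range(blocks, "sales", 100, 200))   # [2]
```

In both lookups only the listed blocks would need to be read from disk;
the other blocks are skipped using metadata alone.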
==== Global Dictionary ====
Besides I/O reduction, CarbonData accelerates computation by using a
global dictionary, which enables processing/query engines to perform all
processing on encoded data without having to convert the data (Late
Materialization). We have observed dramatic performance improvements in
OLAP analytic scenarios where a table contains many columns of string
data type. The data is converted back to user-readable form just before
the processing/query engine returns results to the user.
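As a rough illustration of the idea, the following Python sketch encodes
a string column with a dictionary, filters and aggregates on the integer
codes, and decodes only the final result. The function names and sample
data are assumptions for this example and do not reflect the actual
CarbonData implementation.

```python
def build_dictionary(values):
    """Assign each distinct string a stable integer code."""
    return {v: i for i, v in enumerate(sorted(set(values)))}


def encode(values, dictionary):
    """Replace each string with its integer code."""
    return [dictionary[v] for v in values]


cities = ["Shenzhen", "Paris", "Shenzhen", "Bangalore", "Paris", "Shenzhen"]
dictionary = build_dictionary(cities)
encoded = encode(cities, dictionary)
reverse = {code: value for value, code in dictionary.items()}

# Filtering compares integers, not strings.
target = dictionary["Shenzhen"]
matching_rows = [row for row, code in enumerate(encoded) if code == target]

# Aggregation also works on the codes only.
counts = {}
for code in encoded:
    counts[code] = counts.get(code, 0) + 1

# Late materialization: decode only the final, small result.
result = {reverse[code]: n for code, n in counts.items()}

print(matching_rows)  # [0, 2, 5]
print(result)         # {'Shenzhen': 3, 'Paris': 2, 'Bangalore': 1}
```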
==== Column Group ====
Sometimes users want to run processing/queries over multiple columns of
one table, for example, scanning for an individual record in a
troubleshooting scenario. In this case, a row format is more efficient
than a columnar format, since all columns will be touched by the
workload. To accelerate this, CarbonData supports storing a group of
columns in row format, so that the data in a column group is stored
together and can be retrieved quickly.
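The layout difference can be sketched in a few lines of Python. This is
only an illustration under assumed names and sample data, not the
on-disk CarbonData layout: two columns are stored as a row-wise column
group, while the others stay columnar.

```python
rows = [
    {"id": 1, "imsi": "460-00-1", "cell": "A1", "bytes": 120},
    {"id": 2, "imsi": "460-00-2", "cell": "B7", "bytes": 300},
    {"id": 3, "imsi": "460-00-3", "cell": "A1", "bytes": 80},
]


def to_columnar(rows, columns):
    """One list per column: good for scans over a single column."""
    return {c: [r[c] for r in rows] for c in columns}


def to_column_group(rows, columns):
    """One tuple per row: the grouped columns are stored together."""
    return [tuple(r[c] for c in columns) for r in rows]


columnar = to_columnar(rows, ["id", "bytes"])
group = to_column_group(rows, ["imsi", "cell"])

# Fetching a record needs one lookup for all grouped columns.
record = group[1]
# A scan over one column still reads only that column.
total_bytes = sum(columnar["bytes"])

print(record)       # ('460-00-2', 'B7')
print(total_bytes)  # 500
```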
==== Optimized for multiple use cases ====
CarbonData indices and dictionaries are highly configurable. To optimize
storage for different use cases, users can configure what to index, and
so decide on and tune the format before loading data into CarbonData.
For example:
|| Use Case || Supporting Features ||
|| Interactive OLAP query || Columnar format, Multi-dimensional Key (B+ Tree index), Minmax index, Inverted index ||
|| High throughput scan || Global dictionary, Minmax index ||
|| Low latency point query || Multi-dimensional Key (B+ Tree index), Partitioning ||
|| Individual record query || Column group, Global dictionary ||
=== BigData Processing Framework Integration ===
* CarbonData provides InputFormat/OutputFormat interfaces for
reading/writing data from CarbonData files, and at the same time
provides an abstract API for processing data stored in the CarbonData
format with data processing frameworks.
* CarbonData provides deep integration with Apache Spark, including
predicate push down, column pruning, aggregation push down, etc., so
users can use Spark SQL to connect to and query CarbonData.
* CarbonData can integrate with various big data query/processing
frameworks in the Hadoop eco-system, such as Apache Spark, Apache Hive,
etc.
Example:
https://github.com/HuaweiBigData/carbondata/blob/master/examples/src/main/scala/org/carbondata/examples/CarbonExample.scala
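For readers unfamiliar with the terms, predicate push down and column
pruning can be sketched in plain Python as follows. The scan function
and the sample table are hypothetical and are not the CarbonData or
Spark API; the point is only that the reader receives the filter and the
column list, and returns nothing else.

```python
def scan(table, columns, predicate=None):
    """Hypothetical reader over a columnar table (dict of column lists).

    predicate is an optional (column, test) pair evaluated inside the
    reader, so rejected rows are never handed to the caller.
    """
    row_count = len(next(iter(table.values())))
    for i in range(row_count):
        if predicate is not None:
            column, test = predicate
            if not test(table[column][i]):
                continue
        # Column pruning: only the requested columns are materialized.
        yield tuple(table[c][i] for c in columns)


table = {
    "imsi": ["460-00-1", "460-00-2", "460-00-3"],
    "cell": ["A1", "B7", "A1"],
    "bytes": [120, 300, 80],
}

selected = list(scan(table, ["cell"], ("bytes", lambda v: v > 100)))
print(selected)  # [('A1',), ('B7',)]
```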
== Initial Goals ==
Our initial goals are to bring CarbonData into the ASF, transition
internal engineering processes into the open, and foster a collaborative
development model according to the "Apache Way".
== Current Status ==
CarbonData is production ready and already provides a large set of features.
The current license is already Apache 2.0.
== Meritocracy ==
We intend to radically expand the initial developer and user community
by running the project in accordance with the "Apache Way". Users and
new contributors will be treated with respect and welcomed. By
participating in the community and providing quality patches/support
that move the project forward, they will earn merit. They also will be
encouraged to provide non-code contributions (documentation, events,
community management, etc.) and will gain merit for doing so. Those with
a proven support and quality track record will be encouraged to become
committers.
== Community ==
If CarbonData is accepted for incubation, the primary initial goal is to
build a large community. We truly believe that CarbonData will become a
key project for big data column-like platforms, and so we are betting on
a large community of users and developers.
== Known Risks ==
Development has been sponsored mostly by one company. For the project
to fully transition to the Apache Way governance model, development must
shift towards the meritocracy-centric model of growing a community of
contributors, balanced with the needs for extreme stability and core
implementation coherency.
== Orphaned products ==
Huawei is fully committed to CarbonData. Moreover, Huawei has a vested
interest in making CarbonData succeed by driving its close integration
with sister ASF projects. We expect this to further reduce the risk of
orphaning the product.
== Inexperience with Open Source ==
Huawei has been developing and using open source software for a long
time. Additionally, several ASF veterans have agreed to mentor the project
and are listed in this proposal. The project will rely on their guidance
and collective wisdom to quickly transition the entire team of initial
committers towards practicing the Apache Way.
== Reliance on Salaried Developers ==
Most of the contributors are paid to work in the big data space. While
they might wander from their current employers, they are unlikely to
venture far from their core expertise and thus will continue to be
engaged with the project regardless of their current employers.
== An Excessive Fascination with the Apache Brand ==
While we intend to leverage the Apache ‘branding’ when talking to other
projects as a testament of our project’s ‘neutrality’, we have no plans
for making use of the Apache brand in press releases nor posting
billboards advertising acceptance of CarbonData into the Apache Incubator.
== Initial Source ==
https://github.com/HuaweiBigData/carbondata.git
== External Dependencies ==
All external dependencies are licensed under the Apache 2.0 license or
an Apache-compatible license. As we grow the CarbonData community, we
will configure our build process to require and validate that all
contributions and dependencies are licensed under the Apache 2.0 license
or an Apache-compatible license.
* Apache Spark
* Apache Hadoop
* Apache Maven
* Apache Commons
* Apache Log4j
* Apache Thrift
* Apache Zookeeper
* Scala
* Snappy
* Kettle (Pentaho)
* Eigenbase
* Fastutil
* GSON
* Jmockit
* Junit
== Required Resources ==
=== Mailing lists ===
* private@carbondata.incubator.apache.org (moderated subscriptions)
* commits@carbondata.incubator.apache.org
* dev@carbondata.incubator.apache.org
* issues@carbondata.incubator.apache.org
=== Git Repository ===
* https://git-wip-us.apache.org/repos/asf/incubator-carbondata.git
=== Issue Tracking ===
* JIRA Project CarbonData (CarbonData)
=== Initial Committers ===
* Liang Chenliang
* Jean-Baptiste Onofré
* Henry Saputra
* Uma Maheswara Rao G
* Jenny MA
* Jacky Likun
* Vimal Das Kammath
* Jarray Qiuheng
=== Affiliations ===
* Huawei: Liang Chenliang
* Talend: Jean-Baptiste Onofré
* eBay: Henry Saputra
* Intel: Uma Maheswara Rao G
=== Sponsors ===
=== Champion ===
* Jean-Baptiste Onofré - Apache Member
=== Mentors ===
* Henry Saputra (eBay)
* Jean-Baptiste Onofré (Talend)
* Uma Maheswara Rao G (Intel)
=== Sponsoring Entity ===
The Apache Incubator
---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
For additional commands, e-mail: general-help@incubator.apache.org
Re: [DISCUSS] CarbonData incubation proposal
Posted by "Gangumalla, Uma" <um...@intel.com>.
+1 (binding)
Regards,
Uma
Re: [DISCUSS] CarbonData incubation proposal
Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Fully agree!
Regards
JB
On 05/23/2016 05:31 PM, Luke Han wrote:
> Hi Jean-Baptiste,
> My point is not that such things have to be done before the vote... it's
> my suggestion for your team to continue working on that.
>
> And I would like to say that a proposal for incubation is not only about
> code donation; it's better to have more discussion to give the community
> a clear picture of the project's purpose, design, community, and so on.
> There are many people who may be interested in joining, so it's a good
> time for you to engage contributors now ;-)
>
> Thanks.
> Luke
--
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com
Re: [DISCUSS] CarbonData incubation proposal
Posted by Luke Han <lu...@gmail.com>.
Hi Jean-Baptiste,
My point is not that such things have to be done before the vote... it's
my suggestion for your team to continue working on that.
And I would like to say that a proposal for incubation is not only about
code donation; it's better to have more discussion to give the community
a clear picture of the project's purpose, design, community, and so on.
There are many people who may be interested in joining, so it's a good
time for you to engage contributors now ;-)
Thanks.
Luke
Best Regards!
---------------------
Luke Han
On Sun, May 22, 2016 at 10:57 PM, Jean-Baptiste Onofré <jb...@nanthrax.net>
wrote:
> Hi Luke,
>
> I fully agree with you. The committers are already involved to clean-up
> the repo (PRs have been created).
>
> IMHO, this step is decoupled from the proposal vote itself: the only
> requirement is to do it for the code donation, after the proposal vote.
>
> Regards
> JB
>
>
> On 05/21/2016 08:48 AM, Luke Han wrote:
>
>> Would love to see that Huawei has finally decided to open source and
>> contribute this project to the ASF.
>>
>> As in the previous discussion, the license should be very clear; I think
>> you have a lot of work to do :)
>>
>> Thanks.
>>
>>
>> Best Regards!
>> ---------------------
>>
>> Luke Han
>>
>> On Thu, May 19, 2016 at 11:46 PM, Jacky Li <13...@qq.com> wrote:
>>
>>> Hi Julian Hyde,
>>>
>>> Yes, you are correct, thanks for pointing this out. In the early days
>>> of the CarbonData project, it was inspired by Mondrian. Mondrian is a
>>> great OLAP project that we have learned much from.
>>>
>>> The code you are referring to ("CarbonDef.java, DimensionType.java,
>>> LevelType.java"), I believe, was used in an earlier version of
>>> CarbonData but is no longer used in the current version. There are
>>> quite a few packages that are no longer needed but are still present
>>> in the repo, so we are planning to clean up the code base soon.
>>>
>>> Definitely, you are right; we will make sure all source code is under
>>> the Apache License only.
>>>
>>> Regards,
>>> Jacky Li
> --
> Jean-Baptiste Onofré
> jbonofre@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
Re: [DISCUSS] CarbonData incubation proposal
Posted by "Gangumalla, Uma" <um...@intel.com>.
Since Parquet is an ASF project, adding a reference makes sense to me.
Yes, I think the Parquet project folks can comment on whether it makes
sense for them.
>With the usual precautions of a BD scan on the incoming
>IP and ongoing diligence by the PPMC we'll be fine.
Thanks, Julian, for this point. So this should be an action item for the
PPMC, and we should be good to vote on the proposal, I guess, right?
I appreciate the reviews. We can list any other points to fix before the
code movement. Thanks, Jacky, for cleaning up the unused or refined
stuff.
Regards,
Uma
On 5/23/16, 4:19 PM, "Henry Saputra" <he...@gmail.com> wrote:
>I thought the concern had been addressed?
>
>For Julian's concern about Mondrian, the code was inspired by Mondrian
>but does not have direct derivatives of the code.
>According to Jacky, the old code is no longer used.
>
>As for Julien's concern about Parquet, the design seemed to be inspired
>by Parquet and ORC.
>And if needed, we could add a reference to Parquet in the code
>documentation.
>Since Parquet is an ASF project, I believe we are in good shape if
>CarbonData goes to the ASF.
>
>If any other action item is needed, please do suggest it so we can make
>corrections as part of the incubation process.
>
>
>- Henry
>
>On Mon, May 23, 2016 at 4:12 PM, Roman Shaposhnik <ro...@shaposhnik.org> wrote:
>
>> On Mon, May 23, 2016 at 3:44 PM, Marvin Humphrey <ma...@rectangular.com> wrote:
>> > On Sun, May 22, 2016 at 10:57 PM, Jean-Baptiste Onofré <jb...@nanthrax.net> wrote:
>> >> Hi Luke,
>> >>
>> >> I fully agree with you. The committers are already involved to clean-up
>> >> the repo (PRs have been created).
>> >>
>> >> IMHO, this step is decoupled from the proposal vote itself: the only
>> >> requirement is to do it for the code donation, after the proposal vote.
>> >
>> > What will the process for this be? On this thread we have two outside
>> > authors recognizing their own work, but that's obviously not a
>> > realistic mechanism for identifying all potentially problematic IP.
>>
>> Given that this is a donation from a corporate entity, a request for BD
>> (or similar) scan results (if they are available) may help. That's how
>> every corporate-sponsored donation (at least the dozens I've been
>> involved in) does risk mitigation anyway.
>>
>> Thanks,
>> Roman.
Re: [DISCUSS] CarbonData incubation proposal
Posted by JihongMa <ji...@gmail.com>.
Thank you Julian!!
We are going to prepare a BD scan result to ensure source code clearance
is done properly, and will keep doing a diligent job in this regard as
an ongoing effort as we move forward.
Regards.
Jihong
--
View this message in context: http://apache-incubator-general.996316.n3.nabble.com/DISCUSS-CarbonData-incubation-proposal-tp49643p49756.html
Sent from the Apache Incubator - General mailing list archive at Nabble.com.
Re: [DISCUSS] CarbonData incubation proposal
Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Hi,
+1, I asked the current contributors to review and check all code in
order to clean up and identify "used/inspired/derived" code.
On the other hand, I also told Liang that an SGA will be required.
I think we can start a vote and address the code cleanup in the meantime.
Regards
JB
On 05/24/2016 02:15 AM, Julian Hyde wrote:
> For the record, at the time that I reviewed the github repo, there was
> code that was not merely *inspired* by Mondrian code, but *derived*
> from Mondrian code. But that code has since been removed, so the issue
> is resolved. With the usual precautions of a BD scan on the incoming
> IP and ongoing diligence by the PPMC we'll be fine.
>
> Julian
>
>
> On Mon, May 23, 2016 at 4:19 PM, Henry Saputra <he...@gmail.com> wrote:
>> I thought the concern had been addressed?
>>
>> For Julian concern about Mondrian, the code was inspired by Mondrian but do
>> not have direct derivatives of the code.
>> According to Jacky, the old code is no longer used.
>>
>> As for Julien concern about Parquet, the design seemed to be inspired by
>> Parquet and ORC.
>> And if needed, we could add reference to Parquet in the code documentation.
>> Since Parquet is ASF project, I believe we are in good shape in CarbonData
>> goes to ASF.
>>
>> If any other action item is needed please do suggest so we could make
>> correction as part of incubation process.
>>
>>
>> - Henry
>>
>> On Mon, May 23, 2016 at 4:12 PM, Roman Shaposhnik <ro...@shaposhnik.org>
>> wrote:
>>
>>> On Mon, May 23, 2016 at 3:44 PM, Marvin Humphrey <ma...@rectangular.com>
>>> wrote:
>>>> On Sun, May 22, 2016 at 10:57 PM, Jean-Baptiste Onofré <jb...@nanthrax.net>
>>> wrote:
>>>>> Hi Luke,
>>>>>
>>>>> I fully agree with you. The committers are already involved to clean-up
>>> the
>>>>> repo (PRs have been created).
>>>>>
>>>>> IMHO, this step is decoupled from the proposal vote itself: the only
>>>>> requirement is to do it for the code donation, after the proposal vote.
>>>>
>>>> What will the process for this be? On this thread we have two outside
>>>> authors recognizing their own work, but that's obviously not a
>>>> realistic mechanism for identifying all potentially problematic IP.
>>>
>>> Given that this is a donation from a corporate entity a request for BD
>>> (or similar)
>>> scan results (if they are available) may help. That's how every
>>> corporate-sponsored
>>> donation (at least dozens I've been involved in) does risk mitigation
>>> anyway.
>>>
>>> Thanks,
>>> Roman.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
>>> For additional commands, e-mail: general-help@incubator.apache.org
>>>
>>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> For additional commands, e-mail: general-help@incubator.apache.org
>
--
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com
Re: [DISCUSS] CarbonData incubation proposal
Posted by Henry Saputra <he...@gmail.com>.
+1 to that, Julian. You were absolutely right; I apologize that I did not
make it clear.
Really appreciate another set of eyes reviewing it.
- Henry
On Mon, May 23, 2016 at 5:15 PM, Julian Hyde <jh...@apache.org> wrote:
> For the record, at the time that I reviewed the github repo, there was
> code that was not merely *inspired* by Mondrian code, but *derived*
> from Mondrian code. But that code has since been removed, so the issue
> is resolved. With the usual precautions of a BD scan on the incoming
> IP and ongoing diligence by the PPMC we'll be fine.
>
> Julian
>
>
> On Mon, May 23, 2016 at 4:19 PM, Henry Saputra <he...@gmail.com>
> wrote:
> > I thought the concern had been addressed?
> >
> > For Julian concern about Mondrian, the code was inspired by Mondrian but
> do
> > not have direct derivatives of the code.
> > According to Jacky, the old code is no longer used.
> >
> > As for Julien concern about Parquet, the design seemed to be inspired by
> > Parquet and ORC.
> > And if needed, we could add reference to Parquet in the code
> documentation.
> > Since Parquet is ASF project, I believe we are in good shape in
> CarbonData
> > goes to ASF.
> >
> > If any other action item is needed please do suggest so we could make
> > correction as part of incubation process.
> >
> >
> > - Henry
> >
> > On Mon, May 23, 2016 at 4:12 PM, Roman Shaposhnik <ro...@shaposhnik.org>
> > wrote:
> >
> >> On Mon, May 23, 2016 at 3:44 PM, Marvin Humphrey <
> marvin@rectangular.com>
> >> wrote:
> >> > On Sun, May 22, 2016 at 10:57 PM, Jean-Baptiste Onofré <
> jb@nanthrax.net>
> >> wrote:
> >> >> Hi Luke,
> >> >>
> >> >> I fully agree with you. The committers are already involved to
> clean-up
> >> the
> >> >> repo (PRs have been created).
> >> >>
> >> >> IMHO, this step is decoupled from the proposal vote itself: the only
> >> >> requirement is to do it for the code donation, after the proposal
> vote.
> >> >
> >> > What will the process for this be? On this thread we have two outside
> >> > authors recognizing their own work, but that's obviously not a
> >> > realistic mechanism for identifying all potentially problematic IP.
> >>
> >> Given that this is a donation from a corporate entity a request for BD
> >> (or similar)
> >> scan results (if they are available) may help. That's how every
> >> corporate-sponsored
> >> donation (at least dozens I've been involved in) does risk mitigation
> >> anyway.
> >>
> >> Thanks,
> >> Roman.
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> >> For additional commands, e-mail: general-help@incubator.apache.org
> >>
> >>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> For additional commands, e-mail: general-help@incubator.apache.org
>
>
Re: [DISCUSS] CarbonData incubation proposal
Posted by Julian Hyde <jh...@apache.org>.
For the record, at the time that I reviewed the github repo, there was
code that was not merely *inspired* by Mondrian code, but *derived*
from Mondrian code. But that code has since been removed, so the issue
is resolved. With the usual precautions of a BD scan on the incoming
IP and ongoing diligence by the PPMC we'll be fine.
Julian
On Mon, May 23, 2016 at 4:19 PM, Henry Saputra <he...@gmail.com> wrote:
> I thought the concern had been addressed?
>
> For Julian concern about Mondrian, the code was inspired by Mondrian but do
> not have direct derivatives of the code.
> According to Jacky, the old code is no longer used.
>
> As for Julien concern about Parquet, the design seemed to be inspired by
> Parquet and ORC.
> And if needed, we could add reference to Parquet in the code documentation.
> Since Parquet is ASF project, I believe we are in good shape in CarbonData
> goes to ASF.
>
> If any other action item is needed please do suggest so we could make
> correction as part of incubation process.
>
>
> - Henry
>
> On Mon, May 23, 2016 at 4:12 PM, Roman Shaposhnik <ro...@shaposhnik.org>
> wrote:
>
>> On Mon, May 23, 2016 at 3:44 PM, Marvin Humphrey <ma...@rectangular.com>
>> wrote:
>> > On Sun, May 22, 2016 at 10:57 PM, Jean-Baptiste Onofré <jb...@nanthrax.net>
>> wrote:
>> >> Hi Luke,
>> >>
>> >> I fully agree with you. The committers are already involved to clean-up
>> the
>> >> repo (PRs have been created).
>> >>
>> >> IMHO, this step is decoupled from the proposal vote itself: the only
>> >> requirement is to do it for the code donation, after the proposal vote.
>> >
>> > What will the process for this be? On this thread we have two outside
>> > authors recognizing their own work, but that's obviously not a
>> > realistic mechanism for identifying all potentially problematic IP.
>>
>> Given that this is a donation from a corporate entity a request for BD
>> (or similar)
>> scan results (if they are available) may help. That's how every
>> corporate-sponsored
>> donation (at least dozens I've been involved in) does risk mitigation
>> anyway.
>>
>> Thanks,
>> Roman.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
>> For additional commands, e-mail: general-help@incubator.apache.org
>>
>>
Re: [DISCUSS] CarbonData incubation proposal
Posted by Julien Le Dem <ju...@gmail.com>.
Yes, I believe references to ASF projects where needed are sufficient.
Julien
> On May 23, 2016, at 16:19, Henry Saputra <he...@gmail.com> wrote:
>
> I thought the concern had been addressed?
>
> For Julian concern about Mondrian, the code was inspired by Mondrian but do
> not have direct derivatives of the code.
> According to Jacky, the old code is no longer used.
>
> As for Julien concern about Parquet, the design seemed to be inspired by
> Parquet and ORC.
> And if needed, we could add reference to Parquet in the code documentation.
> Since Parquet is ASF project, I believe we are in good shape in CarbonData
> goes to ASF.
>
> If any other action item is needed please do suggest so we could make
> correction as part of incubation process.
>
>
> - Henry
>
> On Mon, May 23, 2016 at 4:12 PM, Roman Shaposhnik <ro...@shaposhnik.org>
> wrote:
>
>> On Mon, May 23, 2016 at 3:44 PM, Marvin Humphrey <ma...@rectangular.com>
>> wrote:
>>> On Sun, May 22, 2016 at 10:57 PM, Jean-Baptiste Onofré <jb...@nanthrax.net>
>> wrote:
>>>> Hi Luke,
>>>>
>>>> I fully agree with you. The committers are already involved to clean-up
>> the
>>>> repo (PRs have been created).
>>>>
>>>> IMHO, this step is decoupled from the proposal vote itself: the only
>>>> requirement is to do it for the code donation, after the proposal vote.
>>>
>>> What will the process for this be? On this thread we have two outside
>>> authors recognizing their own work, but that's obviously not a
>>> realistic mechanism for identifying all potentially problematic IP.
>>
>> Given that this is a donation from a corporate entity a request for BD
>> (or similar)
>> scan results (if they are available) may help. That's how every
>> corporate-sponsored
>> donation (at least dozens I've been involved in) does risk mitigation
>> anyway.
>>
>> Thanks,
>> Roman.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
>> For additional commands, e-mail: general-help@incubator.apache.org
>>
>>
Re: [DISCUSS] CarbonData incubation proposal
Posted by Henry Saputra <he...@gmail.com>.
I thought the concern had been addressed?
For Julian's concern about Mondrian, the code was inspired by Mondrian but
does not contain direct derivatives of the code.
According to Jacky, the old code is no longer used.
As for Julien's concern about Parquet, the design seemed to be inspired by
Parquet and ORC.
And if needed, we could add a reference to Parquet in the code documentation.
Since Parquet is an ASF project, I believe we are in good shape if CarbonData
goes to the ASF.
If any other action item is needed, please do suggest it so we can make
corrections as part of the incubation process.
- Henry
On Mon, May 23, 2016 at 4:12 PM, Roman Shaposhnik <ro...@shaposhnik.org>
wrote:
> On Mon, May 23, 2016 at 3:44 PM, Marvin Humphrey <ma...@rectangular.com>
> wrote:
> > On Sun, May 22, 2016 at 10:57 PM, Jean-Baptiste Onofré <jb...@nanthrax.net>
> wrote:
> >> Hi Luke,
> >>
> >> I fully agree with you. The committers are already involved to clean-up
> the
> >> repo (PRs have been created).
> >>
> >> IMHO, this step is decoupled from the proposal vote itself: the only
> >> requirement is to do it for the code donation, after the proposal vote.
> >
> > What will the process for this be? On this thread we have two outside
> > authors recognizing their own work, but that's obviously not a
> > realistic mechanism for identifying all potentially problematic IP.
>
> Given that this is a donation from a corporate entity a request for BD
> (or similar)
> scan results (if they are available) may help. That's how every
> corporate-sponsored
> donation (at least dozens I've been involved in) does risk mitigation
> anyway.
>
> Thanks,
> Roman.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> For additional commands, e-mail: general-help@incubator.apache.org
>
>
Re: [DISCUSS] CarbonData incubation proposal
Posted by Roman Shaposhnik <ro...@shaposhnik.org>.
On Mon, May 23, 2016 at 3:44 PM, Marvin Humphrey <ma...@rectangular.com> wrote:
> On Sun, May 22, 2016 at 10:57 PM, Jean-Baptiste Onofré <jb...@nanthrax.net> wrote:
>> Hi Luke,
>>
>> I fully agree with you. The committers are already involved to clean-up the
>> repo (PRs have been created).
>>
>> IMHO, this step is decoupled from the proposal vote itself: the only
>> requirement is to do it for the code donation, after the proposal vote.
>
> What will the process for this be? On this thread we have two outside
> authors recognizing their own work, but that's obviously not a
> realistic mechanism for identifying all potentially problematic IP.
Given that this is a donation from a corporate entity, a request for BD (or
similar) scan results (if they are available) may help. That's how every
corporate-sponsored donation (at least the dozens I've been involved in) does
risk mitigation anyway.
Thanks,
Roman.
Re: [DISCUSS] CarbonData incubation proposal
Posted by Marvin Humphrey <ma...@rectangular.com>.
On Sun, May 22, 2016 at 10:57 PM, Jean-Baptiste Onofré <jb...@nanthrax.net> wrote:
> Hi Luke,
>
> I fully agree with you. The committers are already involved to clean-up the
> repo (PRs have been created).
>
> IMHO, this step is decoupled from the proposal vote itself: the only
> requirement is to do it for the code donation, after the proposal vote.
What will the process for this be? On this thread we have two outside
authors recognizing their own work, but that's obviously not a
realistic mechanism for identifying all potentially problematic IP.
Marvin Humphrey
Re: [DISCUSS] CarbonData incubation proposal
Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Hi Luke,
I fully agree with you. The committers are already working on cleaning up
the repo (PRs have been created).
IMHO, this step is decoupled from the proposal vote itself: the only
requirement is to do it for the code donation, after the proposal vote.
Regards
JB
On 05/21/2016 08:48 AM, Luke Han wrote:
> Would love to see Huawei finally decided to open source and contribute this
> project to ASF.
>
> As previous discussion, license should be very clear, I think you have a
> lot of work to do:)
>
> Thanks.
>
>
> Best Regards!
> ---------------------
>
> Luke Han
>
> On Thu, May 19, 2016 at 11:46 PM, Jacky Li <13...@qq.com> wrote:
>
>> Hi Julian Hyde,
>>
>> Yes, you are correct, thanks for pointing out this. Actually in early days
>> of CarbonData project, it is inspired by Mondarin. Mondarin is a great OLAP
>> project that we have learned much from.
>>
>> The code you are refering to, "CarbonDef.java, DimensionType.java,
>> LevelType.java" I believe, is used in earlier version of CarbonData but it
>> is no longer used in the currnet version of CarbonData. Actually there are
>> quite a few packages are no longer needed but still present in the repo, so
>> we are planning to clean up the code base soon.
>>
>> Definitely, you are right, we will make sure all source code is under
>> Apache
>> License only.
>>
>> Regards,
>> Jacky Li
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-incubator-general.996316.n3.nabble.com/DISCUSS-CarbonData-incubation-proposal-tp49643p49678.html
>> Sent from the Apache Incubator - General mailing list archive at
>> Nabble.com.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
>> For additional commands, e-mail: general-help@incubator.apache.org
>>
>>
>
--
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com
Re: [DISCUSS] CarbonData incubation proposal
Posted by Luke Han <lu...@gmail.com>.
Great to see that Huawei has finally decided to open source this project and
contribute it to the ASF.
As discussed previously, the license should be very clear; I think you have a
lot of work to do :)
Thanks.
Best Regards!
---------------------
Luke Han
On Thu, May 19, 2016 at 11:46 PM, Jacky Li <13...@qq.com> wrote:
> Hi Julian Hyde,
>
> Yes, you are correct, thanks for pointing out this. Actually in early days
> of CarbonData project, it is inspired by Mondarin. Mondarin is a great OLAP
> project that we have learned much from.
>
> The code you are refering to, "CarbonDef.java, DimensionType.java,
> LevelType.java" I believe, is used in earlier version of CarbonData but it
> is no longer used in the currnet version of CarbonData. Actually there are
> quite a few packages are no longer needed but still present in the repo, so
> we are planning to clean up the code base soon.
>
> Definitely, you are right, we will make sure all source code is under
> Apache
> License only.
>
> Regards,
> Jacky Li
>
>
>
>
> --
> View this message in context:
> http://apache-incubator-general.996316.n3.nabble.com/DISCUSS-CarbonData-incubation-proposal-tp49643p49678.html
> Sent from the Apache Incubator - General mailing list archive at
> Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> For additional commands, e-mail: general-help@incubator.apache.org
>
>
Re: [DISCUSS] CarbonData incubation proposal
Posted by Jacky Li <13...@qq.com>.
Hi Julian Hyde,
Yes, you are correct, thanks for pointing this out. Actually, in the early
days of the CarbonData project, it was inspired by Mondrian. Mondrian is a
great OLAP project that we have learned much from.
The code you are referring to, "CarbonDef.java, DimensionType.java,
LevelType.java" I believe, was used in an earlier version of CarbonData but
is no longer used in the current version. Actually, quite a few packages are
no longer needed but are still present in the repo, so we are planning to
clean up the code base soon.
Definitely, you are right, we will make sure all source code is under the
Apache License only.
Regards,
Jacky Li
--
View this message in context: http://apache-incubator-general.996316.n3.nabble.com/DISCUSS-CarbonData-incubation-proposal-tp49643p49678.html
Sent from the Apache Incubator - General mailing list archive at Nabble.com.
Re: [DISCUSS] CarbonData incubation proposal
Posted by Jacky Li <13...@qq.com>.
Hi Julien Le Dem,
I am one of the developers in the CarbonData project. Thanks for pointing out
this issue. Actually, we are in a phase of rapid development of this new
file format and are still missing proper documentation.
CarbonData's goal is a columnar file format that can be used to satisfy
various query scenarios, so by design it has some unique features like a
built-in multi-level index, operable encoded data, column groups, etc. (Liang
pointed out some of them in his last post). But since it is a columnar
file format, it shares some common terminology with Apache Parquet and
Apache ORC, which I think is inevitable. To keep confusion to a minimum in
the future, we will improve our documentation later on.
Do you have any other suggestions?
For the file format specification, I have updated the wiki and the thrift
definition to reflect the design of CarbonData. Please check whether there
are still issues.
Regards,
Jacky Li
--
View this message in context: http://apache-incubator-general.996316.n3.nabble.com/DISCUSS-CarbonData-incubation-proposal-tp49643p49676.html
Sent from the Apache Incubator - General mailing list archive at Nabble.com.
Re: [DISCUSS] CarbonData incubation proposal
Posted by Julien Le Dem <ju...@dremio.com>.
Similar comment regarding the file format specification. It looks like this
is derived from the Parquet file format.
Which is fine as long as we follow the terms of the license:
https://github.com/apache/parquet-format/blob/master/LICENSE#L101
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
For example CarbonData:
https://github.com/HuaweiBigData/carbondata/wiki/CarbonData-File-Structure-and-Format
https://github.com/HuaweiBigData/carbondata/blob/master/format/src/main/thrift/carbondata.thrift
Parquet:
https://github.com/apache/parquet-format/blob/master/README.md
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift
On Thu, May 19, 2016 at 3:11 PM, Julian Hyde <jh...@apache.org> wrote:
> I see code derived from Mondrian in the org.carbondata.core.carbon
> package[1] (I’m familiar with Mondrian’s code structure because I wrote
> it). Mondrian was originally EPL and as such cannot be re-licensed under
> ASL. Everything is probably fine, but as part of incubation, we will need
> to make sure that this and other code has a clear progeny.
>
> Julian
>
> [1]
> https://github.com/HuaweiBigData/carbondata/tree/master/core/src/main/java/org/carbondata/core/carbon
> <
> https://github.com/HuaweiBigData/carbondata/tree/master/core/src/main/java/org/carbondata/core/carbon
> >
>
> > On May 19, 2016, at 10:04 AM, Liang Chen <ch...@huawei.com>
> wrote:
> >
> > Hi Lars
> >
> > Thanks for you participated in discussion.
> >
> > Based on the below requirements, we investigated existing file formats in
> > the Hadoop eco-system, but we could not find a suitable solution that
> > satisfying requirements all at the same time, so we start designing
> > CarbonData.
> > R1.Support big scan & only fetch a few columns
> > R2.Support primary key lookup response in sub-second.
> > R3.Support interactive OLAP-style query over big data which involve many
> > filters in a query, this type of workload should response in seconds.
> > R4.Support fast individual record extraction which fetch all columns of
> the
> > record.
> > R5.Support HDFS so that customer can leverage existing Hadoop cluster.
> >
> > When we investigate Parquet/ORC, it seems they work very well for R1 and
> R5,
> > but they does not meet for R2,R3,R4. So we designed CarbonData mainly to
> add
> > following differentiating features:
> >
> > 1.Stores data along with index: it can significantly accelerate query
> > performance and reduces the I/O scans and CPU resources, where there are
> > filters in the query. CarbonData index is consisted of multiple level, a
> > processing framework can leverage this index to reduce the task it needs
> to
> > schedule and process, and it can also do skip scan in more finer grain
> unit
> > (called blocklet) in task side scanning instead of scanning the whole
> file.
> >
> > 2.Operable encoded data :Through supporting efficient compression and
> global
> > encoding schemes, can query on compressed/encoded data, the data can be
> > converted just before returning the results to the users, which is "late
> > materialized".
> >
> > 3.Column group: Allow multiple columns form a column group to store as
> row
> > format, thus cost of column reconstructing is reduced.
> >
> > 4.Supports for various use cases with one single Data format : like
> > interactive OLAP-style query, Sequential Access (big scan), Random Access
> > (narrow scan).
> >
> > Please kindly let me know if the above info answer your questions.
> >
> > Regards
> > Liang
> >
> >
> >
> >
> >
> >
> > --
> > View this message in context:
> http://apache-incubator-general.996316.n3.nabble.com/DISCUSS-CarbonData-incubation-proposal-tp49643p49652.html
> > Sent from the Apache Incubator - General mailing list archive at
> Nabble.com.
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> > For additional commands, e-mail: general-help@incubator.apache.org
> >
>
>
--
Julien
Re: [DISCUSS] CarbonData incubation proposal
Posted by Julian Hyde <jh...@apache.org>.
I see code derived from Mondrian in the org.carbondata.core.carbon package[1] (I’m familiar with Mondrian’s code structure because I wrote it). Mondrian was originally EPL and as such cannot be re-licensed under ASL. Everything is probably fine, but as part of incubation, we will need to make sure that this and other code has a clear provenance.
Julian
[1] https://github.com/HuaweiBigData/carbondata/tree/master/core/src/main/java/org/carbondata/core/carbon
> On May 19, 2016, at 10:04 AM, Liang Chen <ch...@huawei.com> wrote:
>
> Hi Lars
>
> Thanks for you participated in discussion.
>
> Based on the below requirements, we investigated existing file formats in
> the Hadoop eco-system, but we could not find a suitable solution that
> satisfying requirements all at the same time, so we start designing
> CarbonData.
> R1.Support big scan & only fetch a few columns
> R2.Support primary key lookup response in sub-second.
> R3.Support interactive OLAP-style query over big data which involve many
> filters in a query, this type of workload should response in seconds.
> R4.Support fast individual record extraction which fetch all columns of the
> record.
> R5.Support HDFS so that customer can leverage existing Hadoop cluster.
>
> When we investigate Parquet/ORC, it seems they work very well for R1 and R5,
> but they does not meet for R2,R3,R4. So we designed CarbonData mainly to add
> following differentiating features:
>
> 1.Stores data along with index: it can significantly accelerate query
> performance and reduces the I/O scans and CPU resources, where there are
> filters in the query. CarbonData index is consisted of multiple level, a
> processing framework can leverage this index to reduce the task it needs to
> schedule and process, and it can also do skip scan in more finer grain unit
> (called blocklet) in task side scanning instead of scanning the whole file.
>
> 2.Operable encoded data :Through supporting efficient compression and global
> encoding schemes, can query on compressed/encoded data, the data can be
> converted just before returning the results to the users, which is "late
> materialized".
>
> 3.Column group: Allow multiple columns form a column group to store as row
> format, thus cost of column reconstructing is reduced.
>
> 4.Supports for various use cases with one single Data format : like
> interactive OLAP-style query, Sequential Access (big scan), Random Access
> (narrow scan).
>
> Please kindly let me know if the above info answer your questions.
>
> Regards
> Liang
>
>
>
>
>
>
> --
> View this message in context: http://apache-incubator-general.996316.n3.nabble.com/DISCUSS-CarbonData-incubation-proposal-tp49643p49652.html
> Sent from the Apache Incubator - General mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> For additional commands, e-mail: general-help@incubator.apache.org
>
Re: [DISCUSS] CarbonData incubation proposal
Posted by Liang Chen <ch...@huawei.com>.
Hi Lars,
Thanks for participating in the discussion.
Based on the requirements below, we investigated existing file formats in
the Hadoop ecosystem, but we could not find a suitable solution that
satisfies all of the requirements at the same time, so we started designing
CarbonData.
R1. Support big scans that fetch only a few columns.
R2. Support primary key lookups with sub-second response.
R3. Support interactive OLAP-style queries over big data that involve many
filters in a query; this type of workload should respond in seconds.
R4. Support fast individual record extraction that fetches all columns of the
record.
R5. Support HDFS so that customers can leverage an existing Hadoop cluster.
When we investigated Parquet/ORC, they seemed to work very well for R1 and R5,
but they did not meet R2, R3, and R4. So we designed CarbonData mainly to add
the following differentiating features:
1. Stores data along with an index: this can significantly accelerate query
performance and reduce I/O scans and CPU usage when there are
filters in the query. The CarbonData index consists of multiple levels: a
processing framework can leverage this index to reduce the number of tasks it
needs to schedule and process, and it can also do a skip scan at a
finer-grained unit (called a blocklet) in task-side scanning instead of
scanning the whole file.
2. Operable encoded data: by supporting efficient compression and global
encoding schemes, queries can run on compressed/encoded data, and the data is
converted only just before returning the results to the user, which is "late
materialization".
3. Column group: allows multiple columns to form a column group that is stored
in row format, which reduces the cost of reconstructing rows from columns.
4. Supports various use cases with a single data format: interactive
OLAP-style queries, sequential access (big scan), and random access
(narrow scan).
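As a rough illustration of features 1 and 2 above, here is a minimal Python
sketch. It is not CarbonData's actual code or file layout. All names in it
(Blocklet, build_dictionary, build_blocklets, scan_equals) are hypothetical,
and it assumes a sorted global dictionary plus per-blocklet min/max
statistics as the index, which is only one simple way such an index could
work.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Blocklet:
    """Hypothetical scan unit: dictionary codes plus min/max statistics."""
    codes: List[int]
    min_code: int
    max_code: int


def build_dictionary(values: List[str]) -> Tuple[List[str], Dict[str, int]]:
    """Global dictionary (assumption): sorted distinct values -> integer codes."""
    distinct = sorted(set(values))
    return distinct, {value: code for code, value in enumerate(distinct)}


def build_blocklets(codes: List[int], size: int) -> List[Blocklet]:
    """Split the encoded column into fixed-size blocklets and record min/max."""
    blocklets = []
    for start in range(0, len(codes), size):
        chunk = codes[start:start + size]
        blocklets.append(Blocklet(chunk, min(chunk), max(chunk)))
    return blocklets


def scan_equals(blocklets: List[Blocklet], dictionary: List[str],
                encode: Dict[str, int], wanted: str) -> Tuple[List[str], int]:
    """Return (decoded matches, number of blocklets actually scanned)."""
    code = encode.get(wanted)
    if code is None:
        return [], 0  # value is not in the dictionary, so nothing can match
    hits: List[int] = []
    scanned = 0
    for blocklet in blocklets:
        # Skip scan: min/max statistics rule out this blocklet without reading it.
        if code < blocklet.min_code or code > blocklet.max_code:
            continue
        scanned += 1
        # The filter is evaluated on the encoded integers, not on the strings.
        hits.extend(c for c in blocklet.codes if c == code)
    # Late materialization: decode only the rows that survived the filter.
    return [dictionary[c] for c in hits], scanned


cities = ["berlin", "berlin", "cairo", "cairo", "lima", "lima", "oslo", "oslo"]
dictionary, encode = build_dictionary(cities)
codes = [encode[city] for city in cities]
blocklets = build_blocklets(codes, size=2)
rows, scanned = scan_equals(blocklets, dictionary, encode, "lima")
print(rows, scanned)  # prints ['lima', 'lima'] 1
```

In this toy data the column is already clustered by value, so three of the
four blocklets are skipped; with unclustered data, min/max statistics alone
would prune far less.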
Please kindly let me know if the above info answers your questions.
Regards
Liang
--
View this message in context: http://apache-incubator-general.996316.n3.nabble.com/DISCUSS-CarbonData-incubation-proposal-tp49643p49652.html
Sent from the Apache Incubator - General mailing list archive at Nabble.com.
Re: [DISCUSS] CarbonData incubation proposal
Posted by Lars Francke <la...@gmail.com>.
Hi Jean-Baptiste,
can you - or anyone else for that matter - comment on how it relates to
Parquet and ORC?
The Github page says "The CaronData file format provides a highly efficient
way to store structured data,it was designed to overcome limitations of the
other Hadoop file formats." so it'd be very interesting to know which
limitations were encountered and how these are fixed here.
Good luck with the proposal.
Thank you!
Lars
RE: [DISCUSS] CarbonData incubation proposal
Posted by "Zheng, Kai" <ka...@intel.com>.
This sounds good to have, as a nice complement to the existing data formats. Thanks for the proposal!
Non-binding +1.
Regards,
Kai
Re: [DISCUSS] CarbonData incubation proposal
Posted by Liang Chen <ch...@huawei.com>.
Hi Nick
Thanks for your questions.
The initial committers are contributors who are fully involved in the
CarbonData project. Over time, based on contribution and meritocracy,
we will add more contributors to the committer list.
1. Some of the committers are actually located in the US (in labs), and our
internal communication is in English for remote work.
2. Current committers are spread across different locations (from the US to
China and India) and work in different time zones.
3. Yes, we have contributed to other Apache projects, such as Apache
Spark, Apache Flink, and Apache Hadoop. Some contributions are listed below:
https://github.com/apache/spark/commits?author=jihongMA
https://github.com/apache/spark/commits?author=jackylk
https://github.com/apache/flink/commits?author=chenliang613
Regards
Liang
Re: [DISCUSS] CarbonData incubation proposal
Posted by Nick Burch <ni...@apache.org>.
On Thu, 19 May 2016, Jean-Baptiste Onofré wrote:
> The proposal is included below and also available on the wiki:
>
> https://wiki.apache.org/incubator/CarbonDataProposal
Comparing the Initial Committers list with the Github contributors list,
there look to be a few people currently quite involved in the project not
on the initial list. Is there a reason for that? Is there a plan to try to
bring them over?
Thinking about the challenges that might be faced, I've a few non-standard
questions too:
* How experienced are the team at communicating in written English?
(User support is allowed in other languages, but generally development
needs to be in English)
* How experienced is the team at working in a distributed / remote
way?
(Before graduation, the project will need a distributed and diverse
set of contributors working in non-realtime, if things are only ever
done in Shenzhen today, then they'll need help and support to make the
change)
* Have any of the current committers contributed to other Apache
projects already?
(If a few people can / do contribute to other Apache projects, they can
learn some of the Apache Way from those communities, reducing the
amount of help they'll need from Mentors on learning it)
Thanks
Nick